The ANN Book

R. M. Hristev

Edition 1
(supersedes the draft (edition 0) named "Artificial Neural Networks")

Copyright 1998 by R. M. Hristev

This book is released under the GNU Public License, ver. 2 (the copyleft license). Basically, "...the GNU General Public License is intended to guarantee your freedom to share and change free software, to make sure the software is free for all its users...". It also means that, as it is free of charge, there is no warranty.
Preface
➧ About This Book
In recent years artificial neural networks (ANN) have emerged as a mature and viable framework with many applications in various areas. ANN are mostly applicable wherever some hard to define (exactly) patterns have to be dealt with. "Patterns" are taken here in the broadest sense: applications and models have been developed for everything from speech recognition to (stock) market time series prediction, with almost anything in between, and new ones appear at a very fast pace.
However, to be able to (correctly) apply this technology, it is not enough to just throw some data at it randomly and wait to see what happens next. At least some understanding of the underlying mechanism is required in order to make efficient use of it.
Please note that this book is released in electronic format (LaTeX) and under the GNU Public Licence (ver. 2), which allows for free copying and redistribution as long as you do not restrict the same rights for others. See the licence terms included in the file "LICENCE.txt". A freeware version of LaTeX is available for almost any type of computer/OS combination and is practically guaranteed to produce the same high quality typesetting output. On the Internet, the URL where you can find the latest edition/version of this book is "ftp://ftp.funet.fi/pub/sci/neural/books/". Note that you may find two files there: one being this book in PostScript format, the other containing the source files; the source files contain the LaTeX files as well as some additional programs, e.g. a program showing an animated learning process in a Kohonen network. The programs used in this book were developed mostly under Scilab, available under a very generous licence (basically: free and with source code included) from "http://www-rocq.inria.fr/scilab/". Scilab is very similar to Octave and Matlab. Octave is also released under the GNU licence, so it is free as well.
This book attempts to cover some of the basics of ANN development: theories, principles and ANN architectures which have found their way into the mainstream.
The first part covers some of the most widely used ANN architectures. New ones or variants appear at a fast rate, so it is not possible to cover them all, but these are among the few with wide applications. This part would be of use as an introduction and for those who have to implement them but do not have to worry about their applications (e.g. programmers required to implement a particular ANN engine for some application; note, however, that some important algorithmic improvements are explained in the second part).
The second part takes a deeper look at the fundamentals, as well as establishing the most important theoretical results. It also describes some algorithmic optimizations/variants for ANN simulators which require a more advanced mathematical apparatus. It is important for those who want to develop applications using ANN. As ANN have been revealed to be statistical in nature, some basic knowledge of statistical methods is required; an appendix containing a small introduction to statistics (very bare, but essential for those who did not study statistics) has been provided.
The third part is reserved for topics which are very recent developments and are usually open-ended.

For each section (chapter, sub-section, etc.) there is a special footnote designed in particular to give some bibliographic information. These footnotes are marked with the section number (e.g. 2.3.1 for the sub-section numbered 2.3.1) or with a number and ".*" for chapters (e.g. 3.* for the third chapter). To avoid an ugly appearance they are hidden from the section's title. Follow them for further references.

The appendices also contain some information which has not been deemed appropriate for the main text (e.g. some useful mathematical results).
The next section describes the notational system used in this book.
➧ Mathematical Notations and Conventions
❖ marginal note

The following notational system will be used (hopefully in a consistent manner) throughout the whole book. There will be two kinds of notations: one which is described here and most of the time will not be explained in the text again; the other ones are local (to the chapter, section, etc.) and will also appear in the marginal notes at the place where they are defined/used first, marked with the symbol ❖ like the one appearing here.
So, when you encounter a symbol and you don't know what it is: first look in this section; if it is not here, follow the marginal notes upstream from the point where you encountered it till you find it, and there should be its definition (you should not have to go beyond the current chapter).
Proofs are typeset in a smaller (8 pt.) font size and refer to previous formulas when not explicitly specified. The reader may skip them; however, following them will enhance one's skills in the mathematical methods used in this field.
Do not worry about (fully) understanding all notations defined here right now. Return here when the text sends you (automatically) back.
***
ANN involves heavy manipulations of vectors and matrices. A vector will often be represented by a column matrix:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}$$

and in text will often be represented by its transpose $\mathbf{x}^T = \begin{pmatrix} x_1 & \cdots & x_N \end{pmatrix}$ (for aesthetic reasons and readability). Vectors will be represented by lowercase bold letters.
A scalar product between two vectors may be represented by a product between the correspondent matrices:

$$\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i = \mathbf{x}^T \mathbf{y}$$
The other matrices will be represented by uppercase letters. The inverse of a matrix will be marked by $(\cdot)^{-1}$.
There is an important distinction between scalar and vectorial functions. When a scalar function is applied to a vector or matrix it means in fact that it is applied to each element in turn, i.e.

$$f(\mathbf{x}) = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix}\;,\qquad f: \mathbb{R} \to \mathbb{R}$$

is a scalar function, and its application to a vector is just a convenient notation, while:

$$\mathbf{g}(\mathbf{x}) = \begin{pmatrix} g_1(\mathbf{x}) \\ \vdots \\ g_K(\mathbf{x}) \end{pmatrix}\;,\qquad \mathbf{g}: \mathbb{R}^N \to \mathbb{R}^K$$

is a vectorial function, and generally K ≠ N. Note that bold letters are used for vectorial functions.
One operator which will be used is ":". The A(i, :) notation will represent row i of matrix A, while A(:, j) will stand for column j of the same matrix.
Another operation used will be the Hadamard product, i.e. the element-wise product between matrices (or vectors), which will be marked with ⊙. The terms and the result have to have the same dimensions (number of rows and columns), and the elements of the result are the products of the corresponding elements of the terms, i.e.

$$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kn} \end{pmatrix},\; B = \begin{pmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & \ddots & \vdots \\ b_{k1} & \cdots & b_{kn} \end{pmatrix} \;\Rightarrow\; C = A \odot B = \begin{pmatrix} a_{11}b_{11} & \cdots & a_{1n}b_{1n} \\ \vdots & \ddots & \vdots \\ a_{k1}b_{k1} & \cdots & a_{kn}b_{kn} \end{pmatrix}$$
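As a small worked example (with arbitrarily chosen numbers):

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \odot \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \cdot 5 & 2 \cdot 6 \\ 3 \cdot 7 & 4 \cdot 8 \end{pmatrix} = \begin{pmatrix} 5 & 12 \\ 21 & 32 \end{pmatrix}$$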
ANN
    acronym for Artificial Neural Network(s).
⊙
    Hadamard product of matrices (see above for the definition).
$(\cdot)^{\odot n}$
    a convenient notation for $(\cdot) \odot (\cdot) \odot \cdots \odot (\cdot)$, n times.
:
    matrix "scissors" operator: A(i:j, k:ℓ) selects a submatrix from matrix A, made from rows i to j and columns k to ℓ. A(i, :) represents row i while A(:, k) represents column k.
$(\cdot)^T$
    transpose of matrix $(\cdot)$.
$(\cdot)^C$
    complement of vector $(\cdot)$; it involves swapping 0 ↔ 1 for binary vectors and −1 ↔ +1 for bipolar vectors.
$|\cdot|$
    module of $(\cdot)$ (absolute value); when applied to a matrix it is an element-wise operation (by contrast to the norm operation).
$\|\cdot\|$
    norm of a vector or matrix; the actual definition may differ depending on the metric used.
$\langle\cdot\rangle$
    mean value of a variable.
$E\{f|g\}$
    expectation of event f (mean value), given event g. As f and g are usually functions, $E\{f|g\}$ is then a functional.
$V\{f\}$
    variance of f. As f is usually a function, $V\{f\}$ is then a functional.
sign(x)
    the sign function, defined as:
    $$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$
    In the case of a matrix, it applies to each element individually.
$\widetilde{1}$
    a matrix having all elements equal to 1; its dimensions will always be such that the mathematical operations in which it is involved are correct.
$\hat{1}$
    a (column) vector having all elements equal to 1; its dimensions will always be such that the mathematical operations in which it is involved are correct.
$\widetilde{0}$
    a matrix having all elements equal to 0; its dimensions will always be such that the mathematical operations in which it is involved are correct.
$\hat{0}$
    a (column) vector having all elements equal to 0; its dimensions will always be such that the mathematical operations in which it is involved are correct.
I
    the unit square matrix, assumed always to have the correct dimensions for the operations in which it is involved.
$x_i$
    component i of the input vector.
x
    the input vector: $\mathbf{x}^T = \begin{pmatrix} x_1 & \cdots & x_N \end{pmatrix}$.
$y_j$
    output of the output neuron j.
y
    the output vector of the output layer: $\mathbf{y}^T = \begin{pmatrix} y_1 & \cdots & y_K \end{pmatrix}$.
$z_k$
    output of a hidden neuron k.
z
    the output vector of a hidden layer: $\mathbf{z}^T = \begin{pmatrix} z_1 & \cdots & z_M \end{pmatrix}$.
$t_k$
    component k of the target pattern.
t
    the target vector: the desired output corresponding to input x.
$w_i$, $w_{ji}$
    $w_i$ is the weight associated with the i-th input of a neuron; $w_{ji}$ is the weight associated with the connection to neuron j, from neuron i.
W
    the weight matrix:
    $$W = \begin{pmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{K1} & \cdots & w_{KN} \end{pmatrix}$$
    note that all weights associated with a particular neuron j (j = 1,…,K) are on the same row.
$a_j$
    total input to a neuron j: the weighted sum of its inputs, e.g. $a_j = \sum_i w_{ji} x_i$ for a neuron receiving input x, $w_{ji}$ being the weights.
a
    the vector containing the total inputs $a_j$ for all neurons in the same layer; usually $\mathbf{a} = W\mathbf{z}$, where z is the output of the previous layer.
f
    activation function of the neuron; the neuron output is $f(a_j)$ and the output of the current layer is $\mathbf{z} = f(\mathbf{a})$.
f′
    the derivative of the activation function f.
E
    the error function.
$C_k$
    class k.
$P(C_k)$
    prior probability of a pattern x belonging to class k.
$P(X_\ell)$
    distribution probability of a pattern x being in pattern subspace $X_\ell$.
$P(C_k, X_\ell)$
    joint probability of a pattern x belonging to class k and pattern subspace $X_\ell$.
$P(X_\ell|C_k)$
    class-conditional probability of a pattern x belonging to class k to be in pattern subspace $X_\ell$.
$P(C_k|X_\ell)$
    posterior probability of a pattern x belonging to class k when it is from subspace $X_\ell$.
p
    probability density.
Note also that, wherever possible, index i will be reserved for input components, j for hidden neurons, k for output neurons, and p for training patterns, with P as the total number of (learning) patterns.
Ryurick M. Hristev
Contents

Preface
    About This Book
    Mathematical Notations and Conventions

I  ANN Architectures

1  Basic Neuronal Dynamics
    1.1  Simple Neurons and Networks
    1.2  Neurons as Functions
    1.3  Common Signal Functions

2  The Backpropagation Network
    2.1  Network Structure
    2.2  Network Dynamics
        2.2.1  Neuron Output Function
        2.2.2  Network Running Function
        2.2.3  Network Learning Function
        2.2.4  Initialization and Stop
    2.3  The Algorithm
    2.4  Bias
    2.5  Algorithm Enhancements
        2.5.1  Momentum
        2.5.2  Adaptive Backpropagation
        2.5.3  SuperSAB
    2.6  Applications
        2.6.1  Identity Mapping Network
        2.6.2  The Encoder

3  The SOM/Kohonen Network
    3.1  Network Structure
    3.2  Types of Neuronal Learning
        3.2.1  The Learning Process
        3.2.2  The Trivial Equation
        3.2.3  The Simple Equation
        3.2.4  The Riccati Equation
        3.2.5  More General Equations
    3.3  Network Dynamics
        3.3.1  Network Running Function
        3.3.2  Network Learning Function
        3.3.3  Initialization and Stop Condition
        3.3.4  Remarks
    3.4  The Algorithm
    3.5  Applications
        3.5.1  The Trivial Model with Forgetting Function
        3.5.2  Square Mapping

4  The BAM/Hopfield Memory
    4.1  Associative Memory
    4.2  The BAM Architecture
    4.3  BAM Dynamics
        4.3.1  Network Running
        4.3.2  The BAM Energy Function
    4.4  The BAM Algorithm
    4.5  The Hopfield Memory
        4.5.1  The Discrete Hopfield Memory
        4.5.2  The Continuous Hopfield Memory
    4.6  Applications
        4.6.1  The Traveling Salesperson Problem

5  The Counterpropagation Network
    5.1  The CPN Architecture
        5.1.1  The Input Layer
        5.1.2  The Hidden Layer
        5.1.3  The Output Layer
    5.2  CPN Dynamics
        5.2.1  Network Running
        5.2.2  Network Learning
    5.3  The Algorithm
    5.4  Applications
        5.4.1  Letter Classification

6  Adaptive Resonance Theory (ART)
    6.1  The ART1 Architecture
    6.2  ART1 Dynamics
        6.2.1  The F1 Layer
        6.2.2  The F2 Layer
        6.2.3  Learning on F1: The W Weights
        6.2.4  Learning on F2: The W′ Weights
        6.2.5  Subpatterns
        6.2.6  The Reset Unit
    6.3  The ART1 Algorithm
    6.4  The ART2 Architecture
    6.5  ART2 Dynamics
        6.5.1  The F1 Layer
        6.5.2  The F2 Layer
        6.5.3  The Reset Layer
        6.5.4  Learning and Initialization
    6.6  The ART2 Algorithm

II  Basic Principles

7  Pattern Recognition
    7.1  Patterns: The Statistical Approach
        7.1.1  Patterns and Classification
        7.1.2  Feature Extraction
        7.1.3  Model Complexity
        7.1.4  Classification: Making Decisions and Minimizing Risk
    7.2  Likelihood Function
        7.2.1  The Discriminant Functions
        7.2.2  Likelihood Function and Maximum Likelihood Procedure
    7.3  Statistical Models

8  Single Layer Neural Networks
    8.1  Linear Separability
        8.1.1  Discriminant Functions
        8.1.2  Neuronal Memory Capacity
        8.1.3  Logistic Discrimination
        8.1.4  Binary Pattern Vectors
        8.1.5  Generalized Linear Discriminants
    8.2  The Least Squares Technique
        8.2.1  The Error Function
        8.2.2  The Pseudo-inverse Solution
        8.2.3  The Gradient Descent Solution
    8.3  The Perceptron
        8.3.1  The Error Function
        8.3.2  The Learning Procedure
        8.3.3  Convergence of Learning
    8.4  Fisher Linear Discriminant
        8.4.1  Two Classes Case
        8.4.2  Connections With The Least Squares Technique
        8.4.3  Multiple Classes Case

9  Multi Layer Neural Networks
    9.1  Feed-Forward Networks
    9.2  Threshold Neurons
        9.2.1  Binary Vectors
        9.2.2  Continuous Vectors
    9.3  Sigmoidal Neurons
        9.3.1  Three Layer Networks
        9.3.2  Two Layer Networks
    9.4  Weight-Space Symmetry
    9.5  Higher-Order Neuronal Networks
    9.6  Backpropagation Algorithm
        9.6.1  Error Backpropagation
        9.6.2  Application: Sigmoidal Neurons and Sum-of-squares Error
    9.7  Jacobian Matrix
    9.8  Hessian Tensor
        9.8.1  Diagonal Approximation
        9.8.2  Outer Product Approximation
        9.8.3  Inverse Hessian
        9.8.4  Finite Differences
        9.8.5  Exact Hessian
        9.8.6  Multiplication with Hessian

10  Radial Basis Function Networks
    10.1  Exact Interpolation
    10.2  Radial Basis Function Networks
    10.3  Relation to Other Theories
        10.3.1  Relation to Regularization Theory
        10.3.2  Relation to Interpolation Theory
        10.3.3  Relation to Kernel Based Method
    10.4  Classification
    10.5  Network Learning
        10.5.1  Radial Basis Functions
        10.5.2  Output Layer Weights

11  Error Functions
    11.1  Generalities
    11.2  Sum-of-Squares Error
        11.2.1  Linear Output Units
        11.2.2  Linear Sum Rules
        11.2.3  Significance of Network Output
        11.2.4  Outer Product Approximation of Hessian
    11.3  Minkowski Error
    11.4  Input-dependent Variance
    11.5  Modeling Conditional Distributions
    11.6  Classification using Sum-of-Squares
        11.6.1  Hidden Neurons
        11.6.2  Weighted Sum-of-Squares
        11.6.3  Loss Matrix
    11.7  Cross Entropy
        11.7.1  Two Classes Case
        11.7.2  Sigmoidal Activation Functions
        11.7.3  Cross-Entropy Properties
        11.7.4  Multiple Independent Features
        11.7.5  Multiple Classes Case
    11.8  Entropy
    11.9  Outputs as Probabilities

12  Parameter Optimization
    12.1  Error Surfaces
    12.2  Local Quadratic Approximation
    12.3  Initialization and Termination of Learning
    12.4  Gradient Descent
        12.4.1  Learning Parameter and Convergence
        12.4.2  Momentum
        12.4.3  Other Gradient Descent Improvement Techniques
    12.5  Line Search
    12.6  Conjugate Gradients
        12.6.1  Conjugate Search Directions
        12.6.2  Quadratic Error Function
        12.6.3  The Algorithm
        12.6.4  Scaled Conjugated Gradients
    12.7  Newton's Method
    12.8  Levenberg-Marquardt Algorithm

13  Feature Extraction
    13.1  Pre/Post-processing
    13.2  Input Normalization
    13.3  Missing Data
    13.4  Time Series Prediction
    13.5  Feature Selection
    13.6  Dimensionality Reduction
        13.6.1  Principal Component Analysis
        13.6.2  Non-linear Dimensionality Reduction Through ANN
    13.7  Invariance
        13.7.1  The Tangent Prop Method
        13.7.2  Preprocessing
        13.7.3  Shared Weights
        13.7.4  Higher-order ANNs

14  Learning Optimization
    14.1  The Bias-Variance Tradeoff
    14.2  Regularization
        14.2.1  Weight Decay
        14.2.2  Linear Transformation And Weight Decay
        14.2.3  Early Stopping
        14.2.4  Curvature Smoothing
        14.2.5  Choosing the Weight Decay Hyperparameter
    14.3  Adding Noise
    14.4  Soft Weight Sharing
    14.5  Growing And Pruning Methods
        14.5.1  Cascade Correlation
        14.5.2  Pruning Techniques
        14.5.3  Neuron Pruning
    14.6  Committees of Networks
    14.7  Mixture Of Experts
    14.8  Other Training Techniques
        14.8.1  Cross-validation
        14.8.2  Stacked Generalization
        14.8.3  Complexity Criteria
        14.8.4  Model For Mixed Discrete And Continuous Data

15  Bayesian Techniques
    15.1  Bayesian Learning
        15.1.1  Weight Distribution
        15.1.2  Gaussian Prior Weight Distribution
        15.1.3  Application: Simple Classifier
        15.1.4  Gaussian Noise Model
        15.1.5  Gaussian Posterior Weight Distribution
        15.1.6  Consistent Prior Weight Distribution
        15.1.7  Approximation Of Weight Distribution
    15.2  Network Outputs Distribution
        15.2.1  Generalized Linear Networks
    15.3  Classification
    15.4  The Evidence Approximation For α And β
    15.5  Integration Over α And β
    15.6  Model Comparison
    15.7  Committee Of Networks
    15.8  Monte Carlo Integration
    15.9  Minimum Description Length
    15.10  Performance Of Models
        15.10.1  Risk Averaging

16  Tree Based Classifiers
    16.1  Tree Classifiers
    16.2  Splitting
        16.2.1  Impurity Based Method
        16.2.2  Deviance Based Method
    16.3  Pruning
    16.4  Missing Data

17  Belief Networks
    17.1  Graphs
        17.1.1  Markov Properties
        17.1.2  Markov Trees
        17.1.3  Decomposable Trees
    17.2  Causal Networks
    17.3  The Boltzmann Machine

III  Advanced Topics

18  Matrix Operations on ANN
    18.1  New Matrix Operations
    18.2  Algorithms
        18.2.1  Backpropagation
        18.2.2  SOM/Kohonen Networks
        18.2.3  BAM/Hopfield Networks
    18.3  Conclusions

A  Mathematical Sidelines
    A.1  Distances
        A.1.1  Euclidean Distance
        A.1.2  Hamming Distance
    A.2  Generalized Spherical Coordinates
    A.3  Properties of Symmetric Matrices
        A.3.1  Eigenvectors and Eigenvalues
        A.3.2  Rotation
        A.3.3  Quadratic Forms
    A.4  The Gaussian Integrals
        A.4.1  The Unidimensional Case
        A.4.2  The Multidimensional Case
        A.4.3  The Multidimensional Gaussian Integral with a Linear Term
    A.5  The Euler Functions
        A.5.1  The Euler Γ Function
        A.5.2  The Sphere Volume in the n-dimensional Space
    A.6  The Lagrange Multipliers
    A.7  Useful Mathematical Equations
        A.7.1  Combinatorics
        A.7.2  Jensen's Inequality
        A.7.3  The Stirling Formula
    A.8  Calculus of Variations
    A.9  Principal Components

B  Statistical Sidelines
    B.1  Probabilities
        B.1.1  Probabilities and Bayes Theorem
        B.1.2  Probability Density, Expectation and Variance
    B.2  Modeling the Density of Probability
        B.2.1  The Parametric Method
        B.2.2  The Non-parametric Method
        B.2.3  The Semi-Parametric Method
    B.3  The Bayesian Inference
    B.4  The Robbins-Monro Algorithm
    B.5  Learning Vector Quantization

Bibliography

Index
Part I
ANN Architectures

Chapter 1
Basic Neuronal Dynamics

➧ 1.1 Simple Neurons and Networks
First attempts at building artificial neural networks (ANN) were motivated by the desire to create models for natural brains. Much later it was discovered that ANN are a very general statistical¹ framework for modelling posterior probabilities given a set of samples (the input data).
The basic building block of an (artificial) neural network (ANN) is the neuron. A neuron is a processing unit which has some (usually more than one) inputs and only one output. See figure 1.1. First, each input $x_i$ is weighted by a factor $w_i$ and the whole sum of inputs is calculated: $a = \sum_{\text{all inputs}} w_i x_i$. Then an activation function f is applied to the result a. The neuronal output is taken to be f(a).
Generally, ANN are built by putting the neurons in layers and connecting the outputs of neurons from one layer to the inputs of the neurons from the next layer. See figure 1.2. The type of network depicted there is also named feedforward (a feedforward network does not have feedback, i.e. no "loops"). Note that there is no processing on layer 0: its role is just to distribute the inputs to the next layer (data processing really starts with layer 1); for this reason its representation will be omitted most of the time.
Variations are possible: the output of one neuron may go to the input of any neuron, including itself; if the outputs of neurons from one layer go to the inputs of neurons from previous layers then the network is called recurrent, this providing feedback; lateral feedback occurs when the output of one neuron goes to the other neurons of the same layer².

1.* For more information see also [BB95], which provides some detailed theoretical neuronal models for true neurons.
¹ The second part of this book explains in greater detail how the ANN output has a statistical significance.
² This is used in the SOM/Kohonen architecture.
Figure 1.1: The neuron. The inputs $x_i$ enter the weighted sum unit, which computes $a = \sum_i w_i x_i$; the activation unit then applies f, producing the output f(a).

Figure 1.2: The general layout of a (feedforward) neural network. Layer 0 distributes the input to the input layer 1. The output of the network is (generally) the output of the output layer L (the last layer).
So, to compute the output, an "activation function" is applied to the weighted sum of inputs:

$$\text{total input: } a = \sum_{\text{all inputs}} w_i x_i$$
$$\text{output} = \text{activation function}\Big(\sum_{\text{all inputs}} w_i x_i\Big) = f(a)$$
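As an illustration of this two-stage computation, here is a minimal sketch in Python/NumPy (the book's own example programs are in Scilab; this translation and the names `neuron_output`, `x`, `w` are illustrative assumptions, using the logistic activation introduced in the next section):

```python
import numpy as np

def neuron_output(x, w, c=1.0):
    """One neuron: weighted sum of inputs followed by an activation function."""
    a = np.dot(w, x)                      # total input: a = sum_i w_i x_i
    return 1.0 / (1.0 + np.exp(-c * a))   # logistic activation f(a)

x = np.array([0.5, -1.0, 2.0])            # inputs
w = np.array([0.1, 0.4, -0.3])            # weights
print(neuron_output(x, w))                # the neuronal output f(a)
```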
More general designs are possible, e.g. higher order ANNs, where the total input to a neuron also contains higher order combinations between inputs (e.g. 2nd order terms, of the form $w_{ij} x_i x_j$); however, these are seldom used in practice as they involve huge computational effort without clear-cut benefits.
The tunable parameters of an ANN are the weights $\{w_i\}$. They are found by different mathematical procedures³ by using a given set of data. The procedure of finding the weights is named learning or training. The data set is called the learning or training set and contains pairs of input vectors associated with the desired output vectors: $\{(\mathbf{x}_i, \mathbf{y}_i)\}$. Some ANN architectures do not need a learning set in order to set their weights; in this case the learning is said to be unsupervised (otherwise the learning is supervised).

³ Most usual is the gradient-descent method and its derivatives.
✍ Remarks:
➥ Usually the inputs are distributed to all neurons of the first layer, this one being called the input layer (layer 1 in figure 1.2). Some networks may use an additional layer whose neurons each receive one single component of the total input and distribute it to all neurons of the input layer. This layer may be seen as a sensor layer and (usually) doesn't do any (real) processing. In some other architectures the input layer is also the sensor layer. Unless otherwise specified it will be omitted.
➥ The last layer is called the output layer. The output set of the output neurons is (commonly) the desired output (of the network).
➥ The layers between input and output are called hidden layers.

➧ 1.2 Neurons as Functions
Neurons behave as functions: they transduce an unbounded input activation x(t) at a time t into a bounded output signal f(x(t)). Usually a sigmoidal or S-shaped curve, as in figure 1.3, describes the transduction. This function (f) is called the activation or signal function.
The most used function is the logistic signal function:

$$f(a) = \frac{1}{1 + e^{-ca}}$$

which is sigmoidal and strictly increasing for positive scaling constant c > 0. Strict monotonicity implies that the activation derivative of f is positive:

$$f' \equiv \frac{df}{da} = c\,f\,(1 - f) > 0$$
The threshold signal function (dashed line in figure 1.3) illustrates a non-differentiable signal function. The family of logistic signal functions, indexed by c, approaches asymptotically the threshold function as c → +∞. Then f transduces positive activation signals a to unity signals and negative activations to zero signals.
Figure 1.3: Signal f(a) as a bounded monotone-nondecreasing function of activation a. The dashed curve defines a threshold signal function.
A discontinuity occurs at the zero activation value (which equals the signal function's "threshold"). Zero activation values seldom occur in large neural networks⁴.

The signal velocity df/dt, denoted by $\dot{f}$, measures the signal's instantaneous time change. $\dot{f}$ is the product between the change in the signal due to the activation and the change in the activation with time:

$$\dot{f} = \frac{df}{da}\,\frac{da}{dt} = f'\,\dot{a}$$

⁴ Threshold activation functions were used in early developments of ANN, e.g. perceptrons; however, because they were not differentiable, they represented an obstacle in the development of ANNs till the sigmoidal functions were adopted and gradient descent techniques (for weight adaptation) were developed.

➧ 1.3 Common Signal Functions
The following activation functions are more often encountered in practice:
1. Logistic:
$$f(a) = \frac{1}{1 + e^{-ca}}$$
where c > 0, c = const. is a positive scaling constant. The activation derivative is $f' = \frac{df}{da} = c\,f\,(1 - f)$ and so f is monotone increasing (f' > 0). This function is the most common one.

2. Hyperbolic-tangent:
$$f(a) = \tanh(ca) = \frac{e^{ca} - e^{-ca}}{e^{ca} + e^{-ca}}$$
where c > 0, c = const. is a positive scaling constant. The activation derivative is $f' = \frac{df}{da} = c\,(1 - f^2) > 0$ and so f is monotone increasing (f < 1).
3. Threshold:
$$f(a) = \begin{cases} 1 & \text{if } a > 1/c \\ 0 & \text{if } a < 0 \\ ca & \text{otherwise } (a \in [0, 1/c]) \end{cases}$$
where c > 0, c = const. is a positive scaling constant. The activation derivative is:
$$f'(a) = \frac{df}{da} = \begin{cases} 0 & \text{if } a \in (-\infty, 0) \cup [1/c, \infty) \\ c & \text{otherwise} \end{cases}$$
Note that this is not a true threshold function, as it has a non-infinite slope between 0 and 1/c.
4. Exponential-distribution:
$$f(a) = \max(0,\; 1 - e^{-ca})$$
where c > 0, c = const. is a positive scaling constant. The activation derivative is:
$$f'(a) = \frac{df}{da} = c\,e^{-ca}$$
and for a > 0, supra-threshold signals are monotone increasing (f' > 0). Note: since the second derivative $f'' = -c^2 e^{-ca}$ is negative, the exponential-distribution function is strictly concave for a > 0.
5. Ratio-polynomial:
$$f(a) = \max\left(0,\; \frac{a^n}{c + a^n}\right) \quad\text{for } n > 1$$
where c > 0, c = const. The activation derivative is:
$$f'(a) = \frac{df}{da} = \frac{c\,n\,a^{n-1}}{(c + a^n)^2}$$
and for positive activations, supra-threshold signals are monotone increasing.
6. Pulse-coded: In biological neuronal systems the information seems to be carried by pulse trains rather than individual pulses. Train-pulse coded information can be decoded more reliably than shape-pulse coded information (arriving individual pulses can be somewhat corrupted in shape and still be accurately decoded as present or absent). The exponentially weighted time average of sampled binary pulses function is:
$$f(t) = \int_{-\infty}^{t} g(s)\, e^{s-t}\, ds$$
t being time, where the function g is:
$$g(t) = \begin{cases} 1 & \text{if a pulse occurs at } t \\ 0 & \text{if no pulse at } t \end{cases}$$
and equals one if a pulse arrives at time t or zero if no pulse arrives. The pulse-coded signal function is f(t): [0, ∞) → [0, 1].

Proof. If g(t) = 0, ∀t, then f(t) = 0 (trivial). If g(t) = 1, ∀t, then
$$f(t) = \int_{-\infty}^{t} e^{s-t}\, ds = e^{t-t} - \lim_{s \to -\infty} e^{s-t} = 1$$
When the number of arriving pulses increases, the "pulse count" can only increase, so f is monotone nondecreasing.
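The signal functions listed above are easy to collect in code. Below is a sketch in Python/NumPy, not from the book (whose accompanying programs are in Scilab); each helper returns the function value together with the activation derivative given in the text:

```python
import numpy as np

def logistic(a, c=1.0):
    f = 1.0 / (1.0 + np.exp(-c * a))
    return f, c * f * (1.0 - f)                  # f' = c f (1 - f)

def hyperbolic_tangent(a, c=1.0):
    f = np.tanh(c * a)
    return f, c * (1.0 - f ** 2)                 # f' = c (1 - f^2)

def threshold(a, c=1.0):
    f = np.clip(c * a, 0.0, 1.0)                 # 0 below 0, ca on [0, 1/c], 1 above
    return f, np.where((a > 0) & (a < 1.0 / c), c, 0.0)

def exponential_distribution(a, c=1.0):
    f = np.maximum(0.0, 1.0 - np.exp(-c * a))
    return f, c * np.exp(-c * a)                 # derivative valid for a > 0

def ratio_polynomial(a, c=1.0, n=2):
    f = np.maximum(0.0, a ** n / (c + a ** n))
    return f, c * n * a ** (n - 1) / (c + a ** n) ** 2
```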
Chapter 2
The Backpropagation Network

The backpropagation network represents one of the most classical examples of an ANN, being also one of the simplest in terms of overall design.
➧ 2.1 Network Structure
The network consists of several layers of neurons. The first one (let it be layer 0) distributes the inputs to the input layer 1. There is no processing in layer 0: it can be seen just as a sensory layer, each neuron receiving just one component of the input (vector) x, which gets distributed, unchanged, to all neurons of the input layer. The last layer is the output layer, which outputs the processed data, each output of an individual output neuron being a component of the output vector y. The layers between the input one and the output one are hidden layers.
✍ Remarks:
➥ Layer 0 has to have the same number of neurons as the number of input components (the dimension of the input vector x).
➥ The output layer has to have the same number of neurons as the desired output (i.e. the dimension of the output vector y dictates the number of neurons in the output layer).
➥ In general the input and hidden layers may have any number of neurons; however, their number may be chosen to achieve some special effects in some practical cases.

The network is a straight feedforward network: each neuron receives as input the outputs of all neurons from the previous layer (excepting the first, sensory, layer). See figure 2.1.
Figure 2.1: The backpropagation network structure (input $\mathbf{x}_p$, layers 0 to L of widths $N_\ell$, weights $w_{\ell kj}$, neuron outputs $z_{\ell k}$, target $\mathbf{t}_p$).
The following notations are used:

$z_{\ell j}$ is the output of neuron j from layer ℓ.
$w_{\ell kj}$ is the weight by which the output of neuron j from layer ℓ−1 contributes to the input of neuron k from layer ℓ.
$\mathbf{x}_p$ is training input vector no. p.
$\mathbf{t}_p(\mathbf{x}_p)$ is the target (desired output) vector no. p (at training time).
$z_{0i}$ is component i of the input vector. By notation, at training time, $z_{0i} \equiv x_{pi}$, where $x_{pi}$ is component i of one of the input vectors, for some p.
$N_\ell$ is the number of neurons in layer ℓ.
L is the number of layers (the input layer is no. 0, the output layer is no. L).
P is the number of training vectors, p = 1,…,P.

The learning set is (according to the above notation) $\{(\mathbf{x}_p, \mathbf{t}_p)\}_{p=1,\dots,P}$.

➧ 2.2 Network Dynamics

2.2.1 Neuron Output Function
The activation function used is usually the logistic:

$$f(a) = \frac{1}{1 + \exp(-ca)}\;,\qquad f: \mathbb{R} \to (0, 1)\;,\quad c > 0,\ c = \text{const.} \qquad (2.1)$$
$$\frac{df}{da} = \frac{c\,\exp(-ca)}{[1 + \exp(-ca)]^2} = c\,f(a)\,[1 - f(a)] \qquad (2.2)$$

but note that the backpropagation algorithm is not particularly tied to it.
2.2.2 Network Running Function

Each neuron computes at its output the weighted sum of its inputs, to which it applies the signal function (see also figure 2.1):

$$z_{\ell k} = f\left(\sum_{j=1}^{N_{\ell-1}} w_{\ell kj}\, z_{\ell-1,j}\right) \qquad (2.3)$$

It must be computed in succession for each layer, starting from the input layer and going through all layers in succession until the output layer is reached.

A more compact matrix notation may be developed as follows. For each layer ℓ (except 0), a matrix of weights $W_\ell$ is built as:

$$W_\ell = \begin{pmatrix} w_{\ell 11} & \cdots & w_{\ell 1 N_{\ell-1}} \\ \vdots & \ddots & \vdots \\ w_{\ell N_\ell 1} & \cdots & w_{\ell N_\ell N_{\ell-1}} \end{pmatrix}$$

(note that all weights associated with a particular neuron are on the same row); then, considering the output vector of the previous layer $z_{\ell-1}^T = \begin{pmatrix} z_{\ell-1,1} & \cdots & z_{\ell-1,N_{\ell-1}} \end{pmatrix}$, the output of the actual layer ℓ may be calculated as:

$$z_\ell^T = f(a_\ell^T) = \begin{pmatrix} f(a_{\ell 1}) & \cdots & f(a_{\ell N_\ell}) \end{pmatrix}\;,\qquad\text{where } a_\ell \equiv W_\ell\, z_{\ell-1}.$$
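In code, the matrix form of the running function is a short loop over the layers. A minimal Python/NumPy sketch (illustrative, not the book's Scilab code; biases are ignored here, see section 2.4):

```python
import numpy as np

def run_network(x, weights, c=1.0):
    """Forward pass: z_l = f(W_l z_{l-1}) with z_0 = x.
    `weights` is the list [W_1, ..., W_L]; W_l has shape (N_l, N_{l-1}).
    Returns the outputs z_l of every layer (the last entry is the network output)."""
    zs = [x]
    for W in weights:
        a = W @ zs[-1]                           # total input a_l = W_l z_{l-1}
        zs.append(1.0 / (1.0 + np.exp(-c * a)))  # logistic f applied element-wise
    return zs
```

Returning every $z_\ell$ (rather than just $z_L$) is deliberate: the learning function of the next sub-section needs all the intermediate layer outputs.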
2.2.3 Network Learning Function
The network learning process is supervised, i.e. the network receives (at the training phase) both the raw data as inputs and the targets as outputs. The learning involves adjusting the weights so that errors will be minimized. The function used to measure errors is usually the sum-of-squares defined below, but note that the backpropagation algorithm is not particularly tied to it.

Definition 2.2.1. For an input pattern x and the associated target t, the sum-of-squares error function E(W) (E is dependent on all weights W) is defined as:

$$E(W) \equiv \frac{1}{2} \sum_{q=1}^{N_L} \left[z_{Lq}(\mathbf{x}) - t_q(\mathbf{x})\right]^2$$

where $z_{Lq}$ is the output of neuron q from the output layer, i.e. component q of the output vector.

Note that all components of the input vector will influence any component of the output vector, thus $z_{Lq} = z_{Lq}(\mathbf{x})$.

✍ Remarks:
➥ Considering all learning samples (the full training set), the total sum-of-squares error $E_{\text{tot.}}(W)$ ($E_{\text{tot.}}$ is also dependent on all weights, as E is) is defined as:
$$E_{\text{tot.}}(W) \equiv \sum_{p=1}^{P} E_p(W) = \frac{1}{2} \sum_{p=1}^{P} \sum_{q=1}^{N_L} \left[z_{Lq}(\mathbf{x}_p) - t_q(\mathbf{x}_p)\right]^2$$
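In code, the two error functions are one-liners; a small Python/NumPy sketch (illustrative names, not from the book):

```python
import numpy as np

def sum_of_squares(z_L, t):
    """Per-pattern sum-of-squares error E(W) of definition 2.2.1."""
    return 0.5 * np.sum((z_L - t) ** 2)

def total_error(outputs, targets):
    """Total error E_tot. over the whole training set."""
    return sum(sum_of_squares(z, t) for z, t in zip(outputs, targets))
```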
The network weights are found (weights are adapted/changed) step by step. Considering $N_W$ the total number of weights, the error function $E: \mathbb{R}^{N_W} \to \mathbb{R}$ may be represented as a surface in the $\mathbb{R}^{N_W+1}$ space. The gradient vector $\nabla E \equiv \left\{\frac{\partial E(W)}{\partial w_{\ell ji}}\right\}$ shows the direction of the (local) maximum of the square mean error, and the $\{w_{\ell ji}\}$ are to be changed in the opposite direction (so a "−" sign has to be used); see also figure 2.3.

In the discrete time t approximation, at step t+1, given the weights at step t, the weights are adjusted as:

$$w_{\ell ji}(t+1) = w_{\ell ji}(t) - \eta\,\frac{\partial E(W)}{\partial w_{\ell ji}}\bigg|_{W(t)} = w_{\ell ji}(t) - \eta \sum_{p=1}^{P} \frac{\partial E_p(W)}{\partial w_{\ell ji}}\bigg|_{W(t)} \qquad (2.4)$$

where η = const., η > 0 is named the learning constant and is used for tuning the speed and quality of the learning process.

In matrix notation the above equation may be written simply as:

$$W(t+1) = W(t) - \eta\,\nabla E \qquad (2.5)$$

because the error gradient may also be considered a matrix or tensor; it has the same dimensions as W.
✍ Remarks:
➥ The above method does not provide a starting point. In practice, weights are initialized with small random values (usually in the [−1, 1] interval).
➥ Equation (2.5) represents the basics of weight adjusting, i.e. learning, in many ANN architectures; it is known as the delta rule:
$$\Delta W = W(t+1) - W(t) \propto -\nabla E$$
➥ If η is too small then the learning is slow, and the learning process may stop in a local minimum (of E), being unable to overtake a local maximum. See figure 2.3.
➥ If η is too large then the learning is fast, but it may jump over local minima (of E) which may be deeper than the next one. See figure 2.3 (note that in general E is a surface).
➥ Another point to consider is the problem of oscillations. When approaching an error minimum, a learning step may overshoot it, and the next one may overshoot it again, bringing the weights back to a point similar to the previous one. The net result is that the weights are changed to values around the minimum but are never able to reach it. See figure 2.2. This problem is particularly likely to occur for deep and narrow minima, because in that case ∇E is large (deep means steep) and consequently ΔW is large (narrow means easy to overshoot).

Figure 2.2: Oscillations in the learning process. The weights move around the E minimum, from w(t) to w(t+1), w(t+2), w(t+3), …, without being able to reach it (the arrows show the jumps made by the learning process).

Figure 2.3: E(w), the total square error, as a function of the weights; ∇E points towards the (local) maximum.

The problem is to find the error gradient ∇E. Considering the "standard" approach, i.e. $\{\nabla E\}_{\ell ji} \simeq \frac{\Delta E}{\Delta w_{\ell ji}}$ for some small $\Delta w_{\ell ji}$, this would require a computational time of order $O(N_W^2)$, because each calculation of E requires $O(N_W)$ operations and it has to be repeated for each $w_{\ell ji}$ in turn.

The importance of the backpropagation algorithm resides in the fact that it reduces the computational time for ∇E to $O(N_W)$, thus greatly improving the speed of learning.
Theorem 2.2.1 (Backpropagation algorithm). For each layer ℓ (except 0, the input), an error gradient matrix may be built as follows:

$$(\nabla E)_\ell \equiv \begin{pmatrix} \frac{\partial E}{\partial w_{\ell 11}} & \cdots & \frac{\partial E}{\partial w_{\ell 1 N_{\ell-1}}} \\ \vdots & \ddots & \vdots \\ \frac{\partial E}{\partial w_{\ell N_\ell 1}} & \cdots & \frac{\partial E}{\partial w_{\ell N_\ell N_{\ell-1}}} \end{pmatrix}\;,\qquad \ell = 1,\dots,L$$

For each layer except L, the error gradient with respect to the neuronal outputs may be defined as:

$$\nabla_{z_\ell} E \equiv \begin{pmatrix} \frac{\partial E}{\partial z_{\ell 1}} & \cdots & \frac{\partial E}{\partial z_{\ell N_\ell}} \end{pmatrix}^T\;,\qquad \ell = 1,\dots,L-1$$

The error gradient with respect to the network output $z_L$ is considered to be known and dependent only on the network outputs $\{z_L(\mathbf{x}_p)\}$ and the set of targets $\{\mathbf{t}_p\}$:

$$\nabla_{z_L} E = \text{known.}$$

Then, considering the error function E, the activation function f and its total derivative f', the error gradient may be computed recursively according to the formulas:

$$\nabla_{z_\ell} E = W_{\ell+1}^T \left[\nabla_{z_{\ell+1}} E \odot f'(a_{\ell+1})\right] \quad\text{calculated recursively from } L-1 \text{ to } 1 \qquad (2.6a)$$
$$(\nabla E)_\ell = \left[\nabla_{z_\ell} E \odot f'(a_\ell)\right] z_{\ell-1}^T \quad\text{for layers } \ell = 1,\dots,L \qquad (2.6b)$$

where $z_0 \equiv \mathbf{x}$.
Proof. The error E(W) depends on $w_{\ell ji}$ through the output of neuron (ℓ, j), i.e. $z_{\ell j}$:

$$\frac{\partial E}{\partial w_{\ell ji}} = \frac{\partial E}{\partial z_{\ell j}}\,\frac{\partial z_{\ell j}}{\partial w_{\ell ji}}$$

and each derivative is computed separately.

❐ The term $\frac{\partial z_{\ell j}}{\partial w_{\ell ji}}$:

$$\frac{\partial z_{\ell j}}{\partial w_{\ell ji}} = \frac{\partial}{\partial w_{\ell ji}}\left[f\left(\sum_{m=1}^{N_{\ell-1}} w_{\ell jm} z_{\ell-1,m}\right)\right] = f'\left(\sum_{m=1}^{N_{\ell-1}} w_{\ell jm} z_{\ell-1,m}\right) \frac{\partial}{\partial w_{\ell ji}}\left(\sum_{m=1}^{N_{\ell-1}} w_{\ell jm} z_{\ell-1,m}\right) = f'(a_{\ell j})\, z_{\ell-1,i}$$

because the weights are mutually independent.

❐ The term $\frac{\partial E}{\partial z_{\ell j}}$: neuron $z_{\ell j}$ affects E through all following layers that are intermediate between layer ℓ and the output (the influence being exercised through the interposed neurons). First affected is the next, ℓ+1, layer, through the terms $z_{\ell+1,m}$ (and then the dependency is carried on through the next successive layers):

$$\frac{\partial E}{\partial z_{\ell j}} = \sum_{m=1}^{N_{\ell+1}} \frac{\partial E}{\partial z_{\ell+1,m}}\,\frac{\partial z_{\ell+1,m}}{\partial z_{\ell j}} = \sum_{m=1}^{N_{\ell+1}} \frac{\partial E}{\partial z_{\ell+1,m}}\, f'\left(\sum_{q=1}^{N_\ell} w_{\ell+1,mq} z_{\ell q}\right) \frac{\partial}{\partial z_{\ell j}}\left(\sum_{q=1}^{N_\ell} w_{\ell+1,mq} z_{\ell q}\right) = \sum_{m=1}^{N_{\ell+1}} \frac{\partial E}{\partial z_{\ell+1,m}}\, f'(a_{\ell+1,m})\, w_{\ell+1,mj}$$

which represents exactly element j of the column matrix $\nabla_{z_\ell} E$ as built from (2.6a). The above formula applies iteratively from layer L−1 to 1; for layer L, $\nabla_{z_L} E$ is assumed known.

Finally, the desired derivative is:

$$\frac{\partial E}{\partial w_{\ell ji}} = \frac{\partial E}{\partial z_{\ell j}}\,\frac{\partial z_{\ell j}}{\partial w_{\ell ji}} = \left[\sum_{m=1}^{N_{\ell+1}} \frac{\partial E}{\partial z_{\ell+1,m}}\, f'(a_{\ell+1,m})\, w_{\ell+1,mj}\right] f'(a_{\ell j})\, z_{\ell-1,i}$$

representing the element found at row j, column i of matrix $(\nabla E)_\ell$ as built from (2.6b).
Proposition 2.2.1. If using the logistic activation function and the sum-of-squares error function, then the error gradient may be computed recursively according to the formulas:

$$\nabla_{z_L} E = z_L(\mathbf{x}) - \mathbf{t} \qquad (2.7a)$$
$$\nabla_{z_\ell} E = c\, W_{\ell+1}^T \left[\nabla_{z_{\ell+1}} E \odot z_{\ell+1} \odot (\hat{1} - z_{\ell+1})\right] \quad\text{for } \ell = 1,\dots,L-1 \qquad (2.7b)$$
$$(\nabla E)_\ell = c \left[\nabla_{z_\ell} E \odot z_\ell \odot (\hat{1} - z_\ell)\right] z_{\ell-1}^T \quad\text{for } \ell = 1,\dots,L \qquad (2.7c)$$

where $z_0 \equiv \mathbf{x}$.

Proof. From definition 2.2.1:

$$\frac{\partial E}{\partial z_{Lj}} = z_{Lj} - t_j \;\Rightarrow\; \nabla_{z_L} E = z_L - \mathbf{t}$$

By using (2.2) in the main results (2.6a) and (2.6b) of theorem 2.2.1, and considering that $f(a_\ell) = z_\ell$, the other two formulas follow immediately.
2.2.4 Initialization and Stop

Weights are initialized (in practice) with small random values and the adjusting process continues by iteration. The stopping of the learning process can be done by one of the following methods:
➀ choosing a fixed number of steps t = 1,…,T;
➁ continuing the learning process until the adjusting quantity $\Delta w_{\ell ji} = w_{\ell ji}(\text{at time } t+1) - w_{\ell ji}(\text{at time } t)$ is under some specified value, ∀ℓ, ∀j, ∀i;
➂ stopping when the total error, e.g. the total sum-of-squares $E_{\text{tot.}}$, attains a minimum on a test set not used for learning.

✍ Remarks:
➥ If the trained network performs well on the training set but has bad results on previously unseen patterns (i.e. it has poor generalization capabilities), then this is usually a sign of overtraining (assuming, of course, that the network is reasonably built and there is a sufficient number of training patterns).
➧ 2.3 The Algorithm

The algorithm is based on the discrete time approximation, i.e. time is t = 0, 1, 2, ….
The activation and error functions and the stop condition are presumed to be chosen (known) and fixed.

Network running procedure:
1. The input layer is initialized, i.e. the output of the input layer is made to be x: $z_0 \equiv \mathbf{x}$. For all layers ℓ = 1,…,L, starting with the first layer 1, do: $z_\ell = f(W_\ell z_{\ell-1})$.
2. The output of the network is taken to be the output of the output layer, i.e. $\mathbf{y} \equiv z_L$.

Network learning procedure:
1. Initialize all $\{w_{\ell ji}\}$ weights with (small) random values.
2. For all training sets $(\mathbf{x}_p, \mathbf{t}_p)$ (as long as the stop condition is not met) do:
(a) Run the network to find the activations of all neurons $a_\ell$ and then the derivatives $f'(a_\ell)$. The network output $\mathbf{y}_p \equiv z_L(\mathbf{x}_p) = f(a_L)$ will also be required in the next step.
NOTE: The algorithm requires the derivatives of the activation functions for all neurons. For most activation functions this may be expressed in terms of the activation itself, i.e. $f'(a_\ell) = g(z_\ell)$, as is the case for the logistic, see (2.2). This approach may reduce memory usage or increase speed (or both, in the case of the logistic function).
(b) Using $(\mathbf{y}_p, \mathbf{t}_p)$, calculate $\nabla_{z_L} E$; e.g. for sum-of-squares use (2.7a).
(c) Compute the error gradient. For the output layer, $(\nabla E)_L$ is calculated directly from (2.6b) (or from (2.7c) for sum-of-squares and logistic). For all other layers ℓ = 1,…,L−1, going backwards from L−1 to 1, first calculate $\nabla_{z_\ell} E$ using (2.6a) (or (2.7b) for sum-of-squares and logistic), then calculate $(\nabla E)_\ell$ using (2.6b) (or respectively (2.7c)).
(d) Update the $W_\ell$ weights according to the delta rule (2.5).
(e) Check the stop condition and exit if it has been met.
A sketch of this procedure in code is given below.
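The following Python/NumPy sketch is illustrative, not the book's Scilab code: it implements the learning procedure for the common case of logistic activation and sum-of-squares error, i.e. equations (2.7a)-(2.7c) and the delta rule (2.5), without biases and with per-pattern updates rather than the full summed gradient of (2.4):

```python
import numpy as np

def train(patterns, targets, weights, eta=0.5, c=1.0, epochs=1000):
    """Backpropagation learning: `weights` = [W_1, ..., W_L], updated in place."""
    for _ in range(epochs):                      # stop condition: fixed number of steps
        for x, t in zip(patterns, targets):
            zs = [x]                             # forward pass, keeping z_0 = x
            for W in weights:
                zs.append(1.0 / (1.0 + np.exp(-c * (W @ zs[-1]))))
            grad_z = zs[-1] - t                  # (2.7a)
            for l in range(len(weights) - 1, -1, -1):
                # delta = c * grad_z (Hadamard) z_l (Hadamard) (1 - z_l), cf. (2.7b)/(2.7c)
                delta = c * grad_z * zs[l + 1] * (1.0 - zs[l + 1])
                grad_z = weights[l].T @ delta    # (2.7b): gradient for the layer below
                weights[l] -= eta * np.outer(delta, zs[l])   # (2.7c) + delta rule (2.5)
    return weights

# Example usage: a 3-5-2 network on a toy training set
rng = np.random.default_rng(0)
W = [rng.uniform(-1, 1, (5, 3)), rng.uniform(-1, 1, (2, 5))]
X = [rng.uniform(0, 1, 3) for _ in range(8)]
T = [rng.uniform(0.1, 0.9, 2) for _ in range(8)]
train(X, T, W, eta=0.5, epochs=200)
```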
✍
1
Remarks:
➥
In most cases1 a better performance is obtained when training repeatedly with
the whole training set. A shuing of patterns is recommended, between repeats.
But e.g. not when patterns form a time series.
2.3. THE ALGORITHM
➥
17
Trying to stop error backpropagation when is below some threshold value may
also improve learning, e.g. in case of sum-of-squares the rzL E may be changed
to:
@E
@zLq
➥
(
=
➥
0
if jzLq ; tq j >
otherwise
for q = 1; NL
i.e. rounding towards 0 the elements of rzL E smaller than .
The classical way of calculating the gradient, i.e.
@E
@w`ji
➥
zLq ; tq
E (w`ji + ") 2;" E (w`ji ; ") ; " & 0
while too slow for direct usage, is an excellent tool for checking the correctness
of algorithm implementation.
There are not (yet) good theoretical methods of choosing the learning parameters
(constants) and . The practical, hands-on, approach is still the best. Usual
values for are in the range [0:1; 1] (but some networks may learn even faster
with > 1) and [0; 0:1] for .
➥ In accordance with the neuron output function (2.1), the output of a neuron has values within (0, 1) (in practice, due to rounding errors, the range is in fact [0, 1]). If the desired outputs have values within [0, ∞) then the following transforming function may be used:

  y(x) = 1 − exp(−αx) , α > 0 , α = const.

which has the inverse:

  y⁻¹(x) = (1/α) ln[1/(1 − x)]

➥ The same procedure described above may be used for the inputs. This kind of transformation can be used each time the desired input/output falls beyond the neuron activation function range.
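A small usage sketch of this transformation (Python/NumPy; the value of α is chosen arbitrarily here):

    import numpy as np

    alpha = 0.5                                    # assumed constant, alpha > 0
    y     = lambda x: 1 - np.exp(-alpha * x)       # maps [0, inf) into [0, 1)
    y_inv = lambda x: np.log(1 / (1 - x)) / alpha  # the inverse transform

    t_raw    = np.array([0.0, 1.5, 4.0])  # desired outputs beyond (0, 1)
    t_scaled = y(t_raw)                   # targets actually used in training
    t_back   = y_inv(t_scaled)            # recovers t_raw (up to rounding)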
➥ By no means is reaching the absolute minimum of E guaranteed. First, the training set is limited and the error minimum with respect to the learning set will generally not coincide with the minimum considering all possible patterns, but in most cases it should be close enough for practical applications.
On the other hand, the error surface has symmetries: e.g. swapping two neurons from the same layer (or in fact their weights) will not affect the network performance, so the algorithm will not search through the whole weight space but rather a small area of it. This is also the reason why the starting point, given randomly, will not substantially affect the learning process.
Figure 2.4: Bias may be emulated with the help of an additional neuron z_{ℓ0} whose output is always 1 and is distributed to all neurons of the next layer (exactly as for a "regular" neuron).
➧ 2.4 Bias

Some problems, while having an obvious solution, cannot be solved with the architecture described above (e.g. the tight encoder described later). To solve these, the neuronal activation (2.3) is changed to:

  z_{ℓk} = f( w_{ℓk0} + Σ_{j=1}^{N_{ℓ−1}} w_{ℓkj} z_{ℓ−1,j} )    (2.8)

and the new parameter w_{ℓk0} introduced is named the bias.
As may be immediately seen, the change is equivalent to inserting a new neuron z_{ℓ0}, whose activation (output) is always 1, on all layers except the output. See figure 2.4.
The required changes in the neuronal outputs and weight matrices are:

  z̃_ℓᵀ = ( 1  z_{ℓ1}  …  z_{ℓN_ℓ} )

and

  W̃_ℓ = ( w_{ℓ10}     w_{ℓ11}     …  w_{ℓ1N_{ℓ−1}}
          ⋮            ⋮               ⋮
          w_{ℓN_ℓ0}   w_{ℓN_ℓ1}   …  w_{ℓN_ℓN_{ℓ−1}} )

the biases being added as a first column in W_ℓ; the neuronal output is then calculated as z_ℓ = f(a_ℓ) = f(W̃_ℓ z̃_{ℓ−1}).
The error gradient matrix (∇̃E)_ℓ associated with W̃_ℓ is:

  (∇̃E)_ℓ = ( ∂E/∂w_{ℓ10}     ∂E/∂w_{ℓ11}     …  ∂E/∂w_{ℓ1N_{ℓ−1}}
              ⋮                ⋮                   ⋮
              ∂E/∂w_{ℓN_ℓ0}   ∂E/∂w_{ℓN_ℓ1}   …  ∂E/∂w_{ℓN_ℓN_{ℓ−1}} )

Following the changes from above, the backpropagation theorem becomes:
Theorem 2.4.1. Backpropagation with biases. If the error gradient with respect to the neuronal outputs ∇_{z_L}E is known, and depends only on the (actual) network outputs {z_L(x_p)} and targets {t_p}:

  ∇_{z_L}E = known

then the error gradient (with respect to the weights) may be calculated recursively according to the formulas:

  ∇_{z_ℓ}E = W_{ℓ+1}ᵀ [∇_{z_{ℓ+1}}E ⊙ f′(a_{ℓ+1})]   calculated recursively from L−1 to 1    (2.9a)
  (∇̃E)_ℓ = [∇_{z_ℓ}E ⊙ f′(a_ℓ)] z̃_{ℓ−1}ᵀ   for layers ℓ = 1, …, L    (2.9b)

where z₀ ≡ x.
Proof. See theorem 2.2.1 and its proof. Equation (2.9a) results directly from (2.6a).
Columns 2 to N_{ℓ−1}+1 of (∇̃E)_ℓ represent (∇E)_ℓ given by (2.6b).
The only terms remaining to be calculated are those of the first column of (∇̃E)_ℓ, i.e. terms of the form ∂E/∂w_{ℓj0}, j being the row index. But these terms may be written as (see the proof of theorem 2.2.1):

  ∂E/∂w_{ℓj0} = (∂E/∂z_{ℓj}) (∂z_{ℓj}/∂w_{ℓj0})

where ∂E/∂z_{ℓj} is term j of ∇_{z_ℓ}E, already calculated, and from (2.8):

  ∂z_{ℓj}/∂w_{ℓj0} = f′(a_{ℓj}) · 1 = f′(a_{ℓj})

As z̃_{ℓ−1,1} ≡ 1 (by construction), formula (2.9b) proves correct.
Proposition 2.4.1. If using the logistic activation function and the sum-of-squares error function then the error gradient may be computed recursively using the formulas:

  ∇_{z_L}E = z_L(x) − t    (2.10a)
  ∇_{z_ℓ}E = c W_{ℓ+1}ᵀ [∇_{z_{ℓ+1}}E ⊙ z_{ℓ+1} ⊙ (1̂ − z_{ℓ+1})]   for ℓ = 1, …, L−1    (2.10b)
  (∇̃E)_ℓ = c [∇_{z_ℓ}E ⊙ z_ℓ ⊙ (1̂ − z_ℓ)] z̃_{ℓ−1}ᵀ   for ℓ = 1, …, L    (2.10c)

where z₀ ≡ x.
Proof. It is proved in the same way as proposition 2.2.1, but using theorem 2.4.1 instead.
✍ Remarks:
➥ The algorithm for a backpropagation ANN with biases is (mutatis mutandis) identical to the one described in section 2.3.
➥ In practice, biases are usually initialized with 0.
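A sketch of the corresponding forward run with the augmented matrices (hypothetical Python/NumPy code, not from the book):

    import numpy as np

    def logistic(a, c=1.0):
        return 1.0 / (1.0 + np.exp(-c * a))

    def run_with_bias(weights_tilde, x):
        # weights_tilde = [W~_1, ..., W~_L]; W~_l has shape (N_l, N_{l-1} + 1)
        # and its first column holds the biases w_{lk0}, cf. (2.8)
        z = x
        for W in weights_tilde:
            z_tilde = np.concatenate(([1.0], z))  # z~_{l-1}: prepend the constant 1
            z = logistic(W @ z_tilde)             # z_l = f(W~_l z~_{l-1})
        return z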
➧ 2.5 Algorithm Enhancements

2.5.1 Momentum

See [BTW95] p. 50.

The weight adaptation described in standard ("vanilla") backpropagation (see section 2.2.3) is very sensitive to small perturbations. If a small "bump" appears in the error surface, the algorithm is unable to jump over it and it will change direction.
This situation is avoided by taking into account the previous adaptations in the learning process (see also (2.5)):

  ΔW(t) = W(t+1) − W(t) = −η ∇E|_{W(t)} + μ ΔW(t−1)    (2.11)

This procedure is named backpropagation with momentum, and μ ∈ [0, 1) is named the momentum (learning) parameter.
The algorithm is very similar (mutatis mutandis) to the one described in section 2.3. As the main memory consumption is given by the requirement to store the weight matrix (especially true for large ANN), the momentum algorithm requires double the amount of memory of standard backpropagation, to store ΔW for the next step.
✍ Remarks:
➥ When choosing the momentum parameter the following results have to be considered:
• if μ > 1 then the contribution of each Δw_{ℓji} grows infinitely;
• if μ ≈ 0 then the momentum contribution is insignificant;
so μ should be chosen somewhere in [0.5, 1) (in practice, usually μ ≈ 0.9).
➥ The momentum method assumes that the error gradient slowly decreases when approaching the absolute minimum. If this is not the case then the algorithm may jump over it.
➥ Another improvement over momentum is the flat spot elimination. If the error surface is very flat then ∇E ≈ 0̃ and subsequently ΔW ≈ 0̃. This may lead to very slow learning due to the increased number of training steps required. To avoid this problem, a change to the calculation of the error gradient (2.6b) may be performed as follows:
  (2.6b) → (∇E)_{ℓ-pseudo} = {∇_{z_ℓ}E ⊙ [f′(a_ℓ) + c_f 1̂]} z_{ℓ−1}ᵀ

where (∇E)_{ℓ-pseudo} is no longer the real (∇E)_ℓ. The constant c_f is named the flat spot elimination constant. Several points are to be noted here:
• The procedure of adding a term to f′, instead of multiplying it, means that (∇E)_{ℓ-pseudo} is more affected where f′ is smaller, a desirable effect.
Figure 2.5: Learning with momentum. A contributing term from the previous step is added.
• The error gradient terms corresponding to weights close to the input layer are smaller than similar terms for weights closer to the output layer, because the effect of changing a weight gets attenuated when propagated through the layers. So another effect of c_f is a speed-up of the weight adaptation in the layers close to the input, again a desirable effect.
• The formulas (2.7c), (2.9b) and (2.10c) change in the same way:

  (∇E)_{ℓ-pseudo} = c {∇_{z_ℓ}E ⊙ [z_ℓ ⊙ (1̂ − z_ℓ) + c_f 1̂]} z_{ℓ−1}ᵀ    (2.12a)
  (∇̃E)_{ℓ-pseudo} = {∇_{z_ℓ}E ⊙ [f′(a_ℓ) + c_f 1̂]} z̃_{ℓ−1}ᵀ    (2.12b)
  (∇̃E)_{ℓ-pseudo} = c {∇_{z_ℓ}E ⊙ [z_ℓ ⊙ (1̂ − z_ℓ) + c_f 1̂]} z̃_{ℓ−1}ᵀ    (2.12c)
➥ In physical terms: the set of weights W may be thought of as a set of coordinates defining a point in the space ℝ^{N_W}. During learning, this point is moved towards reducing the error E. The momentum introduces an "inertia" proportional to μ, such that when changing direction under the influence of the ∇E "force" it has a tendency to keep the old direction of movement and to "overshoot" the point given by −η∇E. See figure 2.5.
➥ The momentum method assumes that if the weights have been moved in some direction, then this direction is good for the next steps as well and is kept as a trend: unwinding the weight adaptation over 2 steps (applying (2.11) twice, for t−1 and t) gives:

  ΔW(t) = −η ∇E|_{W(t)} − ημ ∇E|_{W(t−1)} + μ² ΔW(t−2)

and it can be seen that the contributions of the previous ΔW gradually disappear as the power of μ increases (as μ < 1).
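A one-step sketch of the rule (2.11) (Python/NumPy, hypothetical names; grad stands for ∇E|_{W(t)}):

    import numpy as np

    def momentum_update(W, grad, dW_prev, eta=0.5, mu=0.9):
        # dW(t) = -eta * grad + mu * dW(t-1), cf. (2.11)
        dW = -eta * grad + mu * dW_prev
        W += dW
        return dW            # stored for the next step

    # usage: dW = np.zeros_like(W) initially; then, per training step:
    # dW = momentum_update(W, grad_E, dW)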
2.5.2 Adaptive Backpropagation

See [BTW95] p. 50.

The main idea of this algorithm comes from the following observations:
• If the slope of the error surface is gentle then a big learning parameter could be used, to speed up learning over flat spot areas.
• If the slope of the error surface is steep then a small learning parameter should be used, to avoid overshooting the error minimum.
In general the slopes are gentle in some directions and steep in the others.
This algorithm is based on assigning individual learning rates to each weight w_{ℓji}, based on its previous behavior. This means that the learning constant η becomes a matrix of the same dimensions as W.
The learning rate is increased if the gradient kept its direction over the last two steps (i.e. it is likely to continue so) and decreased otherwise:

  η_{ℓji}(t) = { I η_{ℓji}(t−1)   if Δw_{ℓji}(t) Δw_{ℓji}(t−1) > 0
             { D η_{ℓji}(t−1)   if Δw_{ℓji}(t) Δw_{ℓji}(t−1) < 0    (2.13)

where I > 1 and D ∈ (0, 1). The I parameter is named the adaptive increasing factor and the D parameter is named the adaptive decreasing factor.
In matrix form, equation (2.13) may be written considering the matrix of Δw_{ℓji} sign changes, i.e. sign[ΔW(t) ⊙ ΔW(t−1)]:

  η(t) = {(I − D) sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ] + D 1̃} ⊙ η(t−1)    (2.14)
Proof. The problem is to build a matrix containing 1s corresponding to each Δw_{ℓji}(t) Δw_{ℓji}(t−1) > 0 and 0s for the rest. This matrix, multiplied by I, will be used to increase the corresponding η_{ℓji} elements. The complementary matrix will be used to modify the matching η_{ℓji} which have to be decreased.
The sign(ΔW(t) ⊙ ΔW(t−1)) matrix has elements consisting only of 1, 0 and −1. By adding 1̃ and taking the sign again, all 1 and 0 elements are transformed to 1, while the −1 elements are transformed to zero. So the desired matrix is

  sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ]

while its complementary is

  1̃ − sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ]

Then the updating formula for η finally becomes:

  η(t) = I η(t−1) ⊙ sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ] + D η(t−1) ⊙ {1̃ − sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ]}
✍ Remarks:
➥ {η_{ℓji}} is initialized with a constant η₀ and ΔW(t−1) = 0̃. The learning parameter matrix η is updated after each training session (considering the current ΔW(t)). For the rest, the same main algorithm as described in section 2.3 applies.
Note that after initialization, when η(0) = η₀ and ΔW(0) = 0̃, the first step will automatically lead to the increase η(1) = I η₀, so η₀ should be chosen accordingly.
➥ Also, this algorithm requires three times as much memory compared to standard backpropagation, to store η and ΔW for the next step, both being of the same size as W.
➥ If I = 1 and D = 1 then the effect of the algorithm is obviously void.
➥ In practice, I ∈ [1.1, 1.3] and D ≲ 1/I give the best results for a wide spectrum of applications.
➥ Note that sign(ΔW(t) ⊙ ΔW(t−1)) could be replaced by sign(ΔW(t)) ⊙ sign(ΔW(t−1)). This is a tradeoff between one floating point multiplication followed by a sign versus two sign operations followed by an integer multiplication; whichever is faster may depend on the actual system used.
Due to the fact that the next change is not exactly in the direction of the error gradient (because each component of ∇E is multiplied by a different constant), this technique may cause problems. These may be avoided by testing the output after an adaptation has taken place: if there is an increase in the output error then the adaptation should be rejected and the next step calculated with the classical method; the adaptation process can then be resumed at the next step.
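A sketch of the per-weight rate update (2.13)/(2.14) (Python/NumPy, hypothetical names; the I and D defaults are picked from the ranges recommended above):

    import numpy as np

    def adapt_rates(eta, dW, dW_prev, I=1.2, D=0.8):
        # s = sign[sign(dW * dW_prev) + 1]: 1 where the weight change kept
        # its direction (or was zero), 0 where it flipped, cf. (2.14)
        s = np.sign(np.sign(dW * dW_prev) + 1.0)
        return ((I - D) * s + D) * eta   # element-wise: I*eta or D*eta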
2.5.3 SuperSAB

See [BTW95] p. 51.

SuperSAB (Super Self-Adapting Backpropagation) is a combination of the momentum and adaptive backpropagation algorithms.
The algorithm uses adaptive backpropagation for the w_{ℓji} terms which continue to move in the same direction, and momentum for the others, i.e.:
• If Δw_{ℓji}(t) Δw_{ℓji}(t−1) > 0 then:

  η_{ℓji}(t) = I η_{ℓji}(t−1)
  Δw_{ℓji}(t+1) = −η_{ℓji}(t) ∂E/∂w_{ℓji}|_{W(t)}

the momentum being 0 because it is not necessary: the learning rate grows in geometrical progression due to the adaptive algorithm.
• If Δw_{ℓji}(t) Δw_{ℓji}(t−1) < 0 then:

  η_{ℓji}(t) = D η_{ℓji}(t−1)
  Δw_{ℓji}(t+1) = −η_{ℓji}(t) ∂E/∂w_{ℓji}|_{W(t)} − μ Δw_{ℓji}(t)

Note the "−" sign in front of μ, which is used to cancel the previous "wrong" weight adaptation (not to boost Δw_{ℓji} as in the momentum method); the corresponding η_{ℓji} is decreased to get smaller steps.
In matrix notation the SuperSAB rules are written as:

  η(t) = {(I − D) sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ] + D 1̃} ⊙ η(t−1)
  ΔW(t+1) = −η(t) ⊙ ∇E − μ ΔW(t) ⊙ {1̃ − sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ]}
Proof. The first equation comes directly from (2.14).
For the second equation, the matrix

  1̃ − sign[ sign(ΔW(t) ⊙ ΔW(t−1)) + 1̃ ]

contains, as elements, 1 where Δw_{ℓji}(t) Δw_{ℓji}(t−1) < 0 and zero for the rest, so the momentum terms are added exactly to those Δw_{ℓji} requiring them; see the proof of (2.14).
✍ Remarks:
➥ While this algorithm uses the same main algorithm as described in section 2.3 (of course with the required changes), note however that the memory requirement is four times higher than for standard backpropagation: supplementary storage is needed for η and two ΔW.
➥ Arguably, the matrix notation for this algorithm may be less beneficial in terms of speed. However, there is a benefit in splitting the implementation effort into two levels: a lower one dealing with matrix operations and a higher one dealing with the implementation of the algorithm itself. Beyond this, an efficient implementation of the matrix operations may already be developed for the targeted system (e.g. an efficient matrix multiplication algorithm may be several times faster than the classical one when written specifically for the system used: for a fast implementation of matrix multiplication on a RISC processor, with an 8-fold speed increase, see [Mos97]; for a multi-threaded matrix multiplication see [McC97]; there is also the possibility of taking advantage of the hardware, threaded matrix operations on multiprocessor systems, etc.).

All the algorithms presented here may be seen as predictors (of error surface features), from the simple momentum to the more sophisticated SuperSAB. Based on the previous behaviour of the error gradient, they try to predict its future behaviour and change the learning path accordingly.
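A sketch of one SuperSAB step in the matrix form above (Python/NumPy, hypothetical names; dW1 stands for ΔW(t) and dW2 for ΔW(t−1)):

    import numpy as np

    def supersab_step(W, grad, dW1, dW2, eta, I=1.2, D=0.8, mu=0.9):
        s = np.sign(np.sign(dW1 * dW2) + 1.0)    # 1 where the direction was kept
        eta = ((I - D) * s + D) * eta            # adaptive rates, cf. (2.14)
        dW = -eta * grad - mu * dW1 * (1.0 - s)  # momentum cancels "wrong" moves
        W += dW
        return dW, eta    # caller keeps (dW, dW1) as the next (dW1, dW2)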
➧ 2.6 Applications

2.6.1 Identity Mapping Network

See [BTW95] pp. 48–49.

The network consists of 1 input, 1 hidden and 1 output neuron, with 2 weights: w₁ and w₂. See figure 2.6.
This particular network, while of little practical usage, presents some interesting features:
• there are only 2 weights, so it is possible to visualize the error surface exactly (see figure 2.7);
• the error surface has a local maximum and a local minimum; note that if the weights are "trapped" in the local minimum, standard backpropagation can't move forward, as ∇E becomes zero there.
The problem is to configure w₁ and w₂ such that the identity mapping is realized for binary input.
Figure 2.6: The identity mapping network.

Figure 2.7: The error surface for the identity mapping network, featuring a local maximum and a local minimum.
The output of the input neuron is x_p = z_{01} (by notation). The output of the hidden neuron z_{11} is (see (2.3)):

  z_{11} = 1 / [1 + exp(−c w₁ z_{01})]

The output of the output neuron is:

  z_{21} = 1 / [1 + exp(−c w₂ z_{11})] = 1 / [1 + exp( −c w₂ / (1 + exp(−c w₁ z_{01})) )]

The identity mapping network tries to map its input to its output, i.e.:

  for x₁ = z_{01} = 0 ⇒ t₁(z_{01}) = 0
  for x₂ = z_{01} = 1 ⇒ t₂(z_{01}) = 1

The square mean error is (2.4), where P = 2 and N_L = 1:

  E_tot.(w₁, w₂) = (1/2) Σ_{p=1}^{P} Σ_{q=1}^{1} [z_{2q}(x_p) − t_q(x_p)]²
     = (1/2) { [ 1 / (1 + exp(−c w₂ / 2)) ]² + [ 1 / (1 + exp( −c w₂ / (1 + exp(−c w₁)) )) − 1 ]² }

For c = 1 the error surface is shown in figure 2.7. The surface has a local minimum and a local maximum.
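The error surface of figure 2.7 may be reproduced by evaluating this expression over a grid; a direct transcription (Python/NumPy, hypothetical function name):

    import numpy as np

    def E_tot(w1, w2, c=1.0):
        # total error of the identity mapping network for inputs 0 and 1
        z21_0 = 1 / (1 + np.exp(-c * w2 * 0.5))    # input 0: z11 = 1/2, target 0
        z11_1 = 1 / (1 + np.exp(-c * w1))          # input 1: hidden output
        z21_1 = 1 / (1 + np.exp(-c * w2 * z11_1))  # input 1: target 1
        return 0.5 * (z21_0**2 + (z21_1 - 1)**2)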
2.6.2 The Encoder

This network is also an identity mapping ANN (the targets are the same as the inputs), with a single hidden layer which is smaller in size than the input/output layers. See figure 2.8.

Figure 2.8: The 4-2-4 encoder: 4 inputs, 2 hidden, 4 outputs.

Beyond any possible practical applications, this network shows the following:
• the architecture of a backpropagation ANN may be important with respect to its purpose;
• the output of a hidden layer is not necessarily meaningless;
• the importance of biases.
The input vectors and targets are:

  x₁ = (1 0 0 0)ᵀ , x₂ = (0 1 0 0)ᵀ , x₃ = (0 0 1 0)ᵀ , x₄ = (0 0 0 1)ᵀ , t_i ≡ x_i , i = 1, …, 4

The idea is that the inputs have to be "squeezed" through the bottleneck represented by the hidden layer, before being reproduced at the output. The network has to find a way to encode the 4-component vectors on a 2-component vector, the output of the hidden layer. Obviously the encoding is given by:

  z₁ = (0 0)ᵀ , z₂ = (0 1)ᵀ , z₃ = (1 0)ᵀ , z₄ = (1 1)ᵀ
Note that one of the x_i vectors will be encoded by the null vector (0 0)ᵀ. On a network without biases this means that the output layer will receive a total input a = 0̂ and, considering the logistic activation function, the corresponding output will always be yᵀ = (0.5 0.5 0.5 0.5), this particular output being weight-independent. One of the input vectors may never be learned by the encoder. In practice usually (but not always) the net will enter an oscillation, trying to learn two vectors on one encoding, so there will be two unlearned vectors. When using biases this does not happen, as the output layer will always receive something weight-dependent.
An ANN was trained with the following parameters:
• learning parameter η = 2.5
• momentum μ = 0.9
• flat spot elimination c_f = 0.25
and after 200 epochs the outputs of the hidden layer became:

  z₁ = (1 0)ᵀ , z₂ = (0 0)ᵀ , z₃ = (1 1)ᵀ , z₄ = (0 1)ᵀ

with the corresponding outputs:

  y₁ = (0.9975229, 0.0047488, 0.0015689, 1.876·10⁻⁸)ᵀ
  y₂ = (0.0000140, 0.9929489, 8.735·10⁻⁹, 0.0045821)ᵀ
  y₃ = (0.0020816, 7.241·10⁻¹¹, 0.997392, 0.0010320)ᵀ
  y₄ = (7.227·10⁻¹¹, 0.0000021, 0.0021213, 0.9960705)ᵀ

✍ Remarks:
➥ The encoders with N₁ = log₂ N₀ are called tight encoders, those with N₁ > log₂ N₀ are loose and those with N₁ < log₂ N₀ are supertight.
➥ It is possible to train a loose encoder on an ANN without biases, as the null vector doesn't have to be among the outputs of the hidden neurons.
CHAPTER 4 — CHAPTER 3

The SOM/Kohonen Network

The Kohonen network represents an example of an ANN with unsupervised learning.

➧ 3.1 Network Structure

See [BTW95] pp. 83–89 and [Koh88] pp. 119–124.

A SOM (Self-Organizing Map, also known as Kohonen) network has one single layer; let us name it the output layer. The additional input ("sensory") layer just distributes the inputs to the output layer; there is no data processing in it. Within the output layer a lateral (feedback) interaction is provided (see also section 3.3). The number of neurons in the input layer is N (equal to the dimension of the input vector); for the output layer it is K. See figure 3.1.

✍ Remarks:
➥ Here the output layer has been considered unidimensional. Taking into account the "mapping" feature of the Kohonen networks, the output layer may also be considered multidimensional (more convenient for some particular applications).
➥ A multidimensional output layer may be trivially mapped to a unidimensional one, and the discussion below will remain the same. E.g. a bidimensional K × K layer may be mapped to a unidimensional layer having K² neurons just by finding a function f : K × K → K² to do the relabeling/numbering of the neurons. Such a function may be e.g. f(j, ℓ) = (ℓ − 1)K + j, which maps the first row of neurons (1, 1) … (1, K) to the first unidimensional chunk of K neurons, and so on (j, ℓ = 1, …, K).
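A sketch of such a relabeling (Python; the function name is hypothetical):

    def relabel(j, l, K):
        # f(j, l) = (l - 1) * K + j: neuron (j, l) of a K x K layer gets a
        # single index in 1 .. K^2; the chunk l = 1 goes to indices 1 .. K
        return (l - 1) * K + j

    # e.g. for K = 8: relabel(1, 1, 8) == 1, relabel(8, 1, 8) == 8,
    #                 relabel(1, 2, 8) == 9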
Figure 3.1: The Kohonen network structure.

Figure 3.2: The lateral feedback interaction function of the "mexican hat" type (feedback strength versus the distance between output neurons).

The important thing is that all output neurons receive all the components of the input vector, and a little bit of care has to be taken when establishing the neuronal neighborhood (see below).
In general, the lateral feedback interaction is indirect (i.e. neurons do not receive the inputs of their neighbors) and of the "mexican hat" type. See figure 3.2. The closest neurons receive a positive feedback, the more distant ones receive a negative feedback and the far away ones are not affected.
✍ Remarks:
➥ The distance between neighboring neurons in the output layer is (obviously) a discrete one. It may be defined as being 0 for the neuron itself (auto-feedback), 1 for the closest neighbors, 2 for the next ones, and so on. On multidimensional output layers there are several choices, the most obvious one being the Euclidean distance.
➥ The feedback function determines the quantity by which the weights of the neuron's neighbors are updated during the learning process (as well as which weights are updated).
➥ The area affected by the lateral feedback is named the neuronal neighborhood.
➥ For sufficiently large neighborhoods the distance may be considered continuous when carrying out some types of calculations.

➧ 3.2 Types of Neuronal Learning

See [Koh88] pp. 92–98.

3.2.1 The Learning Process
Let W = {w_{ji}}_{j=1,…,K; i=1,…,N} be the weight matrix and x = {x_i}_{i=1,…,N} be the input vector; the total input to the (output) layer is a = Wx. Note that each row of W represents the weights associated with one neuron and may be seen as a vector W(j,:) of the same size, and from the same space ℝᴺ, as the input vector x. ℝᴺ is named the weight space.
When an input vector x is presented to the network, the neuron having its associated weight vector W(k,:) closest to x, i.e. the one for which:

  ‖W(k,:)ᵀ − x‖ = min_{j=1,…,K} ‖W(j,:)ᵀ − x‖

is declared the "winner". All neurons included in its vicinity (neuronal neighborhood), including itself, will participate in the "learning" of x. The other ones are not affected.
The learning process consists of changing the weight vectors W(j,:) towards the input vector (positive feedback). There is also a "forgetting" process which tries to slow down the progress (negative feedback).
It can be immediately seen why the feedback is indirect: the neurons are affected by being in the neuronal neighborhood of the winner, not by receiving the winner's output directly.
Considering a linear learning (changes are restricted to occur only in the direction of a linear combination of x and W(j,:), for each neuron), then:

  dW/dt = φ(x, W) − γ(x, W)

where φ and γ are scalar (possibly nonlinear) functions, φ representing the positive feedback and γ the negative one. These two functions have to be built in such a way as to affect only the neuronal neighborhood of the winning neuron k, which varies in time as the input vector is a function of time, x = x(t). Note that here the winning neuron appears implicitly, as it may be determined from x and W.
Various adaptation models (differential equations for W) can be built for the neurons (of the output layer) of the Kohonen network. Some of the simpler ones, which may be analyzed (at least to some extent) analytically, are discussed in the following sections. To simplify the discussion further, it will be considered (at this stage) that the neuronal neighborhood is sufficiently large to contain the whole network. Later it will be shown how to limit the weight change/adaptation to the targeted neurons by using the lateral feedback function.
3.2.2 The Trivial Equation

One of the simplest equations is the linear differential one:

  dW(j,:)/dt = α xᵀ − β W(j,:) , α, β > 0 , α, β = const. , j = 1, …, K

which in matrix form becomes:

  dW/dt = α 1̂ xᵀ − β W    (3.1)

and with the initial condition W(0) ≡ W₀ the solution is:

  W(t) = [ α 1̂ ( ∫₀ᵗ xᵀ(t′) e^{βt′} dt′ ) + W₀ ] e^{−βt}

which shows that for t → ∞, W(j,:) is an exponentially weighted average of x(t) and does not produce any interesting effects.
Proof. The equation is solved by the method of variation of parameters. First, the homogeneous equation:

  dW/dt + β W = 0

has a solution of the form:

  W(t) = C e^{−βt} , C = matrix of constants

The general solution of the nonhomogeneous equation (3.1) is found by considering C = C(t). Then:

  dW/dt = (dC/dt) e^{−βt} − β C(t) e^{−βt}

and by replacing in (3.1) it gives:

  (dC/dt) e^{−βt} − β C(t) e^{−βt} = α 1̂ xᵀ(t) − β C(t) e^{−βt}
  ⇒ dC/dt = α 1̂ xᵀ(t) e^{βt}
  ⇒ C(t) = α 1̂ ∫₀ᵗ xᵀ(t′) e^{βt′} dt′ + C₀ (C₀ = matrix of constants)

and, at t = 0, W(0) = C₀ ≡ W₀.
3.2.3 The Simple Equation

The simple equation is defined as:

  dW(j,:)/dt = α a_j(t) xᵀ − β W(j,:) , α, β > 0 , α, β = const. , j = 1, …, K

and in matrix notation it becomes:

  dW/dt = α a(t) xᵀ − β W

and consequently, as a = Wx:

  dW/dt = W (α x xᵀ − β I)
In the time-discrete approximation:

  dW/dt → ΔW/Δt = [W(t+1) − W(t)] / 1 = W(t) [α x(t) xᵀ(t) − β I]
  ⇒ W(t+1) = W(t) [α x(t) xᵀ(t) − β I + I] , t ∈ ℕ⁺ , W(0) ≡ W₀ (initial condition)

so the general solution is:

  W(t) = W₀ ∏_{t′=0}^{t−1} [α x(t′) xᵀ(t′) − β I + I]    (3.2)

✍ Remarks:
➥ In most cases the solution (3.2) is either divergent or converges to zero, both cases being unacceptable. However, for a relatively short time, the simple equation may approximate a more complicated, asymptotically stable, process.
For t or α relatively small, such that the superior order terms O(α²) may be neglected, and considering β = 0 (no "forgetting" effect), then from (3.2):

  W(t) ≃ W₀ [ I + α Σ_{t′=0}^{t−1} x(t′) xᵀ(t′) ]

3.2.4 The Riccati Equation
The Riccati equation is defined as:

  dW(j,:)/dt = α xᵀ − β a_j W(j,:) , α, β > 0 , α, β = const. , j = 1, …, K    (3.3)

and after the replacement a_j = W(j,:) x = xᵀ W(j,:)ᵀ (note that [W(j,:)]ᵀ ≡ W(j,:)ᵀ for brevity), it becomes:

  dW(j,:)/dt = xᵀ [α I − β W(j,:)ᵀ W(j,:)]    (3.4)

or in matrix notation:

  dW/dt = α 1̂ xᵀ − β (W x 1̂ᵀ) ⊙ W    (3.5)

Equation (3.3) may also be written as dW/dt = α 1̂ xᵀ − β (a 1̂ᵀ) ⊙ W, with a = Wx.
For a general x = x(t), the Riccati equation is not directly integrable (of course beyond the trivial xᵀ = W(j,:) = 0̂ᵀ). However, a statistical approach may be performed.
Proposition 3.2.1. Considering a statistical approach to the Riccati equation (3.4): if there is a solution, i.e. lim_{t→∞} W exists, then it is of the form:

  lim_{t→∞} W = √(α/β) 1̂ ⟨x⟩ᵀ / ‖⟨x⟩‖   if ⟨x⟩ ≠ 0̂

where ⟨x⟩ = E{x|W} = const., independent of W and time; i.e. all W(j,:) become parallel with ⟨x⟩ in ℝᴺ and will have the norm ‖W(j,:)‖ = √(α/β) (the Euclidean metric being used here).
Proof. As x and W(j,:) may be seen as vectors in the ℝᴺ space, let θ be the angle between them. From the scalar product, cos θ is:

  cos θ = W(j,:) x / (‖W(j,:)‖ ‖x‖)

where ‖x‖² = xᵀx and ‖W(j,:)‖² = W(j,:) W(j,:)ᵀ, the Euclidean metric being used here.
When bringing x under the E{·|W} operator, as x is obviously independent of W (it is the input vector), it goes to x → ⟨x⟩. Then the expected value of d cos θ/dt is:

  E{ d cos θ/dt | W } = E{ [dW(j,:)/dt] ⟨x⟩ / (‖W(j,:)‖ ‖⟨x⟩‖) − [W(j,:) ⟨x⟩] (d‖W(j,:)‖/dt) / (‖W(j,:)‖² ‖⟨x⟩‖) | W }    (3.6)

❐ Term [dW(j,:)/dt] ⟨x⟩ / (‖W(j,:)‖ ‖⟨x⟩‖):
First, [dW(j,:)/dt] ⟨x⟩ is found (⟨x⟩ being time independent) by multiplying (3.4) to the right by x:

  [dW(j,:)/dt] x = α xᵀ x − β xᵀ W(j,:)ᵀ W(j,:) x = α ‖x‖² − β [W(j,:) x]²

(as for two matrices (AB)ᵀ = BᵀAᵀ holds), and then this term becomes:

  [dW(j,:)/dt] ⟨x⟩ / (‖W(j,:)‖ ‖⟨x⟩‖) = ( α ‖⟨x⟩‖² − β [W(j,:) ⟨x⟩]² ) / (‖W(j,:)‖ ‖⟨x⟩‖)

(as x → ⟨x⟩ under the E{·|W} operator).
❐ Term [W(j,:) ⟨x⟩] (d‖W(j,:)‖/dt) / (‖W(j,:)‖² ‖⟨x⟩‖):
First the derivative d‖W(j,:)‖/dt is found as follows:

  d‖W(j,:)‖²/dt = 2 ‖W(j,:)‖ d‖W(j,:)‖/dt = 2 [dW(j,:)/dt] W(j,:)ᵀ   (because ‖W(j,:)‖² = W(j,:) W(j,:)ᵀ)
  ⇒ d‖W(j,:)‖/dt = [dW(j,:)/dt] W(j,:)ᵀ / ‖W(j,:)‖

and by using (3.4):

  d‖W(j,:)‖/dt = [α xᵀ − β xᵀ W(j,:)ᵀ W(j,:)] W(j,:)ᵀ / ‖W(j,:)‖
    = [α xᵀ W(j,:)ᵀ − β xᵀ W(j,:)ᵀ W(j,:) W(j,:)ᵀ] / ‖W(j,:)‖
    = W(j,:) x (α − β ‖W(j,:)‖²) / ‖W(j,:)‖    (3.7)

(as xᵀ W(j,:)ᵀ = W(j,:) x and W(j,:) W(j,:)ᵀ ≡ ‖W(j,:)‖²). By replacing back into the wanted term, and as x → ⟨x⟩:

  [W(j,:) ⟨x⟩] (d‖W(j,:)‖/dt) / (‖W(j,:)‖² ‖⟨x⟩‖) = [W(j,:) ⟨x⟩]² (α − β ‖W(j,:)‖²) / (‖W(j,:)‖³ ‖⟨x⟩‖)
Replacing back into (3.6) gives:

  E{ d cos θ/dt | W } = E{ ( α ‖⟨x⟩‖² − β [W(j,:)⟨x⟩]² ) / (‖W(j,:)‖ ‖⟨x⟩‖) − [W(j,:)⟨x⟩]² (α − β ‖W(j,:)‖²) / (‖W(j,:)‖³ ‖⟨x⟩‖) | W }
    = α E{ ( ‖⟨x⟩‖² ‖W(j,:)‖² − [W(j,:)⟨x⟩]² ) / (‖W(j,:)‖³ ‖⟨x⟩‖) | W }
    = α (‖⟨x⟩‖ / ‖W(j,:)‖) E{ 1 − cos²θ | W } = α (‖⟨x⟩‖ / ‖W(j,:)‖) E{ sin²θ | W }    (3.8)

The existence of lim_{t→∞} W means that θ stabilizes in time, and then the limit of its derivative is zero and the expected value of the derivative is also zero (as it will remain zero after reaching the limit):

  lim_{t→∞} d cos θ/dt = 0 = E{ d cos θ/dt | W }

By using (3.8), it follows immediately that E{sin²θ|W} = 0 and then E{θ|W} = 0, i.e. all E{W(j,:)|W} are parallel to ⟨x⟩.
The norm of W(j,:) is found from (3.7). If lim_{t→∞} W(j,:) does exist then the expectation of d‖W(j,:)‖/dt has to be zero:

  E{ d‖W(j,:)‖/dt | W } = E{ W(j,:) x (α − β ‖W(j,:)‖²) / ‖W(j,:)‖ | W } = 0

but as W(j,:) ≠ 0̂ᵀ, this may happen only if:

  E{ α − β ‖W(j,:)‖² | W } = 0 ⇒ (E{‖W(j,:)‖})² = α/β

Finally, combining all the previously obtained results:

  lim_{t→∞} cos(W(j,:), ⟨x⟩) = 1 ⇒ lim_{t→∞} W(j,:)ᵀ/‖W(j,:)‖ = ⟨x⟩/‖⟨x⟩‖ , lim_{t→∞} ‖W(j,:)‖ = √(α/β)
  ⇒ lim_{t→∞} W(j,:) = √(α/β) ⟨x⟩ᵀ/‖⟨x⟩‖

3.2.5 More General Equations

See [Koh88] pp. 98–101.
Theorem 3.2.1. Let α > 0, a = Wx and γ(a) be an arbitrary function such that E{γ(a)|W} exists. Let x = x(t) be a vector with stationary statistical properties (and independent of W).
Then, if a learning model (process) of the type:

  dW(j,:)/dt = α xᵀ − γ(a_j) W(j,:) , j = 1, …, K

or, in matrix notation:

  dW/dt = α 1̂ xᵀ − [γ(a) 1̂ᵀ] ⊙ W    (3.9)

has nonzero bounded W solutions for t → ∞, then they must be of the form:

  lim_{t→∞} W ∝ 1̂ ⟨x⟩ᵀ

where ⟨x⟩ is the mean of x(t); i.e. all W(j,:) become parallel to ⟨x⟩ in ℝᴺ.
Proof. Let θ be the angle between ⟨x⟩ and W(j,:) in ℝᴺ; then, from the scalar product: cos θ = W(j,:)⟨x⟩ / (‖W(j,:)‖ ‖⟨x⟩‖). E{d cos θ/dt | W} is calculated in a similar way as in the proof of proposition 3.2.1:

  E{ d cos θ/dt | W } = E{ [dW(j,:)/dt] ⟨x⟩ / (‖W(j,:)‖ ‖⟨x⟩‖) − [W(j,:)⟨x⟩] (d‖W(j,:)‖/dt) / (‖W(j,:)‖² ‖⟨x⟩‖) | W }    (3.10)

Multiplying (3.9) by x, to the right, gives:

  [dW(j,:)/dt] x = α xᵀ x − γ(a_j) W(j,:) x = α ‖x‖² − γ(a_j) [W(j,:) x]    (3.11)

The d‖W(j,:)‖/dt derivative is calculated in a similar way as in the proof of proposition 3.2.1:

  d‖W(j,:)‖/dt = [dW(j,:)/dt] W(j,:)ᵀ / ‖W(j,:)‖

and then, by using (3.9):

  d‖W(j,:)‖/dt = α xᵀ W(j,:)ᵀ / ‖W(j,:)‖ − γ(a_j) W(j,:) W(j,:)ᵀ / ‖W(j,:)‖ = α [W(j,:) x] / ‖W(j,:)‖ − γ(a_j) ‖W(j,:)‖    (3.12)

The results (3.11) and (3.12) are used in (3.10) (and also x → ⟨x⟩) to give:

  E{ d cos θ/dt | W } = E{ ( α ‖⟨x⟩‖² − γ(a_j) [W(j,:)⟨x⟩] ) / (‖W(j,:)‖ ‖⟨x⟩‖) − [W(j,:)⟨x⟩] ( α [W(j,:)⟨x⟩]/‖W(j,:)‖ − γ(a_j) ‖W(j,:)‖ ) / (‖W(j,:)‖² ‖⟨x⟩‖) | W }

and, after simplification, it becomes:

  E{ d cos θ/dt | W } = α E{ ( ‖⟨x⟩‖² ‖W(j,:)‖² − [W(j,:)⟨x⟩]² ) / (‖W(j,:)‖³ ‖⟨x⟩‖) | W }

The existence of lim_{t→∞} W means that θ stabilizes in time and then lim_{t→∞}(d cos θ/dt) = 0, and the expected value is zero as well: E{d cos θ/dt | W} = 0. But this may happen only if:

  E{ ‖⟨x⟩‖² ‖W(j,:)‖² − [W(j,:)⟨x⟩]² | W } = 0 ⇔ E{ W(j,:)⟨x⟩ / (‖W(j,:)‖ ‖⟨x⟩‖) | W } = 1 ⇔ E{ cos θ | W } = 1

i.e. lim_{t→∞} θ = 0 and then W(j,:) and ⟨x⟩ become parallel for t → ∞, i.e. lim_{t→∞} W(j,:) ∝ ⟨x⟩ᵀ.
Theorem 3.2.2. Let α > 0, a = Wx and γ(a) be an arbitrary function such that E{γ(a)|W} exists. Let ⟨xxᵀ⟩ = E{xxᵀ|W} (in fact xxᵀ does not depend on W, as it is the covariance matrix of the input vector). Let λ_max = max_ℓ λ_ℓ be the maximum eigenvalue of ⟨xxᵀ⟩ and u_max the associated eigenvector.
Then, if a learning model (process) of the type:

  dW(j,:)/dt = α a_j xᵀ − γ(a_j) W(j,:)    (3.13)

or, in matrix notation:

  dW/dt = α a xᵀ − [γ(a) 1̂ᵀ] ⊙ W

has nonzero bounded W solutions for t → ∞, they have to be of the form:

  lim_{t→∞} W ∝ 1̂ u_maxᵀ

provided that W u_max ≠ 0̂, where W(0) ≡ W₀; i.e. all W(j,:) become parallel to u_max in ℝᴺ.
Proof. Let λ_ℓ be an eigenvalue and u_ℓ the corresponding eigenvector of ⟨xxᵀ⟩, such that ⟨xxᵀ⟩u_ℓ = λ_ℓ u_ℓ. Let θ_ℓ be the angle between W(j,:) and u_ℓ, such that cos θ_ℓ = W(j,:) u_ℓ / (‖W(j,:)‖ ‖u_ℓ‖). E{d cos θ_ℓ/dt | W} is calculated the same way as in the proof of theorem 3.2.1:

  E{ d cos θ_ℓ/dt | W } = E{ [dW(j,:)/dt] u_ℓ / (‖W(j,:)‖ ‖u_ℓ‖) − [W(j,:) u_ℓ] (d‖W(j,:)‖/dt) / (‖W(j,:)‖² ‖u_ℓ‖) | W }    (3.14)

Note that xxᵀ → ⟨xxᵀ⟩ when passing under the E{·|W} operator.
From (3.13), knowing that a_j = W(j,:) x:

  dW(j,:)/dt = α W(j,:) x xᵀ − γ(a_j) W(j,:)

then, multiplying by u_ℓ to the right and knowing that ⟨xxᵀ⟩u_ℓ = λ_ℓ u_ℓ, it follows that:

  [dW(j,:)/dt] u_ℓ = α W(j,:) x xᵀ u_ℓ − γ(a_j) W(j,:) u_ℓ
    → (under E{·|W}) α W(j,:) ⟨xxᵀ⟩ u_ℓ − γ(a_j) W(j,:) u_ℓ = (α λ_ℓ − γ(a_j)) [W(j,:) u_ℓ]    (3.15)

The other required term is d‖W(j,:)‖/dt, which again is calculated in a similar way as in the proof of theorem 3.2.1:

  d‖W(j,:)‖/dt = [dW(j,:)/dt] W(j,:)ᵀ / ‖W(j,:)‖

and, by using (3.13), a_j = W(j,:) x and W(j,:) W(j,:)ᵀ = ‖W(j,:)‖²:

  d‖W(j,:)‖/dt = α W(j,:) x xᵀ W(j,:)ᵀ / ‖W(j,:)‖ − γ(a_j) W(j,:) W(j,:)ᵀ / ‖W(j,:)‖
    → (under E{·|W}) α W(j,:) ⟨xxᵀ⟩ W(j,:)ᵀ / ‖W(j,:)‖ − γ(a_j) ‖W(j,:)‖    (3.16)
Replacing the results (3.15) and (3.16) back into (3.14) gives:

  E{ d cos θ_ℓ/dt | W } = E{ (α λ_ℓ − γ(a_j)) [W(j,:) u_ℓ] / (‖W(j,:)‖ ‖u_ℓ‖) − [W(j,:) u_ℓ] ( α W(j,:)⟨xxᵀ⟩W(j,:)ᵀ/‖W(j,:)‖ − γ(a_j) ‖W(j,:)‖ ) / (‖W(j,:)‖² ‖u_ℓ‖) | W }

and, after simplification, it may be written as:

  E{ d cos θ_ℓ/dt | W } = α E{ ( λ_ℓ − W(j,:)⟨xxᵀ⟩W(j,:)ᵀ / ‖W(j,:)‖² ) [W(j,:) u_ℓ] / (‖W(j,:)‖ ‖u_ℓ‖) | W }

Let us take u_ℓ = u_max and the corresponding λ_ℓ = λ_max. The above formula becomes:

  E{ d cos θ_max/dt | W } = α E{ ( λ_max − W(j,:)⟨xxᵀ⟩W(j,:)ᵀ / ‖W(j,:)‖² ) [W(j,:) u_max] / (‖W(j,:)‖ ‖u_max‖) | W }

The existence of lim_{t→∞} W means that θ_max stabilizes in time, thus lim_{t→∞} d cos θ_max/dt = 0, and so is its expected value. As W(j,:) u_max ≠ 0:

  E{ λ_max − W(j,:)⟨xxᵀ⟩W(j,:)ᵀ / ‖W(j,:)‖² | W } = 0 ⇔ E{ W(j,:)⟨xxᵀ⟩W(j,:)ᵀ / ‖W(j,:)‖² | W } = λ_max    (3.17)

the equality being possible only for λ_max, in accordance with the Rayleigh quotient (see the mathematical appendix).
As the matrix xxᵀ is symmetric (and so is ⟨xxᵀ⟩, i.e. ⟨xxᵀ⟩ = ⟨xxᵀ⟩ᵀ), an orthogonal set of eigenvectors may be built (see the mathematical appendix). A transformation of coordinates to the system {u_ℓ}_{ℓ=1,…,N} may be performed by using the matrix U built from the set of eigenvectors as columns (and then UᵀU = I, as u_ℓᵀ u_k = δ_{ℓk}, δ_{ℓk} being the Kronecker symbol). Then W(j,:)ᵀ → W′(j,:)ᵀ = U W(j,:)ᵀ, W(j,:) → W′(j,:) = W(j,:) Uᵀ, and also:

  ‖W(j,:)‖² → ‖W′(j,:)‖² = W(j,:) Uᵀ U W(j,:)ᵀ = W(j,:) I W(j,:)ᵀ = ‖W(j,:)‖²

and W′(j,:) may be represented as a linear combination of {u_ℓ}:

  W′(j,:) = Σ_ℓ ω_ℓ u_ℓᵀ

ω_ℓ being the coefficients of the linear combination (u_ℓ appears transposed because W(j,:) is a row matrix).
Knowing that Uᵀ⟨xxᵀ⟩U is a diagonal matrix with the eigenvalues on the main diagonal (all the other elements being zero; see again the mathematical appendix) and using again the orthogonality of {u_ℓ} (i.e. u_ℓᵀ u_k = δ_{ℓk}):

  W′(j,:) ⟨xxᵀ⟩ W′(j,:)ᵀ = Σ_ℓ λ_ℓ ω_ℓ²   and   ‖W′(j,:)‖² = Σ_ℓ ω_ℓ²

Replacing back into (3.17) (with W′(j,:) replacing W(j,:)) gives:

  E{ ( Σ_ℓ λ_ℓ ω_ℓ² ) / ( Σ_ℓ ω_ℓ² ) | W } = λ_max

which may happen only if all ω_ℓ → 0 except the ω_max corresponding to u_max, i.e. lim_{t→∞} W(j,:) = ω_max u_maxᵀ ∝ u_maxᵀ.
At first glance, the condition W u_max ≠ 0̂ at all t, met in theorem 3.2.2, seems to be very hard. In fact it is, but in practice it is of smaller importance, and this deserves a discussion.
First, the initial value of W, let W(0) ≡ W₀ be that one, should be chosen such that W₀ u_max ≠ 0̂. But W(j,:) u_max ≠ 0 means that W(j,:) is not perpendicular to u_max, i.e. W(j,:) is not contained in the hyperplane P_{⊥u_max} perpendicular to u_max (in ℝᴺ). Even a random selection of W₀ would have good chances to satisfy this condition, even more so as u_max is seldom known exactly in practice, being dependent on the stochastic input variable x (an exact knowledge of u_max would mean knowledge of an infinite series of x(t)).
Second, theorem 3.2.2 says that, statistically, the W(j,:) vectors will move towards either u_max or −u_max, depending upon which side of P_{⊥u_max} W₀(j,:) lies, i.e. away from P_{⊥u_max}. However, the proof is statistical in nature and there is a small but finite probability that, at some t, W(j,:)(t) falls into P_{⊥u_max}. What happens then, (3.15) tells us: as W(j,:) u_max = 0 then [dW(j,:)/dt] u_max = 0, and this means that all further changes in W(j,:) are contained in P_{⊥u_max}, i.e. W(j,:) becomes trapped in P_{⊥u_max}.
The conclusion is then that the condition W u_max ≠ 0̂, ∀t, may be neglected, with the remark that there is a small probability that some W(j,:) weight vectors may become trapped, and then the learning will be incomplete.
➧ 3.3 Network Dynamics

3.3.1 Network Running Function

As previously discussed, x, W(j,:)ᵀ ∈ ℝᴺ, the weight space.
For each input vector x, the neuron k for which:

  ‖x − W(k,:)ᵀ‖ = min_{j=1,…,K} ‖x − W(j,:)ᵀ‖

is declared the winner, i.e. the one for which the associated weight vector W(k,:) is closest to x (in the weight space). The winner is used to decide which weights get changed using the current input vector x. All and only the neurons found in the winner's neighborhood participate in the learning, i.e. will have their weights changed/adapted. All other weights remain unchanged at this stage; later, a new input vector may change the winner and thus the area of change.
✍ Remarks:
➥ ‖x − W(j,:)ᵀ‖ is the (mathematical) distance between the vectors x and W(j,:)ᵀ. This distance is user-definable, but the most used one is the Euclidean: √( Σ_{i=1}^{N} (x_i − w_{ji})² ).
➥ The learning of the network is unsupervised, i.e. there are no targets.
➥ If the input vectors x and the weight vectors {W(j,:)}_{j=1,…,K} are normalized, ‖x‖ = ‖W(j,:)‖ (to the same value, not necessarily 1), e.g. in a Euclidean space:

  √( Σ_{i=1}^{N} x_i² ) = √( Σ_{i=1}^{N} w_{ji}² ) , j = 1, …, K

i.e. x(t) and W(j,:) are points on a hyper-sphere in ℝᴺ, then the dot product can be used to find the matching. The winner neuron is the k for which:

  W(k,:) x = max_{j=1,…,K} W(j,:) x

i.e. the winner is the neuron whose weight vector W(k,:) points in the direction closest to the one the input vector x points to.
This operation is a little faster, as it skips a subtraction operation of the type x − W(j,:)ᵀ; however, it requires the normalization of x and W(j,:), which is not always desirable in practice.
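Both matching rules are easy to express directly (a Python/NumPy sketch, hypothetical names):

    import numpy as np

    def find_winner(W, x):
        # W is K x N, row j being W(j,:); returns the index k minimizing
        # the Euclidean distance ||x - W(j,:)^T||
        return np.argmin(np.linalg.norm(W - x, axis=1))

    def find_winner_normalized(W, x):
        # for normalized vectors: the largest dot product W(j,:) x wins
        return np.argmax(W @ x)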
3.3.2 Network Learning Function

The learning process is an unsupervised one. Time is considered to be discrete, t = 1, 2, …. The weights are time dependent, W = W(t). The learning network is fed with the data x(t). At time t = 0 the weights are initialized with (small) random values. The weights at time t are updated as follows:
➀ For x(t), find the winning (output) neuron k. See section 3.3.1.
➁ Update the weights according to the model chosen (see section 3.2 for a selection of learning models):

  ΔW = (dW/dt) Δt

which in the discrete time approximation (Δt → t − (t−1) = 1) becomes:

  ΔW = W(t) − W(t−1) = dW/dt    (3.18)
3.3.3 Initialization and Stop Condition

The weights are initialized (in practice) with small random values (normalized or not) and the adjusting process continues by iteration.
The stopping of the learning process may be done by one of the following methods:
➀ choosing a fixed number of steps t = 1, …, T;
➁ continuing the learning process until the adjusting quantity Δw_{ji} = w_{ji}(t+1) − w_{ji}(t) falls under some specified value, i.e. |Δw_{ji}| ≤ ε, where ε is the threshold.
3.3.4 Remarks

The mapping feature of Kohonen networks. Due to the fact that the Kohonen algorithm "moves" the weight vectors towards the input vectors, the Kohonen network tries to map the input vectors, i.e. the weight vectors will try to copy the topology of the input vectors in the weight space. The mapping occurs in the weight space. See section 3.5.2 for an example. For this reason Kohonen networks are also called self-organizing maps, or SOM.

Activation function. Note that the activation function, as well as the neuronal output, is irrelevant to the learning process.

Incomplete learning. Even if the learning is unsupervised, in fact a poor choice of learning parameters may lead to an incomplete learning (so, in fact, a fully successful learning is "supervised" at a "higher" level). See section 3.5.2 and figure 3.6 for an example of incomplete learning.
➧ 3.4 The Algorithm

1. For all neurons in the output layer: initialize the weights with random values.
2. If working with normalized vectors, then normalize the weights.
3. Choose a model, i.e. the type of neuronal learning. See section 3.2 for some examples.
4. Choose a model for the neuronal neighborhood, i.e. the lateral feedback function. See also section 3.5 for some examples.
5. Choose a stop condition. See section 3.3.3.
6. Knowing the learning model, the neuronal neighborhood function and the stop condition, build the final equation giving the weight adaptation formula. See section 3.5.1 for an example of how to do this.
7. In the discrete time approximation, repeat the following steps till the stop condition is met:
(a) Get the input vector x(t).
(b) Among all neurons j in the output layer, find the "winner", i.e. the neuron k for which:

  ‖x(t) − W(k,:)ᵀ‖ = min_{j=1,…,K} ‖x(t) − W(j,:)ᵀ‖

or, if working with normalized vectors:

  W(k,:) x = max_{j=1,…,K} W(j,:) x

(c) Knowing the winner, change the weights by using the adaptation formula built at step 6.
➧ 3.5 Applications

3.5.1 The Trivial Model with Forgetting Function

This application is without practical value, but it shows how to build a weight adaptation formula. It also gives some examples of neuronal neighborhood models. The topics discussed here apply to many types of Kohonen networks.
Let us choose the trivial equation (3.1) as the learning model:

  dW/dt = α 1̂ xᵀ − β W    (3.19)

Next, let h(k, j) be the function modelling the neuronal neighborhood, i.e. the lateral feedback. This function should be of the "mexican hat" type, i.e.:

  h(k, j) { > 0   for j relatively close to k
          { ≤ 0   for j far, but not too far, from k
          { = 0   for j far away from k

This function will generally be a function of the "distance" between k and j, the distance being user-definable. Considering x_j and x_k the "coordinates", then h = h(|x_j − x_k|). Let x_{(K)}ᵀ = (x₁ … x_K) be the vector containing the neuron coordinates; then h(|x_{(K)} − x_{(K)k} 1̂|) will give the vector containing the adaptation height around the winner k, for the whole network.
Figure 3.3: The simple lateral feedback function.

To account for the neuronal neighborhood model chosen, equation (3.19) has to be changed to:

  dW/dt = [ h(|x_{(K)} − x_{(K)k} 1̂|) 1̂ᵀ ] ⊙ [ α 1̂ xᵀ − β W ]    (3.20)

Note that the elements of h(|x_{(K)} − x_{(K)k} 1̂|) corresponding to neurons outside the neuronal neighborhood of the winner are zero, and thus for these neurons dW(j,:)/dt = 0̂ᵀ, i.e. their weights remain unchanged for the current x.
Various neuronal neighborhoods may be of the form:
• Simple lateral feedback function:

  h(j, k) = { h₊    for j ∈ {k − n₊, …, k, …, k + n₊} (positive feedback)
            { −h₋   for j ∈ {k − n₊ − n₋, …, k − n₊ − 1} ∪ {k + n₊ + 1, …, k + n₊ + n₋} (negative feedback)
            { 0     for the rest

where h± ∈ [0, 1], h± = const. define the heights of the positive/negative feedback and n± ≥ 1, n± ∈ ℕ define the neuronal neighborhood. See figure 3.3.
• Exponential lateral feedback function:

  h(j, k) = h₊ e^{−(j−k)²} , for |j − k| ≤ n

where h₊ > 0, h₊ = const. defines the positive feedback (there is no negative one) and n > 0, n = const. defines the neuronal neighborhood. See figure 3.4.
Finally, the stop condition may be implemented by multiplying the right side of equation (3.20) by a function η(t) with the property lim_{t→∞} η(t) = 0. This way (3.20) becomes:

  dW/dt = η(t) [ h(|x_{(K)} − x_{(K)k} 1̂|) 1̂ᵀ ] ⊙ [ α 1̂ xᵀ − β W ]    (3.21)
Figure 3.4: The exponential lateral feedback function h(x) = e^{−x²}.

The convergence of η(t) to zero ensures that lim_{t→∞} dW/dt = 0̃ and thus the weight adaptation, i.e. the learning process, eventually stops.
Various stop functions may be of the form:
• Geometrical progression function:

  η(t) = η_init η_ratioᵗ , η_ratio ∈ (0, 1) , η_init, η_ratio = const.

where η_init and η_ratio are the initial/ratio values.
• Exponential function:

  η(t) = η_init e^{−f(t)}

where f(t) : ℕ → [0, ∞) is a monotone increasing function. Note that for f(t) proportional to t, with t ∈ ℕ⁺, this function is a geometrical progression.
In the discrete time approximation, using (3.21) in (3.18) gives:

  W(t+1) = W(t) + η(t) [ h(|x_{(K)} − x_{(K)k} 1̂|) 1̂ᵀ ] ⊙ [ α 1̂ xᵀ − β W(t) ]    (3.22)

(note that the winner k is also time dependent, i.e. x_{(K)k} = x_{(K)k}(x(t))).
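A sketch of one discrete-time step of (3.22), assuming the exponential lateral feedback and the geometrical-progression stop function above, with a shrinking neighborhood width as in section 3.5.2 (Python/NumPy; the names and default values are hypothetical):

    import numpy as np

    def som_step(W, coords, x, t, alpha=1.0, beta=1.0, h_plus=0.6,
                 d_init=5.0, d_rate=0.993, eta_init=1.0, eta_rate=0.993):
        # W: K x N weight matrix; coords: the K neuron "coordinates" x_(K)
        k = np.argmin(np.linalg.norm(W - x, axis=1))        # winner
        d2 = (coords - coords[k]) ** 2                      # squared distances
        h = h_plus * np.exp(-d2 / (d_init * d_rate**t)**2)  # lateral feedback
        eta = eta_init * eta_rate**t                        # stop function
        W += eta * h[:, None] * (alpha * x - beta * W)      # cf. (3.22)
        return k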
✍ Remarks:
➥ The above considerations are easily extensible to multidimensional Kohonen networks. E.g. for a bidimensional K × K layer, considering the winner k, the squared inter-neuronal Euclidean distances will be:

  (x_{(K)} − x_{(K)k} 1̂)^⊙2 + (y_{(K)} − y_{(K)k} 1̂)^⊙2

(the squares being taken element-wise), y_{(K)} holding the second coordinate. Of course, h also changes to:

  h = h( (x_{(K)} − x_{(K)k} 1̂)^⊙2 + (y_{(K)} − y_{(K)k} 1̂)^⊙2 )
3.5.2 Square Mapping

This example shows the mapping feature of a Kohonen network, one way (among many other possibilities) to build multidimensional networks, and possible defects in the learning process.
Let there be a Kohonen network with 2 inputs and 8 × 8 neurons (a bidimensional output layer with K = 8). The trivial equation model, in the discrete time approximation, will be used here; see section 3.5.1.
As the network is bidimensional, the following changes are made:
• The "x" coordinates of the neurons, in terms of inter-neuronal distances, are kept in a matrix X_{(K)} of the form:

  X_{(K)} = 1̂ (1 2 … 8) = ( 1 2 … 8
                            1 2 … 8
                            ⋯ ⋯ ⋯ ⋯
                            1 2 … 8 )

while the "y" coordinates are:

  Y_{(K)} = (1 2 … 8)ᵀ 1̂ᵀ = ( 1 1 … 1
                              2 2 … 2
                              ⋯ ⋯ ⋯ ⋯
                              8 8 … 8 )

and it follows immediately that the layout of the network is (the coordinates (x, y) are in parentheses):

  (1,1) … (8,1)
    ⋮        ⋮
  (1,8) … (8,8)

A particular neuron will then be identified by two numbers (j_x, j_y).
• Considering (x_{(K)k_xk_y}, y_{(K)k_xk_y}) the coordinates of the winner, the inter-neuronal squared distances may be kept in a matrix D(k_x, k_y) as:

  D(k_x, k_y) = (X_{(K)} − x_{(K)k_xk_y} 1̃)^⊙2 + (Y_{(K)} − y_{(K)k_xk_y} 1̃)^⊙2

• The lateral feedback function is:

  h((k_x, k_y), t) = h₊ exp( −D(k_x, k_y) / (d_init d_rateᵗ)² )

where h₊ = 0.6, d_init = 5 and d_rate = 0.993.
• The stop function is:

  η(t) = η_init η_rateᵗ

where η_init = 1 and η_rate = 0.993.
• The weights are kept in two matrices W₁ and W₂, corresponding to the two components of the input vector. The weight vector associated with a particular neuron (j_x, j_y) is (w_{1 j_x j_y}, w_{2 j_x j_y}).
Figure 3.5: Mapping of the [0, 1] × [0, 1] square by an 8 × 8 network. These are snapshots taken at t = 0 (upper-left), t = 7 (upper-right), t = 15 (lower-left) and t = 120 (lower-right).
The constants of the trivial equation are taken to be α = 1 and β = 1. Then the general weight updating formulas are:

  W₁(t+1) = W₁(t) + η(t) h((k_x, k_y), t) ⊙ [ x₁(t) 1̃ − W₁(t) ]
  W₂(t+1) = W₂(t) + η(t) h((k_x, k_y), t) ⊙ [ x₂(t) 1̃ − W₂(t) ]
The pair of weights associated with each neuron, (w_{1 j_x j_y}, w_{2 j_x j_y}), may also be represented as a point in the [0, 1] × [0, 1] square. A successful learning will try to cover as much as possible of the area. See figure 3.5, where the evolution of the training is shown from t = 0 (upper-left) to the final stage (lower-right). Lines are drawn between the closest neighbors (network-topology wise). The weights are the points at the intersections.

✍ Remarks:
➥ Even if the learning is unsupervised, in fact a poor choice of learning parameters (α, h₊, etc.) may lead to an incomplete learning. See figure 3.6: small values of the feedback function at the beginning of the learning make the network unable to "deploy" itself fast enough, leading to the appearance of a "twist" in the mapping. The network was the same as the one used to generate figure 3.5, including the same inputs and the same initial weights. The only parameters changed were: h₊ = 0.35, d_init = 3.5, d_rate = 0.99 and η_rate = 0.9999.

Figure 3.6: Incomplete learning of the [0, 1] × [0, 1] square by an 8 × 8 network.
CHAPTER 4

The BAM/Hopfield Memory

This network illustrates an associative memory. Unlike the classical von Neumann systems, where there is no link between the memory address and its contents, in ANN part of the information is used to retrieve the rest associated with it. This kind of memory is named associative.

➧ 4.1 Associative Memory

See [FS92] pp. 130–131.

Definition 4.1.1. Let there be P pairs of vectors {(x₁, y₁), …, (x_P, y_P)} with x_p ∈ ℝᴺ and y_p ∈ ℝᴷ, N, K, P ∈ ℕ⁺, called exemplars.
Then the mapping M : ℝᴺ → ℝᴷ is said to implement a heteroassociative memory if:

  M(x_p) = y_p , ∀p = 1, …, P
  M(x) = y_p , ∀x such that ‖x − x_p‖ < ‖x − x_ℓ‖ , ∀ℓ = 1, …, P , ℓ ≠ p

Definition 4.1.2. Let there be P pairs of vectors {(x₁, y₁), …, (x_P, y_P)} with x_p ∈ ℝᴺ and y_p ∈ ℝᴷ, N, K, P ∈ ℕ⁺, called exemplars.
Then the mapping M : ℝᴺ → ℝᴷ is said to implement an interpolative associative memory if:

  M(x_p) = y_p , ∀p = 1, …, P
  ∀d ⇒ ∃e such that M(x_p + d) = y_p + e , d ∈ ℝᴺ and e ∈ ℝᴷ , d, e ≠ 0̂

i.e. if x ≠ x_p then y = M(x) ≠ y_p, ∀p = 1, …, P.
The interpolative associative memory may be built from an orthonormal set of exemplars {x_p}. The M function is then defined as:

  M(x) = ( Σ_{p=1}^{P} y_p x_pᵀ ) x    (4.1)

Proof. The orthonormality of {x_p} means that x_pᵀ x_ℓ = δ_{pℓ}, where δ_{pℓ} is the Kronecker symbol.
From equation (4.1):

  M(x_ℓ) = ( Σ_{p=1}^{P} y_p x_pᵀ ) x_ℓ = Σ_{p=1}^{P} y_p x_pᵀ x_ℓ = Σ_{p=1}^{P} y_p δ_{pℓ} = y_ℓ

and for some x = x_ℓ + d:

  M(x) = M(x_ℓ + d) = M(x_ℓ) + M(d) = y_ℓ + e , where e = ( Σ_{p=1}^{P} y_p x_pᵀ ) d

(obviously M(x_ℓ + d) = M(x_ℓ) + M(d)).
Definition 4.1.3. Let there be a set of P vectors {x₁, …, x_P} with x_p ∈ ℝᴺ and N, P ∈ ℕ⁺, called exemplars.
Then the mapping M : ℝᴺ → ℝᴺ is said to implement an autoassociative memory if:

  M(x_p) = x_p , ∀p = 1, …, P
  M(x) = x_p , ∀x such that ‖x − x_p‖ < ‖x − x_ℓ‖ , ∀ℓ = 1, …, P , ℓ ≠ p

In general, x will be used to denote the input vector and y the output vector of an associative memory.
➧ 4.2 The BAM Architecture

See [FS92] pp. 131–132.

The BAM (Bidirectional Associative Memory) implements an interpolative associative memory and consists of 2 layers of neurons, fully interconnected.
Figure 4.1 shows the net as M(x) = y, but the input and output may swap places, i.e. the direction of the connection arrows may be reversed and y may play the role of the input, using the same weight matrix (but transposed, see below).
Considering the weight matrix W, the network output is y = Wx, i.e. the activation function is the identity, f(x) = x.
According to (4.1), the weight matrix may be built using a set of orthogonal {x_p} and the associated {y_p}, as:

  W = Σ_{p=1}^{P} y_p x_pᵀ    (4.2)
Figure 4.1: The BAM network structure.

If the {y_p} are also orthogonal then the network is reversible. Considering the y layer as the input:

  x = Wᵀ y

Proof. As for two matrices it is true that (AB)ᵀ = BᵀAᵀ, from (4.2):

  Wᵀ = Σ_{p=1}^{P} x_p y_pᵀ

By using the orthogonality property y_pᵀ y_ℓ = δ_{pℓ}, the proof is very similar to the one for (4.1) (swapping x ↔ y).

✍ Remarks:
➥ According to (4.2), the weights can be computed exactly (within the limitations of the rounding errors).
➥ The activation function of the neurons was assumed to be the identity: f(a) = a. Because the output function of a neuron should be bounded, so should be the data the network is working with (i.e. the x and y vectors).
➥ The network can be used as an autoassociative memory considering x ≡ y. The weight matrix becomes:

  W = Σ_{p=1}^{P} x_p x_pᵀ
➧ 4.3 BAM Dynamics

4.3.1 Network Running

See [FS92] pp. 132–136.

The BAM functionality differs from the others by the fact that the weights are not adjusted during a training period, but are calculated from the start, from the set of vectors to be stored {x_p, y_p}_{p=1,…,P}.
The procedure is developed for vectors belonging to the Hamming space ℍ (see the mathematical appendix). Due to the fact that most information can be encoded in binary form, this is not a significant limitation, and it does improve the reliability and speed of the net.
Bipolar vectors are used (with components having values either +1 or −1). A transition to and from binary vectors (having component values either 0 or 1) can easily be done using the relation x = 2x̃ − 1̂, where x is a bipolar (Hamming) vector and x̃ is a binary vector.
From a set of vectors {x_p, y_p}, the weight matrix is calculated by using (4.2). Both {x_p} and {y_p} have to be orthogonal, because the network works in both directions.
The procedure works in the discrete time approximation. An initial x(0) is applied to the input. The goal is to retrieve the vector y_ℓ corresponding to the x_ℓ closest to x(0), where {x_ℓ, y_ℓ} are from the exemplar set (stored in the net, at the initialization time, by means of the calculated weight matrix).
The information is propagated forward and back between the x and y layers till a stable state is reached, and subsequently a pair {x_ℓ, y_ℓ} belonging to the set of exemplars is found (at the outputs of the x, respectively y, layers).
The procedure is as follows:
• At t = 0, x(0) is applied to the net and the corresponding y(0) = W x(0) is calculated.
• The outputs of the x and y layers are propagated back and forth, till a stable state is reached, according to the formulas (for convenience, [W(:,i)]ᵀ ≡ W(:,i)ᵀ):

  x_i(t+1) = f(W(:,i)ᵀ y(t)) = { +1       if W(:,i)ᵀ y(t) > 0
                               { x_i(t)   if W(:,i)ᵀ y(t) = 0   , i = 1, …, N
                               { −1       if W(:,i)ᵀ y(t) < 0

  y_j(t+1) = f(W(j,:) x(t+1)) = { +1       if W(j,:) x(t+1) > 0
                                { y_j(t)   if W(j,:) x(t+1) = 0   , j = 1, …, K
                                { −1       if W(j,:) x(t+1) < 0

Note that the activation function f is not the identity.
In matrix notation the formulas may be written as:

  x(t+1) = sign(Wᵀ y(t)) + |sign(Wᵀ y(t))|^C ⊙ x(t)
  y(t+1) = sign(W x(t+1)) + |sign(W x(t+1))|^C ⊙ y(t)    (4.3)

and the stable condition means:

  Δ sign(Wᵀ y(t)) = Δ sign(W x(t+1)) = 0̂

Proof. sign(Wᵀ y(t)) gives the correct (±1) values of x(t+1) for the changing components and makes x_i(t+1) = 0 if W(:,i)ᵀ y = 0.
The vector |sign(Wᵀ y(t))|^C has its elements equal to 1 only for those x_i components which have to remain unchanged, and thus restores the values of x to the previous ones (only for those elements requiring it).
The proof for the second formula is similar.
The convergence of the process is ensured by theorem 4.3.1.
When working in reverse, y(0) is applied to the net, x(0) = Wᵀ y(0) is calculated, and the formulas change to:

  y(t+1) = sign(W x(t)) + |sign(W x(t))|^C ⊙ y(t)
  x(t+1) = sign(Wᵀ y(t+1)) + |sign(Wᵀ y(t+1))|^C ⊙ x(t)    (4.4)
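A sketch of the whole storage and recall procedure (4.2)–(4.3) (Python/NumPy, hypothetical names; X and Y hold the bipolar exemplars as rows, and the zeros produced by sign are replaced by the previous values, as in the text):

    import numpy as np

    def bam_weights(X, Y):
        # W = sum_p y_p x_p^T, cf. (4.2); X is P x N, Y is P x K
        return Y.T @ X

    def bam_recall(W, x):
        # propagate between the two layers until a stable state, cf. (4.3)
        y = np.sign(W @ x)
        while True:
            x_new = np.sign(W.T @ y)
            x_new[x_new == 0] = x[x_new == 0]   # keep unchanged components
            y_new = np.sign(W @ x_new)
            y_new[y_new == 0] = y[y_new == 0]
            if np.array_equal(x_new, x) and np.array_equal(y_new, y):
                return x_new, y_new             # stable pair reached
            x, y = x_new, y_new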
4.3.2 The BAM Energy Function

See [FS92] pp. 136–141.

Definition 4.3.1. The following function:

  E(x, y) = −yᵀ W x    (4.5)

is called the BAM energy function. (This is the Lyapunov function for the BAM: every state change, with respect to time (x = x(t) and y = y(t)), involves a decrease in the value of the function.)

Theorem 4.3.1. The BAM energy function has the following properties:
1. Any change in x or y during BAM running results in a decrease of E, i.e.:

  E_{t+1}(x(t+1), y(t+1)) ≤ E_t(x(t), y(t))

2. E is bounded below by E_min = −Σ_{j,i} |w_{ji}|.
3. When E changes, it must change by a finite amount, i.e. ΔE = E_{t+1} − E_t is finite.
Proof. 1. Consider that just one component of vector y changes from t to t+1, i.e. $y_k(t+1) \neq y_k(t)$. Then from equation (4.5):
$$\Delta E = E_{t+1} - E_t = \Big( -y_k(t+1) \sum_{i=1}^N w_{ki} x_i - \sum_{\substack{j=1\\ j\neq k}}^K \sum_{i=1}^N y_j w_{ji} x_i \Big) - \Big( -y_k(t) \sum_{i=1}^N w_{ki} x_i - \sum_{\substack{j=1\\ j\neq k}}^K \sum_{i=1}^N y_j w_{ji} x_i \Big)$$
$$= [y_k(t) - y_k(t+1)] \sum_{i=1}^N w_{ki} x_i = [y_k(t) - y_k(t+1)]\, W(k,:)\, x$$
According to the updating procedure, see section 4.3.1:
$$y_k(t+1) = \begin{cases} +1 & \text{if } W(k,:)\, x > 0 \\ y_k(t) & \text{if } W(k,:)\, x = 0 \\ -1 & \text{if } W(k,:)\, x < 0 \end{cases}$$
As it was assumed that $y_k$ changes, there are two cases:
• $y_k(t) = +1$ and it changes to $y_k(t+1) = -1$. Then $y_k(t) - y_k(t+1) > 0$ and $W(k,:)\, x < 0$ (according to the algorithm), so $\Delta E < 0$.
• $y_k(t) = -1$ and it changes to $y_k(t+1) = +1$. Analogous to the preceding case: $\Delta E < 0$.
4.3.2 See [FS92] pp. 136–141.
² This is the Liapunov function for the BAM. Any state change with respect to time ($x = x(t)$ and $y = y(t)$) involves a decrease in the value of the function.
If more than one component is changing then $\Delta E$ is of the form:
$$\Delta E = E_{t+1} - E_t = \sum_{k=1}^K [y_k(t) - y_k(t+1)]\, W(k,:)\, x < 0$$
which represents a sum of negative terms.
A similar discussion may be performed for an x change.
2. The $\{y_j\}_{j=\overline{1,K}}$ and $\{x_i\}_{i=\overline{1,N}}$ have all values either +1 or −1. The lowest possible value of E is obtained when all sum terms $y_j w_{ji} x_i$ (see (4.5)) are positive:
$$E_{\min} = -\sum_{j,i} |y_j w_{ji} x_i| = -\sum_{j,i} |y_j| |w_{ji}| |x_i| = -\sum_{j,i} |w_{ji}|$$
3. The energy function decreases, it doesn't increase, so $\Delta E \neq +\infty$. On the other hand the energy function is limited at the low end (according to the second part of the theorem) so it cannot decrease by an infinite amount: $\Delta E \neq -\infty$.
Also the value of $\Delta E$ can't be infinitesimally small, which would result in an infinite amount of time before E reaches its minimum. The minimum amount by which E may change corresponds to that k for which $|W(k,:)\, x|$ is minimal and occurs when $y_k$ is changing; the minimum amount is:
$$\Delta E = -2\, |W(k,:)\, x|$$
because $|y_k(t) - y_k(t+1)| = 2$.
Proposition 4.3.1. If the input pattern $x_\ell$ is exactly one of the stored $\{x_p\}$ then the corresponding $y_\ell$ is retrieved.
Proof. Theorem 4.3.1 ensures convergence of the process. According to the procedure, eventually a vector $y = \operatorname{sign}(W x_\ell)$ is obtained. The zeros generated by sign disappear by procedure definition: the previous values of y are kept instead. But as $x_p^T x_\ell = \delta_{p\ell}$ (orthogonality), then:
$$y = \operatorname{sign}(W x_\ell) = \operatorname{sign}\Big( \sum_{p=1}^P y_p x_p^T x_\ell \Big) = \operatorname{sign}\Big( \sum_{p=1}^P y_p \delta_{p\ell} \Big) = \operatorname{sign}(y_\ell) = y_\ell$$
✍ Remarks:
➥ The existence of the BAM energy function with the outlined properties ensures that the running process is convergent and, for any input vector, a solution is reached in finite time.
➥ If an input vector is slightly different from one of the stored ones, i.e. there is noise in the data, then the corresponding associated vector is eventually retrieved. However this is not guaranteed. Results may vary depending upon the amount of noise and the saturation of the memory (see below).
➥ The theoretical upper limit on the number of vectors to be stored in a BAM is $2^{N-1}$ (i.e. $2^N/2$, because x and −x carry the same amount of information due to symmetry). But if the possibility to work with noisy data is sought then the real capacity is much lower. Crosstalk may appear (a vector different from the desired one is retrieved).
➥ The Hamming vectors are symmetric with respect to the ±1 notation. For this reason a Hamming vector x carries the same information as its complement $x^C$, and a BAM network automatically stores both, because $x^C = -x$ and $y_p = W x_p$, $\forall p = \overline{1,P}$, so:
$$y_p^C = -y_p = -W x_p = W(-x_p) = W x_p^C$$
such that the same W matrix is used.
➥ When trying to retrieve a vector, if the initial one $x(0)$ is closer to the complement $x_p^C$ of a stored one, then the complement pair $\{x_p^C, y_p^C\}$ will be retrieved (because both exemplars and their complements are stored with equal precedence). The conclusion is that the BAM stores the directions of the exemplar vectors and not their values.

➧ 4.4 The BAM Algorithm
Network initialization
The weight matrix is calculated directly from the desired set to be stored:
$$W = \sum_{p=1}^P y_p x_p^T$$
Note that there is no learning process. Weights are directly initialized with their final values.
Network running forward
The network runs in discrete time approximation. Given $x(0)$, calculate $y(0) = W x(0)$. Then propagate: repeat the steps
$$x(t+1) = \operatorname{sign}(W^T y(t)) + |\operatorname{sign}(W^T y(t))|^C \odot x(t)$$
$$y(t+1) = \operatorname{sign}(W x(t+1)) + |\operatorname{sign}(W x(t+1))|^C \odot y(t)$$
till the network stabilizes, i.e. $\operatorname{sign}(W^T y(t)) = \operatorname{sign}(W x(t+1)) = \hat{0}$.
Note that in both the forward and backward running cases the intermediate vectors $W x$ or $W^T y$ may not be of Hamming type.
Network running backwards
In the same discrete time approximation, given $y(0)$, calculate $x(0) = W^T y(0)$. Then propagate using the formulas:
$$y(t+1) = \operatorname{sign}(W x(t)) + |\operatorname{sign}(W x(t))|^C \odot y(t)$$
$$x(t+1) = \operatorname{sign}(W^T y(t+1)) + |\operatorname{sign}(W^T y(t+1))|^C \odot x(t)$$
till the network stabilizes, i.e. $\operatorname{sign}(W^T y(t)) = \operatorname{sign}(W x(t+1)) = \hat{0}$.
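A hedged sketch of the whole algorithm above, reusing the threshold and bam_forward_step helpers from the previous sketch. It assumes the exemplars are stored as rows of the numpy arrays X (P × N) and Y (P × K) and that they are orthogonal, as required.

    def bam_train(X, Y):
        # W = sum_p y_p x_p^T ; no learning process, the weights are final
        return Y.T @ X

    def bam_recall(W, x0, max_iter=100):
        x = x0.astype(float)
        y = np.sign(W @ x)    # y(0) = W x(0); digitized here for simplicity
        for _ in range(max_iter):
            x_new, y_new = bam_forward_step(W, x, y)
            if np.array_equal(x_new, x) and np.array_equal(y_new, y):
                break         # stable state reached
            x, y = x_new, y_new
        return x, y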
Figure 4.2: The autoassociative memory structure.
Figure 4.3: The Hopfield network structure.

➧ 4.5 The Hopfield Memory
4.5.1 The Discrete Hopfield Memory
Consider an autoassociative memory. The weight matrix, built from a set $\{y_p\}$, is:
$$W = \sum_{p=1}^P y_p y_p^T$$
and it's square ($K \times K$) and symmetric ($W = W^T$).
An autoassociative memory is similar to a BAM with the remark that the 2 layers (x and y) are identical. In this case the 2 layers may be replaced with one fully interconnected layer, including a feedback for each neuron (see figure 4.2): the output of each neuron is connected to the inputs of all neurons, including itself.
The discrete Hopfield memory is built from the autoassociative memory described above by replacing the autofeedback (feedback from a neuron to itself) with an external input signal x (see figure 4.3).
The differences from the autoassociative memory or BAM are as follows:
➀ The discrete Hopfield memory works with binary vectors rather than bipolar ones (see section 4.3.1), so here and below the y vectors are considered binary and so are the input x vectors.
➁ The weight matrix is obtained from the matrix
$$\sum_{p=1}^P (2y_p - \hat{1})(2y_p - \hat{1})^T$$
by replacing the diagonal values with 0 (zeroing the diagonal elements of W is important for an efficient matrix notation³).
➂ The algorithm is similar to the BAM one but the updating formula is:
4.5.1 See [FS92] pp. 141–144.
$$y_j(t+1) = \begin{cases} +1 & \text{if } \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j > t_j \\[4pt] y_j(t) & \text{if } \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j = t_j \\[4pt] 0 & \text{if } \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j < t_j \end{cases} \tag{4.6}$$
where $t = \{t_j\}_{j=\overline{1,K}}$ is named the threshold vector.
In matrix notation the equation becomes:
$$A(t) = \operatorname{sign}(W y(t) + x - t)$$
$$y(t+1) = \frac{1}{2} \Big[ A(t) + \hat{1} - |A(t)|^C \Big] + |A(t)|^C \odot y(t)$$
Proof. First, as the diagonal elements of W are zero ($w_{ii} = 0$), then $\sum_{i=1, i\neq j}^K w_{ji} y_i = W(j,:)\, y$. Also the elements of $A(t) + \hat{1}$ are:
$$\{A(t) + \hat{1}\}_j = \begin{cases} 2 & \text{if } W(j,:)\, y + x_j > t_j \\ 1 & \text{if } W(j,:)\, y + x_j = t_j \\ 0 & \text{if } W(j,:)\, y + x_j < t_j \end{cases}$$
and the elements of $|A(t)|^C$ are:
$$\{|A(t)|^C\}_j = \begin{cases} 1 & \text{if } W(j,:)\, y + x_j = t_j \\ 0 & \text{otherwise} \end{cases}$$
Definition 4.5.1. The following function:
$$E = -\frac{1}{2}\, y^T W y - y^T (x - t) \tag{4.7}$$
is named the discrete Hopfield memory energy function.
³ This helps towards an efficient simulation implementation as well.
✍ Remarks:
➥ Compared to the BAM energy function, the discrete Hopfield energy function has a factor of 1/2 because there is just one layer of neurons (in the BAM both forward and backward passes contribute to the energy function).
Theorem 4.5.1. The discrete Hopfield energy function has the following properties:
1. Any change in y (during running) results in a decrease of E:
$$E_{t+1}(y(t+1)) \leqslant E_t(y(t))$$
2. E is bounded below by:
$$E_{\min} = -\frac{1}{2} \sum_{j,i} |w_{ji}| - K$$
3. When E changes it must change by a finite amount, i.e. $\Delta E = E_{t+1} - E_t$ is finite.
Proof. 1. Consider that, from t to t+1, just one component of vector y changes: $y_k$. Then from (4.7):
$$\Delta E = E_{t+1} - E_t = -[y_k(t+1) - y_k(t)] \Big( \sum_{\substack{i=1\\ i\neq k}}^K w_{ki} y_i + x_k - t_k \Big)$$
because in the sum $\sum_{i,j} y_j w_{ji} y_i$ the component $y_k$ appears twice, once at the left and once at the right, and $w_{ij} = w_{ji}$ (hence the factor 1/2 cancels).
According to the updating procedure (4.6):
$$y_k(t+1) = \begin{cases} +1 & \text{if } -\sum_{i\neq k} w_{ki} y_i - x_k + t_k < 0 \\ y_k(t) & \text{if } -\sum_{i\neq k} w_{ki} y_i - x_k + t_k = 0 \\ 0 & \text{if } -\sum_{i\neq k} w_{ki} y_i - x_k + t_k > 0 \end{cases}$$
there are 2 cases (it was assumed that $y_k(t+1) \neq y_k(t)$):
• $y_k(t) = +1$ and it changes to $y_k(t+1) = 0$. Then $y_k(t+1) - y_k(t) < 0$ and $-\sum_{i\neq k} w_{ki} y_i - x_k + t_k > 0$ (according to the algorithm), so $\Delta E < 0$.
• $y_k(t) = 0$ and it changes to $y_k(t+1) = +1$. Analogous to the preceding case: $\Delta E < 0$.
If more than one component is changing then $\Delta E$ is of the form:
$$\Delta E = E_{t+1} - E_t = -\sum_{j=1}^K [y_j(t+1) - y_j(t)] \Big( \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j - t_j \Big) < 0$$
which represents a sum of negative terms.
2. The $\{y_i\}_{i=\overline{1,K}}$ have all values either +1 or 0. The lowest possible value of E is obtained when $\{y_i\}_{i=\overline{1,K}} = 1$, the input vector is $x = \hat{1}$ and the threshold vector is $t = \hat{0}$, such that the negative terms are maximal and the positive term is minimal (see (4.7)), assuming that all $w_{ji} > 0$, $i,j = \overline{1,K}$:
$$E_{\min} = -\frac{1}{2} \sum_{\substack{j,i=1\\ j\neq i}}^K |w_{ji}| - K$$
3. The energy function decreases, it doesn't increase, so $\Delta E \neq +\infty$. On the other hand the energy function is limited at the low end (according to the second part of the theorem) so it cannot decrease by an infinite amount: $\Delta E \neq -\infty$.
Also the value of $\Delta E$ can't be infinitesimally small, which would result in an infinite amount of time before E reaches its minimum. The minimum amount by which E may change occurs when just one component $y_k$ is changing, the one for which $W(k,:)\, y$ is minimal, with $x_k = 1$ and $t_k = 0$; the amount is:
$$\Delta E = -\sum_{\substack{i=1\\ i\neq k}}^K w_{ki} y_i - x_k$$
($y_k$ appears twice in the quadratic sum, once at the left and once at the right, and $w_{ij} = w_{ji}$).
✍ Remarks:
➥ The existence of the discrete Hopfield energy function with the outlined properties ensures that the running process is convergent and a solution is reached in finite time.
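The following sketch, under the same assumptions as before (numpy, hypothetical names), implements the synchronous update (4.6) in its matrix form together with the energy (4.7), so that the monotonic decrease of E can be checked numerically. W must be symmetric with a zero diagonal, as required above.

    import numpy as np

    def hopfield_step(W, y, x, t):
        # A = sign(W y + x - t); +1 -> output 1, -1 -> output 0,
        # 0 -> keep the previous output (equation (4.6))
        A = np.sign(W @ y + x - t)
        return np.where(A == 0, y, (A + 1) / 2)

    def hopfield_energy(W, y, x, t):
        # discrete Hopfield energy function, equation (4.7)
        return -0.5 * y @ W @ y - y @ (x - t)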
4.5.2 The Continuous Hopfield Memory
The continuous Hopfield memory model is similar to the discrete one except for the activation function of the neuron, which is of the form:
$$f(a) = \frac{1 + \tanh(\lambda a)}{2}, \qquad \lambda = \text{const.}, \ \lambda \in \mathbb{R}_+$$
where $\lambda$ is called the gain parameter. See figure 4.4. The inverse of the activation is:
$$f^{(-1)}(y) = \frac{1}{2\lambda} \ln \frac{y}{1-y} \tag{4.8}$$
See figure 4.5.
The differential equation governing the evolution of the continuous Hopfield memory is defined as:
$$\frac{da_j}{dt} = \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j - t_j - \frac{a_j}{\lambda} \tag{4.9}$$
or in matrix notation:
$$\frac{da}{dt} = W y + x - t - \frac{a}{\lambda}$$
4.5.2 See [FS92] pp. 144–148.
Figure 4.4: The neuron activation function for the continuous Hopfield memory (for different $\lambda$ values).
Figure 4.5: The inverse of the neuron activation function for the continuous Hopfield memory (for different $\lambda$ values).
In discrete time approximation the updating procedure may be written as:
$$y(t+1) = y(t) + 2\lambda\, y(t) \odot [\hat{1} - y(t)] \odot \Big[ W y(t) + x - t - \frac{1}{2\lambda^2} \ln \frac{y(t)}{\hat{1} - y(t)} \Big] \tag{4.10}$$
(of course the operations under the ln function are done on each $y_j$ separately).
Proof. From (4.9):
$$\frac{d f^{(-1)}(y_j)}{dt} = \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j - t_j - \frac{1}{\lambda} f^{(-1)}(y_j)$$
$$d f^{(-1)}(y_j) = \frac{1}{2\lambda}\, d \ln \frac{y_j}{1 - y_j} = \frac{1}{2\lambda} \Big( \frac{1}{y_j} + \frac{1}{1 - y_j} \Big) dy_j = \frac{1}{2\lambda} \frac{dy_j}{y_j (1 - y_j)} \;\Rightarrow$$
$$dy_j = 2\lambda\, y_j (1 - y_j) \Big( \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j - t_j - \frac{1}{2\lambda^2} \ln \frac{y_j}{1 - y_j} \Big) dt$$
and in discrete time approximation $dt \to \Delta t = t + 1 - t = 1$.
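A minimal sketch of one discrete-time step (4.10), assuming the gain-parameter placement reconstructed above; the clipping is only a numerical guard keeping y inside (0,1), where the logarithm is defined.

    import numpy as np

    def cont_hopfield_step(W, y, x, t, lam, eps=1e-9):
        yc = np.clip(y, eps, 1 - eps)        # keep y in (0,1) for the log
        a = W @ y + x - t - np.log(yc / (1 - yc)) / (2 * lam**2)
        return y + 2 * lam * yc * (1 - yc) * a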
Definition 4.5.2. The following function:
$$E = -\frac{1}{2} \sum_{\substack{i,j=1\\ i\neq j}}^K y_j w_{ji} y_i - \sum_{j=1}^K x_j y_j + \sum_{j=1}^K t_j y_j + \frac{1}{\lambda} \sum_{j=1}^K \int_0^{y_j} f^{(-1)}(y')\, dy' \tag{4.11}$$
is named the continuous Hopfield memory energy function.
Theorem 4.5.2. The continuous Hopfield energy function has the following properties:
1. Any change in y as a result of running (evolution) results in a decrease of E, i.e. $\frac{dE}{dt} < 0$.
2. E is bounded below by:
$$E_{\min} = -\frac{1}{2} \sum_{\substack{j,i=1\\ j\neq i}}^K |w_{ji}| - K$$
Proof. 1. First:
$$\lim_{x \searrow 0} x \ln x = \lim_{x \searrow 0} \frac{\ln x}{1/x} \overset{\text{(L'Hospital)}}{=} \lim_{x \searrow 0} (-x) = 0 \qquad \text{and} \qquad \int \ln x\, dx = x \ln x - x \;\Rightarrow$$
$$\int_0^{y_i} \ln \frac{y'}{1 - y'}\, dy' = \lim_{\varepsilon \searrow 0} \int_\varepsilon^{y_i} \ln \frac{y'}{1 - y'}\, dy' = \ln \big[ y_i^{y_i} (1 - y_i)^{1 - y_i} \big]$$
$$\Rightarrow \frac{d}{dt} \ln \big[ y_i^{y_i} (1 - y_i)^{1 - y_i} \big] = \big( \ln y_i - \ln(1 - y_i) \big) \frac{dy_i}{dt} = 2\lambda\, f^{(-1)}(y_i)\, \frac{dy_i}{dt}$$
then, from (4.11) and using (4.9):
$$\frac{dE}{dt} = -\sum_{j=1}^K \Big( \sum_{\substack{i=1\\ i\neq j}}^K w_{ji} y_i + x_j - t_j - \frac{a_j}{\lambda} \Big) \frac{dy_j}{dt} = -\sum_{j=1}^K \frac{da_j}{dt} \frac{dy_j}{dt} = -\sum_{j=1}^K \frac{d f^{(-1)}(y_j)}{dy_j} \Big( \frac{dy_j}{dt} \Big)^2$$
because
$$\frac{d f^{(-1)}(y_j)}{dy_j} = \frac{1}{2\lambda\, y_j (1 - y_j)} > 0 \qquad (y_j \in (0,1))$$
such that $\frac{dE}{dt}$ is always negative and E decreases in time.
2. The lowest possible value of E is obtained when $\{y_j\}_{j=\overline{1,K}} = 1$, the input vector is $x = \hat{1}$ and the threshold vector is $t = \hat{0}$, such that the negative terms are maximal and the positive term is minimal (see (4.11)), assuming that all $w_{ji} > 0$, $i,j = \overline{1,K}$:
$$E_{\min} = -\frac{1}{2} \sum_{\substack{j,i=1\\ j\neq i}}^K |w_{ji}| - K$$
✍ Remarks:
➥ The existence of the continuous Hopfield energy function with the outlined properties ensures that the running process is convergent.
➥ While the process is convergent, there is no guarantee that it will converge to the lowest energy value.
➥ For $\lambda \to +\infty$ the continuous Hopfield memory becomes identical to the discrete one. Otherwise:
• For $\lambda \to 0$ there is only one stable state for the network, when $y = \frac{1}{2}\hat{1}$.
• For $\lambda \in (0, +\infty)$ the stable states are somewhere between the corners of the Hamming hypercube (having its center at $\frac{1}{2}\hat{1}$) and its center, such that as the gain $\lambda$ decreases from $+\infty$ to 0 the stable points move from the corners towards the center and at some point they may merge.

➧ 4.6 Applications
4.6.1 The Traveling Salesperson Problem
This example shows a practical problem of scheduling, e.g. as it arises in communication networks. A bidimensional continuous Hopfield memory is being used. It is also a classical example of an NP problem solved with the help of an ANN.
The problem: a traveling salesperson must visit a number of cities, each only once. Moving from one city to another has a cost, e.g. the associated intercity distance. The total cost/distance traveled must be minimized. The salesperson has to return to the starting point.
The problem is of NP (non-polynomial) type.
Proof. Assuming that there are K cities, there are K! orderings. For a given tour it doesn't matter which city is first (one division by K) nor does the direction matter (one division by 2). So the number of different paths is $(K-1)!/2$ (with $K > 3$, otherwise a circuit is not possible). For example, K = 10 cities already give $9!/2 = 181440$ distinct tours.
Adding a new city to the previous set means that now there are $K!/2$ routes. That means an increase in the number of paths by a factor of:
$$\frac{K!/2}{(K-1)!/2} = K$$
so for an arithmetic-progression growth of the problem size, the space of possible solutions grows (at least) exponentially.
Let $C_1, \ldots, C_K$ be the cities involved. To each of the K cities is attached a vector which represents its order of visiting in the current tour: the first one to be visited has $(1\ 0\ \ldots\ 0)$, the second one has $(0\ 1\ 0\ \ldots\ 0)$ and so on, the last one to be visited having attached the vector $(0\ \ldots\ 0\ 1)$; i.e. each vector has one digit "1", all other elements being "0" (this format is different from the binary representation of the city number j, as cities are not visited in their numbering order).
Having the cities $C_1, \ldots, C_K$, a square matrix can be built using their associated vectors as rows. Because the cities aren't necessarily visited in their listed order, the matrix is not necessarily diagonal.
4.6.1 See [FS92] pp. 148–156.
$$\begin{array}{c|ccccccc}
 & 1 & \cdots & j_1 & \cdots & j_2 & \cdots & K \\ \hline
C_1 & 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots \\
C_{k_1} & 0 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\
\vdots \\
C_{k_2} & 0 & \cdots & 0 & \cdots & 1 & \cdots & 0 \\
\vdots \\
C_K & 0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{array} \tag{4.12}$$
This matrix defines the tour: for each column $j = \overline{1,K}$ pick as next city the one having the corresponding row element equal to 1.
The idea is to build a bidimensional Hopfield memory such that its output is a matrix Y (not a vector) having the layout (4.12), and this will give the solution (as each row will represent the visiting order number of the respective city).
In order to be an acceptable solution, the Y matrix has to have the following properties:
➀ Each city must not be visited more than once ⇔ each row of the matrix (4.12) should have no more than one "1", all other elements should be "0".
➁ Two cities cannot be visited at the same time (can't have the same order number) ⇔ each column of the matrix (4.12) should have no more than one "1", all other elements should be "0".
➂ All cities should be visited ⇔ each row or column of the matrix (4.12) should have at least one "1".
➃ The total distance/cost of the tour should be minimized. Let $d_{k_1 k_2}$ be the distance/cost between cities $C_{k_1}$ and $C_{k_2}$.
As the network is bidimensional, each weight has 4 subscripts: $w_{k_2 j_2, k_1 j_1}$ is the weight from neuron {row $k_1$, column $j_1$} to neuron {row $k_2$, column $j_2$}. See figure 4.6.
✍ Remarks:
➥ When using bidimensional Hopfield networks all the established results are kept, but each subscript splits in 2, giving the row and column position (instead of one giving the column position).
The weights cannot be built from a set of some $\{Y_\ell\}$, as these are not known (the idea is to find them), but they may be built considering the following reasoning:
➀ A city must appear only once in a tour: one neuron on a row must inhibit all others on the same row, such that in the end only one will have active output 1, all others 0. Then the weight should have a term of the form:
$$w^{(1)}_{k_2 j_2, k_1 j_1} = -A\, \delta_{k_1 k_2} (1 - \delta_{j_1 j_2}), \qquad A \in \mathbb{R}_+,\ A = \text{const.}$$
where $\delta$ is the Kronecker symbol. This means $w^{(1)}_{k_2 j_2, k_1 j_1} = 0$ for neurons on different rows, $w^{(1)}_{k_1 j_2, k_1 j_1} < 0$ on a given row $k_1$ if $j_1 \neq j_2$, and $w^{(1)}_{k_1 j_1, k_1 j_1} = 0$ for feedback.
Figure 4.6: The bidimensional Hopfield memory and its weight representation.
➁ There must be no cities with the same order number in a tour: one neuron on a column must inhibit all others on the same column, such that in the end only one will have active output 1, all others 0. Then the weight should have a term of the form:
$$w^{(2)}_{k_2 j_2, k_1 j_1} = -B\, \delta_{j_1 j_2} (1 - \delta_{k_1 k_2}), \qquad B \in \mathbb{R}_+,\ B = \text{const.}$$
This means $w^{(2)}_{k_2 j_2, k_1 j_1} = 0$ for neurons on different columns, $w^{(2)}_{k_2 j_1, k_1 j_1} < 0$ on a given column $j_1$ if $k_1 \neq k_2$, and $w^{(2)}_{k_1 j_1, k_1 j_1} = 0$ for feedback.
➂ Most of the neurons should have output 0, so a global inhibition may be used. Then the weight should have a term of the form:
$$w^{(3)}_{k_2 j_2, k_1 j_1} = -C, \qquad C \in \mathbb{R}_+,\ C = \text{const.}$$
i.e. all neurons receive the same global inhibition $\propto C$.
➃ The total distance/cost has to be minimized, so neurons receive an inhibitory input proportional to the distance between the cities they represent. Only neurons on adjacent columns, representing cities which may come before or after the current city in the tour order (only one will actually be selected), should receive this inhibition:
$$w^{(4)}_{k_2 j_2, k_1 j_1} = -D\, d_{k_1 k_2} \big( \delta_{j_1,\, j_2'+1} + \delta_{j_1,\, j_2''-1} \big), \qquad D \in \mathbb{R}_+,\ D = \text{const.}$$
where
$$j_2' = \begin{cases} 0 & \text{if } j_1 = 1 \text{ and } j_2 = K \\ j_2 & \text{in rest} \end{cases} \qquad\quad j_2'' = \begin{cases} K+1 & \text{if } j_1 = K \text{ and } j_2 = 1 \\ j_2 & \text{in rest} \end{cases}$$
to take care of the special cases $j_1 = 1$ and $j_1 = K$. The term $\delta_{j_1,\, j_2'+1}$ takes care of the case when column $j_2$ comes before $j_1$ (column K comes "before" column 1), while $\delta_{j_1,\, j_2''-1}$ operates similarly for the case when $j_2$ comes after $j_1$.
Finally, the weights are:
$$w_{k_2 j_2, k_1 j_1} = w^{(1)}_{k_2 j_2, k_1 j_1} + w^{(2)}_{k_2 j_2, k_1 j_1} + w^{(3)}_{k_2 j_2, k_1 j_1} + w^{(4)}_{k_2 j_2, k_1 j_1} \tag{4.13}$$
$$= -A\, \delta_{k_1 k_2} (1 - \delta_{j_1 j_2}) - B\, \delta_{j_1 j_2} (1 - \delta_{k_1 k_2}) - C - D\, d_{k_1 k_2} \big( \delta_{j_1,\, j_2'+1} + \delta_{j_1,\, j_2''-1} \big)$$
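A sketch of (4.13) as a 4-index numpy tensor; the indices are 0-based, so the wrap-around cases of $j_2'$, $j_2''$ become simple modular arithmetic. All names and the helper itself are illustrative.

    import numpy as np

    def tsp_weights(d, A, B, C, D):
        # w[k2, j2, k1, j1] of equation (4.13); d is the K x K distance matrix
        K = d.shape[0]
        w = np.zeros((K, K, K, K))
        for k1 in range(K):
            for k2 in range(K):
                for j1 in range(K):
                    for j2 in range(K):
                        same_row = 1.0 if k1 == k2 else 0.0   # delta_{k1 k2}
                        same_col = 1.0 if j1 == j2 else 0.0   # delta_{j1 j2}
                        # cyclic neighbours: column j2 just before or after j1
                        adj = (j1 == (j2 + 1) % K) + (j1 == (j2 - 1) % K)
                        w[k2, j2, k1, j1] = (-A * same_row * (1 - same_col)
                                             - B * same_col * (1 - same_row)
                                             - C
                                             - D * d[k1, k2] * adj)
        return w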
Taking $Y = \{y_{kj}\}$ to be the network output (a matrix) and considering $X = CK\tilde{1}$ as input (again a matrix, $\tilde{1}$ having all elements 1) and $T = \tilde{0}$ as the threshold (a matrix again), then from the definition of the continuous Hopfield energy function (4.11), using the weights (4.13):
$$E = \frac{1}{2} A \sum_{k=1}^K \sum_{\substack{j_1, j_2 = 1\\ j_1 \neq j_2}}^K y_{kj_1} y_{kj_2} + \frac{1}{2} B \sum_{\substack{k_1, k_2 = 1\\ k_1 \neq k_2}}^K \sum_{j=1}^K y_{k_1 j} y_{k_2 j} + \frac{1}{2} C \sum_{k_1, k_2 = 1}^K \sum_{j_1, j_2 = 1}^K y_{k_1 j_1} y_{k_2 j_2}$$
$$+ \frac{1}{2} D \sum_{\substack{k_1, k_2 = 1\\ k_1 \neq k_2}}^K \sum_{j=1}^K d_{k_1 k_2}\, y_{k_1 j} \big( y_{k_2,\, j'+1} + y_{k_2,\, j''-1} \big) - CK \sum_{k=1}^K \sum_{j=1}^K y_{kj}$$
$$= \frac{1}{2} A \sum_{k=1}^K \sum_{\substack{j_1, j_2 = 1\\ j_1 \neq j_2}}^K y_{kj_1} y_{kj_2} + \frac{1}{2} B \sum_{j=1}^K \sum_{\substack{k_1, k_2 = 1\\ k_1 \neq k_2}}^K y_{k_1 j} y_{k_2 j} + \frac{1}{2} C \Big( \sum_{k=1}^K \sum_{j=1}^K y_{kj} - K \Big)^2$$
$$+ \frac{1}{2} D \sum_{\substack{k_1, k_2 = 1\\ k_1 \neq k_2}}^K \sum_{j=1}^K d_{k_1 k_2}\, y_{k_1 j} \big( y_{k_2,\, j'+1} + y_{k_2,\, j''-1} \big) - \frac{1}{2} C K^2 = E_1 + E_2 + E_3 + E_4 - \frac{1}{2} C K^2 \tag{4.14}$$
where $j' + 1$ and $j'' - 1$ denote the cyclic successor, respectively predecessor, column of j:
$$j' = \begin{cases} 0 & \text{if } j = K \\ j & \text{in rest} \end{cases} \qquad\quad j'' = \begin{cases} K+1 & \text{if } j = 1 \\ j & \text{in rest} \end{cases}$$
According to theorem 4.5.2, during network running the energy (4.14) decreases and reaches a minimum. This may be interpreted as follows:
➀ The energy minimum will favor states that have each city only once in the tour:
$$E_1 = \frac{1}{2} A \sum_{k=1}^K \sum_{\substack{j_1, j_2 = 1\\ j_1 \neq j_2}}^K y_{kj_1} y_{kj_2}$$
which reaches its minimum $E_1 = 0$ if and only if each city appears only once in the tour, such that the products $y_{kj_1} y_{kj_2}$ are either of type $1 \cdot 0$ or $0 \cdot 0$, i.e. there is only one 1 in each row of the matrix (4.12). The 1/2 factor means that the terms $y_{kj_1} y_{kj_2} = y_{kj_2} y_{kj_1}$ are added only once, not twice.
➁ The energy minimum will favor states that have each position of the tour occupied by only one city, i.e. if city $C_{k_1}$ is the $k_2$-th to be visited then no other city can be in the same $k_2$-th position in the tour:
$$E_2 = \frac{1}{2} B \sum_{j=1}^K \sum_{\substack{k_1, k_2 = 1\\ k_1 \neq k_2}}^K y_{k_1 j} y_{k_2 j}$$
which reaches its minimum $E_2 = 0$ if and only if each city has a different order number in the tour, such that the products $y_{k_1 j} y_{k_2 j}$ are either of type $1 \cdot 0$ or $0 \cdot 0$, i.e. there is only one 1 in each column of the matrix (4.12). The 1/2 factor means that the terms $y_{k_1 j} y_{k_2 j} = y_{k_2 j} y_{k_1 j}$ are added only once, not twice.
➂ The energy minimum will favor states that have all cities in the tour:
$$E_3 = \frac{1}{2} C \Big( \sum_{k=1}^K \sum_{j=1}^K y_{kj} - K \Big)^2$$
reaching its minimum $E_3 = 0$ if all cities are represented in the tour, i.e. $\sum_{k=1}^K \sum_{j=1}^K y_{kj} = K$; the fact that if a city is present it is present once and only once was taken care of by the previous terms (there are K and only K "ones" in the whole matrix (4.12)). The squaring shows that the absolute value of the difference is what matters (otherwise the energy might decrease for an increasing number of missed cities); either $\sum_{k=1}^K \sum_{j=1}^K y_{kj} \gtrless K$ is bad.
➃ The energy minimum will favor states with minimum distance/cost of the tour:
$$E_4 = \frac{1}{2} D \sum_{\substack{k_1, k_2 = 1\\ k_1 \neq k_2}}^K \sum_{j=1}^K d_{k_1 k_2}\, y_{k_1 j} \big( y_{k_2,\, j'+1} + y_{k_2,\, j''-1} \big)$$
If $y_{k_1 j} = 0$ then no distance is added. If $y_{k_1 j} = 1$ then 2 cases arise:
(a) $y_{k_2,\, j'+1} = 1$, which means that city $C_{k_2}$ is the next one in the tour and the distance $d_{k_1 k_2}$ is added: $d_{k_1 k_2}\, y_{k_1 j}\, y_{k_2,\, j'+1} = d_{k_1 k_2}$.
(b) $y_{k_2,\, j'+1} = 0$, which means that city $C_{k_2}$ isn't the next one in the tour and the corresponding distance is not added: $d_{k_1 k_2}\, y_{k_1 j}\, y_{k_2,\, j'+1} = 0$.
A similar discussion holds for $y_{k_2,\, j''-1}$. The 1/2 factor means that the distances $d_{k_1 k_2} = d_{k_2 k_1}$ are added only once, not twice. From the previous terms there should be only one digit "1" on each row, so a distance $d_{k_1 k_2}$ should appear only once (the factor 1/2 was considered).
➄ The term $-\frac{1}{2} C K^2$ is just an additive constant, used to create the square in $E_3$.
To be able to use the running procedure (4.10), a way to convert $\{w_{k_2 j_2, k_1 j_1}\}$ to a matrix and $\{y_{kj}\}$ to a vector has to be found. As the indices $\{k, j\}$ work in pairs, this can easily be done using the lexicographic convention:
$$w_{k_2 j_2, k_1 j_1} \to \widetilde{w}_{K(j_2 - 1) + k_2,\; K(j_1 - 1) + k_1} \qquad\quad y_{kj} \to \widetilde{y}_{K(j-1)+k}$$
The graphical representation of the $Y \to \widetilde{y}$ transformation is very simple: take each column of $Y^T$ and "glue" it at the bottom of the previous one. The inverse transformation of $\widetilde{y}$, to get back Y, is also very simple:
$$\widetilde{y}_\ell \to y_{kj} \quad \text{where} \quad k = \ell \bmod K \ (\text{taking } k = K \text{ when the remainder is } 0), \qquad j = \frac{\ell - k}{K} + 1$$
The updating formula (4.10) then becomes (here $x = CK\hat{1}$ and $t = \hat{0}$):
$$\widetilde{y}(t+1) = \widetilde{y}(t) + 2\lambda\, \widetilde{y}(t) \odot [\hat{1} - \widetilde{y}(t)] \odot \Big[ \widetilde{W}\, \widetilde{y}(t) + CK\hat{1} - \frac{1}{2\lambda^2} \ln \frac{\widetilde{y}(t)}{\hat{1} - \widetilde{y}(t)} \Big]$$
and the A, B, C and D constants are used to tune the process.
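Putting the pieces together, a hedged sketch of the whole procedure: flatten the weights with the lexicographic convention, iterate the update above, and read the tour off the rows of Y. It reuses tsp_weights and cont_hopfield_step from the earlier sketches; the constants and the iteration count are purely illustrative and, as remarked in section 4.5.2, convergence to a valid (let alone optimal) tour is not guaranteed.

    def tsp_run(d, A=500.0, B=500.0, C=200.0, D=500.0, lam=50.0, iters=2000):
        K = d.shape[0]
        w = tsp_weights(d, A, B, C, D)
        # (k, j) -> K*j + k : the 0-based form of the K(j-1)+k convention
        Wt = w.transpose(1, 0, 3, 2).reshape(K * K, K * K)
        x = C * K * np.ones(K * K)           # input X = CK * ones
        t = np.zeros(K * K)                  # threshold T = 0
        rng = np.random.default_rng(0)
        y = 0.5 + 0.01 * rng.standard_normal(K * K)  # start near the centre
        for _ in range(iters):
            y = cont_hopfield_step(Wt, y, x, t, lam)
            y = np.clip(y, 1e-6, 1 - 1e-6)   # keep y inside the hypercube
        return y.reshape(K, K).T             # rows = cities, columns = order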
CHAPTER 5
The Counterpropagation Network
The counterpropagation network (CPN) is an example of ANN interconnectivity: from several subnetworks a new one is created, forming a reversible, heteroassociative memory¹.
➧ 5.1 The CPN Architecture
Let there be a set of vector pairs $(x_1, y_1), \ldots, (x_P, y_P)$ which may be classified into several classes $C_1, \ldots, C_H$. The CPN associates an input vector x with the $\langle y \rangle_k \in C_k$ for which the corresponding $\langle x \rangle_k$ is closest to the input x. $\langle x \rangle_k$ and $\langle y \rangle_k$ are the averages over those $x_p$, respectively $y_p$, which are from the same class $C_k$.
The CPN may also work in reverse, inputting a y and retrieving an $\langle x \rangle_k$.
The CPN architecture consists of 5 layers on 3 levels. The input level contains the x and y layers; the middle level contains the hidden layer and the output level contains the x' and y' layers. See figure 5.1. Note that each neuron on the x layer, input level, receives the whole x (and similarly for the y layer), and also there are direct links between the input and output levels.
Considering a trained network: an x vector is applied, $y = \hat{0}$ at the input level, and the corresponding vector $\langle y \rangle_k$ is retrieved. When running in reverse a y vector is applied, $x = \hat{0}$ at the input level, and the $\langle x \rangle_k$ is retrieved. Both cases are identical, only the first will be discussed in detail.
¹ See "The BAM/Hopfield Memory" chapter for the definition.
5.1 See [FS92] pp. 213–234.
Figure 5.1: The CPN architecture.
This functionality is achieved as follows:
• The first level normalizes the input vector.
• The second level (hidden layer) does a classification of the input vector, outputting a one-of-k encoded classification, i.e. the outputs of all hidden neurons are zero with one exception, and the number/label of the corresponding neuron identifies the input vector as belonging to a class (as being closest to some particular, "representative", previously stored vector).
• Based on the classification performed by the hidden layer, the output layer actually retrieves a "representative" vector.
All three subnetworks are quasi-independent, and training at one level is performed only after the training at the previous level has been finished.
5.1.1 The Input Layer
Consider the x input layer². Let N be the dimension of vector x and K the dimension of vector y. Let $z_x$ be the output vector of the input x layer. See figure 5.2.
Figure 5.2: The input layer.
The input layer has to achieve a normalization of the input; this may be done if the neuronal activity on the input layer is defined as follows:
• each neuron receives a positive excitation proportional to its corresponding input, i.e. $+Bx_i$, $B = \text{const.}$, $B > 0$;
• each neuron receives a negative excitation from all neurons on the same layer, including itself, equal to $-z_{xi} x_j$;
• the input vector x is applied at time $t = 0$ and removed (x becomes $\hat{0}$) at time $t = t_0$; and
• in the absence of the input $x_i$, the neuronal output $z_{xi}$ decreases to zero following an exponential defined by $A = \text{const.}$, $A > 0$, i.e. $z_{xi} \propto e^{-At}$.
² An identical discussion goes for the y layer, as previously specified.
Then the neuronal behaviour may be summarized into:
$$\frac{dz_{xi}}{dt} = -A z_{xi} + B x_i - z_{xi} \sum_{j=1}^N x_j \quad \text{for } t \in [0, t_0)$$
$$\frac{dz_{xi}}{dt} = -A z_{xi} \quad \text{for } t \in [t_0, \infty)$$
or in matrix notation:
$$\frac{dz_x}{dt} = -A z_x + B x - (x^T \hat{1})\, z_x \quad \text{for } t \in [0, t_0) \tag{5.1a}$$
$$\frac{dz_x}{dt} = -A z_x \quad \text{for } t \in [t_0, \infty) \tag{5.1b}$$
The boundary conditions are $z_x(0) = \hat{0}$ (starts from $\hat{0}$, no previously applied signal) and $\lim_{t\to\infty} z_x(t) = \hat{0}$ (returns to $\hat{0}$ after the signal has been removed). For continuity purposes the condition $\lim_{t \nearrow t_0} z_x(t) = \lim_{t \searrow t_0} z_x(t)$ should be imposed. With these limit conditions, the solutions to (5.1a) and (5.1b) are:
$$z_x = \frac{B x}{A + x^T \hat{1}} \Big\{ 1 - \exp\big[ -(A + x^T \hat{1})\, t \big] \Big\} \quad \text{for } t \in [0, t_0) \tag{5.2a}$$
$$z_x = \frac{B x}{A + x^T \hat{1}} \Big\{ 1 - \exp\big[ -(A + x^T \hat{1})\, t_0 \big] \Big\}\, e^{-A(t - t_0)} \quad \text{for } t \in [t_0, \infty) \tag{5.2b}$$
Proof. 1. From (5.1a), for $t \in [0, t_0)$:
$$\frac{dz_{xi}}{dt} + \Big( A + \sum_{j=1}^N x_j \Big) z_{xi} = B x_i$$
First a solution of the homogeneous equation is to be found:
$$\frac{dz_{xi}}{dt} + \Big( A + \sum_{j=1}^N x_j \Big) z_{xi} = 0 \;\Rightarrow\; \frac{dz_{xi}}{z_{xi}} = -\Big( A + \sum_{j=1}^N x_j \Big) dt \;\Rightarrow\; z_{xi} = z_{xi0} \exp\Big[ -\Big( A + \sum_{j=1}^N x_j \Big) t \Big]$$
where $z_{xi0}$ is the integration constant. The general solution of the non-homogeneous equation is found considering $z_{xi0} = z_{xi0}(t)$. Then from (5.1a) ($x_i = \text{const.}$):
$$\frac{dz_{xi0}}{dt} \exp\Big[ -\Big( A + \sum_{j=1}^N x_j \Big) t \Big] = B x_i \;\Rightarrow\; z_{xi0} = B x_i \int \exp\Big[ \Big( A + \sum_{j=1}^N x_j \Big) t \Big] dt = \frac{B x_i}{A + \sum_{j=1}^N x_j} \exp\Big[ \Big( A + \sum_{j=1}^N x_j \Big) t \Big]$$
$$\Rightarrow\; z_{xi} = \frac{B x_i}{A + \sum_{j=1}^N x_j} = \text{const.}$$
This solution is not convenient because it would mean an instant jump from 0 to the maximal value of $z_{xi}$ when x is applied (see the initial condition), or it would be the trivial solution $z_x = x = \hat{0}$. As it was obtained under the assumption $\frac{dz_{xi0}}{dt} \neq 0$, this means that the assumption is not valid and thus $z_{xi0}$ has to be considered constant. Then the general solution of (5.1a) is:
$$z_{xi} = z_{xi0} \exp\Big[ -\Big( A + \sum_{j=1}^N x_j \Big) t \Big] + \text{const.}, \qquad z_{xi0} = \text{const.}$$
and then, replacing back into (5.1a) and using the first boundary condition, gives (5.2a).
2. From equation (5.1b), by separating variables and integrating, for $t \in [t_0, \infty)$:
$$z_{xi} = z_{xi0}\, e^{-A(t - t_0)}, \qquad z_{xi0} = \text{const.}$$
Then, from the continuity condition and (5.2a), $z_{xi0}$ is:
$$z_{xi0} = \frac{B x_i}{A + \sum_{j=1}^N x_j} \Big\{ 1 - \exp\Big[ -\Big( A + \sum_{j=1}^N x_j \Big) t_0 \Big] \Big\}$$
The output of a neuron from the input layer as a function of time is shown in figure 5.3. The maximum value attainable at the output is $z_x^{\max} = \frac{Bx}{A + x^T \hat{1}}$, for $t = t_0 \to \infty$.
Figure 5.3: The output of a neuron from the input layer as a function of time.
✍ Remarks:
➥ In practice, due to the exponential nature of the output, values close to $z_{xi}^{\max}$ are obtained for relatively small t; see figure 5.3: about 98% of the maximum value is attained at $t = t_0 = 4$.
➥ Even if the input vector x is big ($x_i \to \infty$, $i = \overline{1,N}$) the output is limited but proportional to the input:
$$z_x^{\max} = \frac{B x}{A + x^T \hat{1}} = \frac{B\, x^T \hat{1}}{A + x^T \hat{1}}\, \bar{x} \qquad \text{where} \qquad \bar{x} = \frac{x}{x^T \hat{1}},\quad \bar{x}_i = \frac{x_i}{\sum_{j=1}^N x_j}$$
$\bar{x}$ being named the reflectance pattern; it is "normalized" in the sense that it sums to unity: $\sum_{j=1}^N \bar{x}_j = 1$.

5.1.2
The Hidden Layer
The Instar
The neurons from the hidden layer are called instars.
The input vector is $z = \{z_i\}_{i=\overline{1,N+K}}$; here z will contain the outputs of both the x and y layers. Let H be the dimension³ (number of neurons) and $z_H = \{z_{Hk}\}_{k=\overline{1,H}}$ the output vector of the hidden layer. Let $\{w_{ki}\}$, $k=\overline{1,H}$, $i=\overline{1,N+K}$, be the weight matrix (by which z enters the hidden layer), such that the input to neuron k is $W(k,:)\, z$.
The equations governing the behavior of a hidden neuron k are defined in a similar way as those of the input layer (z is applied from $t = 0$ to $t = t_0$):
$$\frac{dz_{Hk}}{dt} = -a z_{Hk} + b\, W(k,:)\, z \quad \text{for } t \in [0, t_0) \tag{5.3a}$$
$$\frac{dz_{Hk}}{dt} = -a z_{Hk} \quad \text{for } t \in [t_0, \infty) \tag{5.3b}$$
³ The fact that this equals the number of classes is not a coincidence; later it will be shown that there has to be at least one hidden neuron to represent each class.
where $a, b \in \mathbb{R}_+$, $a, b = \text{const.}$, and the boundary conditions are $z_{Hk}(0) = 0$, $\lim_{t\to\infty} z_{Hk}(t) = 0$ and, for continuity purposes, $\lim_{t \nearrow t_0} z_{Hk}(t) = \lim_{t \searrow t_0} z_{Hk}(t)$.
The weight matrix is defined to change as well (the network is learning) according to the following equation:
$$\frac{dW(k,:)}{dt} = \begin{cases} -c\, [W(k,:) - z^T] & \text{if } z \neq \hat{0} \\ \hat{0}^T & \text{if } z = \hat{0} \end{cases} \tag{5.4}$$
where $c \in \mathbb{R}_+$, $c = \text{const.}$, with boundary condition $W(k,:)(0) = \hat{0}^T$; the second case is introduced to avoid the forgetting process.
✍ Remarks:
➥ In the absence of the input vector ($z = \hat{0}$), if the learning process would continue then:
$$\frac{dW(k,:)}{dt} = -c\, W(k,:) \;\Rightarrow\; W(k,:) = \mathcal{C}\, e^{-ct} \xrightarrow{t \to \infty} \hat{0}^T$$
($\mathcal{C}$ being a constant row matrix), i.e. the network would eventually forget the learned weights.
Assuming that the weight matrix is changing much slower than the neuron output, then $W(k,:)\, z \simeq \text{const.}$ and the solutions of (5.3a) and (5.3b) are:
$$z_{Hk} = \frac{b}{a}\, W(k,:)\, z\, \big( 1 - e^{-at} \big) \quad \text{for } t \in [0, t_0) \tag{5.5a}$$
$$z_{Hk} = \frac{b}{a}\, W(k,:)\, z\, \big( 1 - e^{-at_0} \big)\, e^{-a(t - t_0)} \quad \text{for } t \in [t_0, \infty) \tag{5.5b}$$
Proof. It is proven in the same way as for the input layer; see section 5.1.1, proof of equations (5.2a) and (5.2b).
The output of a hidden neuron is similar to the output of the input layer, see also figure 5.3, with the remark that the maximal possible value is $z_{Hk}^{\max} = \frac{b}{a}\, W(k,:)\, z$ for $t, t_0 \to \infty$.
Assuming that an input vector z is applied and kept sufficiently long, the solution of (5.4) is of the form:
$$W(k,:) = z^T \big( 1 - e^{-ct} \big)$$
i.e. $W(k,:)$ moves towards z. If z is kept applied sufficiently long then $W(k,:) \xrightarrow{t\to\infty} z^T$.
Proof. The differential equation (5.4) is very similar to previously encountered equations. It may also be proven by direct replacement.
Let there be a set of input vectors $\{z_p\}_{p=\overline{1,P}}$ applied as follows: $z_1$ between $t = 0$ and $t = t_1$, ..., $z_P$ between $t = t_{P-1}$ and $t = t_P$. Then the learning procedure is:
➀ Initialize the weight matrix: $W(k,:) = \hat{0}^T$.
➁ Considering $t_0 \equiv 0$, calculate the next values of the weights:
$$W(k,:)(t_1) = z_1^T \big( 1 - e^{-ct_1} \big)$$
$$W(k,:)(t_2) = z_2^T \big( 1 - e^{-c(t_2 - t_1)} \big) + W(k,:)(t_1)\, e^{-c(t_2 - t_1)} \quad \ldots$$
$$W(k,:)(t_P) = z_P^T \big( 1 - e^{-c(t_P - t_{P-1})} \big) + W(k,:)(t_{P-1})\, e^{-c(t_P - t_{P-1})} \simeq \sum_{p=1}^P z_p^T \big( 1 - e^{-c(t_p - t_{p-1})} \big)$$
(the approximation neglects the decay of the older contributions). The final weight vector $W(k,:)$ represents a linear combination of all input vectors $\{z_p\}_{p=\overline{1,P}}$. Because the coefficients $1 - e^{-c(t_p - t_{p-1})}$ are all positive, the final direction of $W(k,:)$ points to an "average" direction of the $\{z_p\}_{p=\overline{1,P}}$, and this is exactly the purpose of defining (5.4) as it was. See figure 5.4.
Figure 5.4: The directions taken by the weight vectors, relative to the inputs, in the hidden layer.
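In discrete time, (5.4) becomes the simple moving-towards-the-input rule used later in section 5.2.2. A small sketch (names and constants illustrative) showing how the weight vector settles on an "average" direction of the inputs:

    import numpy as np

    def instar_update(w, z, c=0.1):
        # discrete-time form of (5.4): move W(k,:) towards the input z
        return w + c * (z - w)

    rng = np.random.default_rng(1)
    zs = rng.normal(size=(4, 3))
    zs /= np.linalg.norm(zs, axis=1, keepdims=True)   # normalized inputs
    w = np.zeros(3)
    for _ in range(50):                               # cycle through the inputs
        for z in zs:
            w = instar_update(w, z)
    # w now points close to the average direction of the four z vectors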
✍ Remarks:
➥ It is also possible to give each $z_p$ a time slice $\Delta t$ and, when finishing with $z_P$, to start over with $z_1$, till some (external) stop conditions are met.
The trained instar is able to classify the direction of the input vectors (see (5.5a)):
$$z_{Hk} \propto W(k,:)\, z = \|W(k,:)\|\, \|z\| \cos \widehat{(W(k,:),\, z)} \propto \cos \widehat{(W(k,:),\, z)}$$
The Competitive Network
The hidden layer of the CPN is made out of instars interconnected such that each inhibits all the others and eventually there is only one "winner" (the instars compete against each other). The purpose of the hidden layer is to classify the normalized input vector z (which is proportional to the input). Its output is of the one-of-k encoding type, i.e. all neurons have output zero except one neuron k. Note that it is assumed that classes do not overlap, i.e. an input vector may belong to one class only.
Figure 5.5: The competitive (hidden) layer.
The property of the instars, that their associated weight vector moves towards an average of the inputs, has to be preserved, but a feedback function is to be added, to ensure the required output. Let $f = f(z_{Hk})$ be the feedback function of the instars, i.e. the value added at the neuron input. See figure 5.5.
Then the instar equations (5.3a) and (5.3b) are redefined as:
$$\frac{dz_{Hk}}{dt} = -a z_{Hk} + b\, [f(z_{Hk}) + W(k,:)\, z] - z_{Hk} \sum_{\ell=1}^{N+K} [f(z_{H\ell}) + W(\ell,:)\, z] \quad \text{for } t \in [0, t_0) \tag{5.6a}$$
$$\frac{dz_{Hk}}{dt} = -a z_{Hk} \quad \text{for } t \in [t_0, \infty) \tag{5.6b}$$
where $a, b \in \mathbb{R}_+$, $a, b = \text{const.}$; the expression $W(k,:)\, z + f(z_{Hk})$ represents the total input of hidden neuron k. In matrix notation:
$$\frac{dz_H}{dt} = -a z_H + b\, [f(z_H) + W z] - z_H \odot \sum_{\ell=1}^{N+K} [f(z_{H\ell}) + W(\ell,:)\, z] \quad \text{for } t \in [0, t_0)$$
$$\frac{dz_H}{dt} = -a z_H \quad \text{for } t \in [t_0, \infty)$$
The feedback function f has to be selected such that the hidden layer performs as a competitive layer, i.e. at equilibrium all neurons will have output zero except one, the "winner", which will have output "1". This type of behaviour is also known as "winner-takes-all".
For a feedback function of the type $f(z) = z^r$ where $r > 1$, equations (5.6) define a competitive layer.
Proof. First a change of variable is performed:
$$\widetilde{z}_{Hk} = \frac{z_{Hk}}{z_{H,\text{tot}}} \quad \text{where} \quad z_{H,\text{tot}} = \sum_{\ell=1}^{N+K} z_{H\ell} \quad \Rightarrow \quad z_{Hk} = \widetilde{z}_{Hk}\, z_{H,\text{tot}} \tag{5.7}$$
By summing over k in (5.6a) (and writing $f(z_{H\ell}) = f(\widetilde{z}_{H\ell}\, z_{H,\text{tot}})$):
$$\frac{dz_{H,\text{tot}}}{dt} = -a z_{H,\text{tot}} + (b - z_{H,\text{tot}}) \sum_{\ell=1}^{N+K} [f(\widetilde{z}_{H\ell}\, z_{H,\text{tot}}) + W(\ell,:)\, z] \tag{5.8}$$
and substituting $z_{Hk} = \widetilde{z}_{Hk}\, z_{H,\text{tot}}$ into (5.6a):
$$\frac{d(\widetilde{z}_{Hk}\, z_{H,\text{tot}})}{dt} = -a \widetilde{z}_{Hk}\, z_{H,\text{tot}} + b\, [f(\widetilde{z}_{Hk}\, z_{H,\text{tot}}) + W(k,:)\, z] - \widetilde{z}_{Hk}\, z_{H,\text{tot}} \sum_{\ell=1}^{N+K} [f(\widetilde{z}_{H\ell}\, z_{H,\text{tot}}) + W(\ell,:)\, z] \tag{5.9}$$
As $\frac{d(\widetilde{z}_{Hk}\, z_{H,\text{tot}})}{dt} = z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} + \widetilde{z}_{Hk} \frac{dz_{H,\text{tot}}}{dt}$, from (5.8) and (5.9):
$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} = b\, [f(\widetilde{z}_{Hk}\, z_{H,\text{tot}}) + W(k,:)\, z] - b\, \widetilde{z}_{Hk} \sum_{\ell=1}^{N+K} [f(\widetilde{z}_{H\ell}\, z_{H,\text{tot}}) + W(\ell,:)\, z] \tag{5.10}$$
The following cases, with regard to the feedback function, may be discussed:
• The identity function: $f(\widetilde{z}_{Hk}\, z_{H,\text{tot}}) = \widetilde{z}_{Hk}\, z_{H,\text{tot}}$. Then, by using $\sum_{\ell=1}^{N+K} \widetilde{z}_{H\ell} = 1$, the equations become:
$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} = b\, W(k,:)\, z - b\, \widetilde{z}_{Hk} \sum_{\ell=1}^{N+K} W(\ell,:)\, z$$
The stable value, obtained by setting $\frac{d\widetilde{z}_{Hk}}{dt} = 0$, is:
$$\widetilde{z}_{Hk} = \frac{W(k,:)\, z}{\sum_{\ell=1}^{N+K} W(\ell,:)\, z} \quad \Rightarrow \quad \widetilde{z}_{Hk} \propto W(k,:)\, z$$
i.e. the relative outputs simply mirror the relative inputs; there is no competition.
• The square function: $f(\widetilde{z}_{Hk}\, z_{H,\text{tot}}) = (\widetilde{z}_{Hk}\, z_{H,\text{tot}})^2$. Again, as $\sum_{\ell=1}^{N+K} \widetilde{z}_{H\ell} = 1$, (5.10) can be rewritten in the form:
$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} = b\, \widetilde{z}_{Hk} \sum_{\ell=1}^{N+K} \widetilde{z}_{H\ell} \Big[ \frac{f(\widetilde{z}_{Hk}\, z_{H,\text{tot}})}{\widetilde{z}_{Hk}\, z_{H,\text{tot}}} - \frac{f(\widetilde{z}_{H\ell}\, z_{H,\text{tot}})}{\widetilde{z}_{H\ell}\, z_{H,\text{tot}}} \Big] + b\, W(k,:)\, z - b\, \widetilde{z}_{Hk} \sum_{\ell=1}^{N+K} W(\ell,:)\, z$$
Then, considering the expression of f, the bracketed term reduces to $z_{H,\text{tot}} (\widetilde{z}_{Hk} - \widetilde{z}_{H\ell})$, which for $\widetilde{z}_{Hk} > \widetilde{z}_{H\ell}$ represents an amplification while for $\widetilde{z}_{Hk} < \widetilde{z}_{H\ell}$ it represents an inhibition. The term $b\, W(k,:)\, z$ is a constant with respect to $\widetilde{z}_{Hk}$ and the term $-b\, \widetilde{z}_{Hk} \sum_{\ell} W(\ell,:)\, z$ represents an inhibitory term.
So the differential equations describe the behaviour of a network where each neuron inhibits all the others with less output than its own and is inhibited by the neurons which have greater output. The gap between neurons with high output and those with low output gets amplified. In this case the layer acts as a "winner-takes-all" network: eventually only one neuron, the one with the largest $\widetilde{z}_{Hk}$, will have a non-zero output.
The same discussion holds for $f(\widetilde{z}_{Hk}\, z_{H,\text{tot}}) = (\widetilde{z}_{Hk}\, z_{H,\text{tot}})^r$ where $r > 1$.
And finally, only the winning neuron, let k be that one, has to be affected by the learning process: this neuron will represent the class $C_k$ to which the input vector belongs and its associated weight vector $W(k,:)$ has to be moved towards the average "representative" $\{\langle x \rangle_k, \langle y \rangle_k\}$ (combined to form one vector); all other neurons should remain untouched (weights unchanged).
Two points to note here:
• It becomes obvious that there should be at least one hidden neuron for each class.
• Several neurons may represent the same class (but there will be only one winner at a time). This is particularly necessary if classes are represented by unconnected domains in $\mathbb{R}^{N+K}$ (because for a single representative neuron the weight vector, moving towards the average input for the class, may become stuck somewhere in the middle and represent another class), or for cases with deep intricacy between the various classes. See also figure 5.6.
5.1.3 The Output Layer
The neurons on the output level are called outstars. The output level contains 2 layers: x' of dimension N and y' of dimension K (the same as for the input layers). For both, the discussion goes the same way. The weight matrix describing the connection strengths between the hidden and output layers is W'.
The purpose of the output level is to retrieve a pair $\{\langle x \rangle_k, \langle y \rangle_k\}$, where $\langle x \rangle_k$ is closest to the input x, from an "average" over previously learned training vectors of the same class. The x' layer is discussed below; the y' part is, mutatis mutandis, identical.
The differential equations governing the behavior of the outstars are defined as:
$$\frac{dx_i'}{dt} = -A' x_i' + B' x_i + C'\, W'(i,:)\, z_H \quad \text{for } t \in [0, t_0)$$
$$\frac{dx_i'}{dt} = -A' x_i' + C'\, W'(i,:)\, z_H \quad \text{for } t \in [t_0, \infty)$$
where $A', B', C' \in \mathbb{R}_+$, $A', B', C' = \text{const.}$, or in matrix notation:
$$\frac{dx'}{dt} = -A' x' + B' x + C'\, W'(1{:}N, :)\, z_H \quad \text{for } t \in [0, t_0) \tag{5.11a}$$
$$\frac{dx'}{dt} = -A' x' + C'\, W'(1{:}N, :)\, z_H \quad \text{for } t \in [t_0, \infty) \tag{5.11b}$$
with boundary condition $x'(0) = \hat{0}$.
The weight matrix is changing (the network is learning) by construction, according to the following equation:
$$\frac{dW'(1{:}N, k)}{dt} = \begin{cases} -D'\, W'(1{:}N, k) + E'\, x & \text{if } z_{Hk} \neq 0 \\ \hat{0} & \text{if } z_{Hk} = 0 \end{cases} \tag{5.12}$$
with the boundary condition $W'(1{:}N, k)(0) = \hat{0}$ (the part for $z_{Hk} = 0$ being defined in order to avoid weight "decaying"). Note that for a particular input vector there is just one winner on the hidden layer and thus just one column of matrix W' gets changed, all the others remaining the same (i.e. just the connections coming from the hidden winner to the output layer get updated).
It is assumed that the weight matrix is changing much slower than the neuron output. Considering that the hidden layer is also much faster than the output (only the asymptotic behaviour is of interest here), i.e. $W'(i,:)\, z_H \simeq \text{const.}$, the solution of (5.11a) is:
$$x' = \Big[ \frac{B'}{A'}\, x + \frac{C'}{A'}\, W'(1{:}N, :)\, z_H \Big] \big( 1 - e^{-A' t} \big)$$
Proof. For each $x_i'$: $B' x_i + C'\, W'(i,:)\, z_H = \text{const.}$, and then the solution is built the same way as for the previous differential equations, by starting from the solution of the homogeneous equation. See also the proof of equations (5.2). The boundary condition is used to find the integration constant.
The solution of the weight update formula (5.12) is:
$$W'(1{:}N, k) = \frac{E'}{D'}\, x\, \big( 1 - e^{-D' t} \big)$$
Proof. The same way as for x' above, for each $w_{ik}'$ separately.
The asymptotic behaviours for $t \to \infty$ of x' and $W'(1{:}N, k)$ are (from the solutions found):
$$\lim_{t\to\infty} x' = \frac{B'}{A'}\, x + \frac{C'}{A'}\, W'(1{:}N, :)\, z_H \qquad \text{and} \qquad \lim_{t\to\infty} W'(1{:}N, k) = \frac{E'}{D'}\, x \tag{5.13}$$
After training with a set of $\{x_p, y_p\}$ vectors from the same class k, the weights will be $W'(1{:}N, k) \propto \langle x \rangle_k$, respectively $W'(N{+}1{:}N{+}K, k) \propto \langle y \rangle_k$.
At runtime $x \neq \hat{0}$ and $y = \hat{0}$, to retrieve a y' (or vice versa, to retrieve an x'). But (similarly to x'):
$$\lim_{t\to\infty} y' = \frac{B'}{A'}\, y + \frac{C'}{A'}\, W'(N{+}1{:}N{+}K, :)\, z_H$$
and, as $z_H$ represents a one-of-k encoding, $W'(N{+}1{:}N{+}K, :)\, z_H$ selects the column k of W' (for the corresponding winner) and, as $y = \hat{0}$:
$$y' \propto W'(N{+}1{:}N{+}K, k) \propto \langle y \rangle_k$$
➧ 5.2 CPN Dynamics
✍ Remarks:
➥ In simulations on digital systems the normalization of the vectors and the decision over the winner in the hidden layer may be done by separate processes, thus simplifying and speeding up the network running.
➥ The process uses the asymptotic (equilibrium) values to avoid the actual solving of the differential equations.
➥ The vector norm may be written as $\|x\| = \sqrt{x^T x}$.
5.2.1 Network Running
To generate the corresponding y vector for an input x:
➀ The input layer normalizes the input vector and distributes the result to the hidden layer. For the y part a null vector $\hat{0}$ is applied:
$$z(1{:}N) = \frac{x}{\|x\|} \qquad\quad z(N{+}1{:}N{+}K) = \hat{0} \quad (\text{as } \|y\| = 0)$$
i.e. z is the combination of the vectors $\frac{x}{\|x\|}$ and $\hat{0}$ into a single vector.
➁ The hidden layer is of "winner-takes-all" type. To avoid a differential equation calculation the weight vectors $W(\ell,:)$ are normalized. This way the closest one to the normalized input z is found by doing a simple scalar product: $W(\ell,:)\, z \propto \cos \widehat{(W(\ell,:),\, z)}$.
The "raw" outputs of the hidden neurons are calculated first: $z_{H\ell} \propto W(\ell,:)\, z$; then the neuron with the largest output is declared winner and it gets output 1, all the others get output 0. Let k be the winning neuron, i.e. $W(k,:)\, z = \max_\ell W(\ell,:)\, z$. Then initialize $z_H = \hat{0}$ and afterwards make $z_{Hk} = 1$, so all outstars receive an input vector of the form:
$$z_H^T = \begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 \end{pmatrix} \tag{5.14}$$
where all $z_{H\ell}$ are zero, except $z_{Hk}$.
Note that, as $y = \hat{0}$, all things happen in the subspace $\mathbb{R}^N \subset \mathbb{R}^{N+K}$. The scalar product may be replaced with the scalar product between $\frac{x}{\|x\|}$ and the projection of $W(\ell,:)$, i.e. $W(\ell, 1{:}N)$.
➂ From (5.13), making $y = \hat{0}$, $C' = A'$ and $E' = D'$, the output of the y' layer is $y' = W'(N{+}1{:}N{+}K, k)$, i.e. the winner of the hidden layer selects which column of W' will be chosen as the output ($W'(1{:}N, k)$ represents the x' while $W'(:, k)$ represents the joining of x' and y'), as the multiplication between W' and a vector of type (5.14) selects column k of W'.
To generate the corresponding x from a y, i.e. working in reverse, the procedure is the same (changing $x \leftrightarrow y$).
5.2 See [FS92] pp. 235–239.
0
5.2.2
➀
➁
Network Learning
An input vector, randomly selected, is applied to the input layer.
The input layer normalize the input vector and distribute it to the hidden layer.
=p T x T
x x+y y
y
z(N + 1 : N + K ) = p T
x x + yT y
z(1 : N )
➂
i.e. z is the normalized combination of vectors x and y to form a single vector.
The training of the hidden layer is done rst. The weights are initialized with randomly
selected normalized input vectors. This ensure both the normalization of weight
vectors W (`; :) and a good spread of them.
The hidden layer is of \winner-takes-all" type. To avoid a di erential equation calculation, and as W (`; :) are normalized, the closest W (`; :) to z is found by using the
scalar product W (`; :) z / cos(W (`; :) z).
The \raw" output of the hidden neurons is calculated rst as scalar products: zH` /
W (`; :) z, then the neuron with the largest output, i.e. the k one for which W (k; :) z
= max
W (`; :) z, is declared winner and it gets the output 1, all others get output 0:
`
\
zH = 0b and afterwards
zHk = 1
and so all outstars receive an input vector of the form
;
zTH = 0
::: 1 ::: 0
where all zH` are zero with one exception zHk .
The weight of the winning neuron is updated according to the equation (5.4). In
discrete time approximation: such that:
dt ! t = 1 and dW (k; :) ! W (k; :) = W (k; :)(t + 1) ; W (k; :)(t) )
W (k; :)(t + 1) = W (k; :)(t) + c[zT ; W (k; :)(t)]
The above updating is repeated for all input vectors.
The training is repeated until the input vectors are recognized correctly, e.g. till the
angle between the input vector and the output vector is less than some maximum
error speci ed: cos(W (k; :); z) < ".
\
➃ The network may be tested with some input vectors not used before. If the classification is good (the error is under the maximal one) then the training of the hidden layer is done, else the training continues.
After the training of the hidden layer is finished, the training of the output layer begins. An input vector is applied; the input layer normalizes it and distributes it to the trained hidden layer. On the hidden layer only one neuron is the winner and has non-zero output; let k be that one, such that the output vector becomes $z_H^T = (0\ \ldots\ 1\ \ldots\ 0)$, with 1 in the k position.
The weights of the winning neuron are updated according to equation (5.12). In discrete time approximation, $dt \to \Delta t = 1$ and $dW'(:, k) \to \Delta W'(:, k) = W'(:, k)(t+1) - W'(:, k)(t)$, so:
$$W'(1{:}N, k)(t+1) = W'(1{:}N, k)(t) + E'\, [x - W'(1{:}N, k)(t)] \tag{5.15}$$
$$W'(N{+}1{:}N{+}K, k)(t+1) = W'(N{+}1{:}N{+}K, k)(t) + E'\, [y - W'(N{+}1{:}N{+}K, k)(t)]$$
The above updating is repeated for all input vectors.
The training is repeated until the input vectors are recognized correctly, e.g. till the error is less than some specified maximum: $|w_{ik}' - x_i| < \varepsilon$ for $i = \overline{1,N}$, and similarly for y.
✍ Remarks:
➥ From the learning procedure it becomes clear that the CPN is in fact a system composed of several semi-independent subnetworks: the input level, which normalizes the input; the hidden layer, of "winner-takes-all" type; and the output level, which generates the actual required output. Each level is independent and the training of the next layer starts only after the learning in the preceding layer has been done.
➥ Usually the CPN is used to classify an input vector x as belonging to a class represented by $\langle x \rangle_k$. A set of input vectors $\{x_p\}$ is used to train the network such that it will output $\{\langle x \rangle_k, \langle y \rangle_k\}$ if $x \in C_k$ (see also figure 5.6).
Figure 5.6: The "stuck vector" situation.
➥ An unfortunate choice of the weight vectors $W(\ell,:)$ for the hidden layer may lead to the "stuck vector" situation, when one neuron from the hidden layer may never win. See figure 5.6: the vector $W(2,:)$ will move towards $x_{1,2,3,4}$ during learning and will become representative for both classes 1 and 2; the corresponding hidden neuron 2 will be a winner for both classes.
The "stuck vector" situation can be avoided by two means. One is to initialize each weight with a vector belonging to the class for which the corresponding hidden neuron should win (the most representative if possible, e.g. obtained by averaging over the training set). The other is to attach to the neuron an "overloading device": if the neuron wins too often (e.g. during training it wins more than the number of training vectors from the class it is supposed to represent), then it will shut down, allowing other neurons to win and the corresponding weights to be changed.
➥ The hidden layer should have at least as many neurons as the number of classes to be recognized. At least one neuron is needed to win the "competition" for the class it represents. If classes form unconnected domains in the input space $z \in \mathbb{R}^{N+K}$ then at least one neuron is necessary for each connected domain. Otherwise the "stuck vector" situation is likely to appear.
➥ For the output layer the critical point is to select an adequate learning constant E': the learning constant can be chosen small, $0 \lesssim E' \ll 1$, at the beginning and increased later, when $x_i - w_{ik}'(t)$ decreases; see equation (5.15).
➥ Obviously the hidden layer may be replaced by any other system able to perform a one-of-k encoding.
➧ 5.3 The Algorithm
The running procedure
1. The x vector is assumed to be known and the corresponding $\langle y \rangle_k$ is to be retrieved. For the reverse situation, when y is known and $\langle x \rangle_k$ is to be retrieved, the algorithm is similar, changing $x \leftrightarrow y$.
Make the y part null ($\hat{0}$) at the input and compute the normalized input vector z:
$$z(1{:}N) = \frac{x}{\sqrt{x^T x}} \qquad \text{and} \qquad z(N{+}1{:}N{+}K) = \hat{0}$$
2. Find the winning neuron on the hidden layer: the k for which $W(k,:)\, z = \max_\ell W(\ell,:)\, z$.
3. Find the y' vector in the W' matrix:
$$y' = W'(N{+}1{:}N{+}K, k)$$
The learning procedure
1. Let x and y be one set of input vectors.
Let N be the dimension of the "x" part and K the dimension of the "y" part. Then N + K is the number of neurons in the input layer.
2. For all $\{x_p, y_p\}$ training sets, normalize the input vectors, i.e. compute $z_p$:
$$z_p(1{:}N) = \frac{x_p}{\sqrt{x_p^T x_p + y_p^T y_p}} \qquad\quad z_p(N{+}1{:}N{+}K) = \frac{y_p}{\sqrt{x_p^T x_p + y_p^T y_p}}$$
Note that the input layer does just a normalisation of the input vectors. No further training is required.
3. Initialize the weights on the hidden layer. For all H neurons ($\ell = \overline{1,H}$) in the hidden layer select a representative input vector $z_\ell$ for class $\ell$ and then set $W(\ell,:) = z_\ell^T$ (this way the weight vectors become automatically normalized).
Note that in the extreme case there may be just one vector available for training for each class. In this case that vector becomes the "representative".
4. Train the hidden layer. For all normalized training vectors $z_p$ find the winning neuron on the hidden layer, the k for which $W(k,:)\, z_p = \max_\ell W(\ell,:)\, z_p$. Update the winner's weights:
$$W_{\text{new}}(k,:) = W_{\text{old}}(k,:) + c\, [z_p^T - W_{\text{old}}(k,:)]$$
The training of the hidden layer has to be finished before moving forward to the output layer.
5. Initialize the weights on the output layer. As for the hidden layer, select a representative input vector pair $\{x_\ell, y_\ell\}$ for each class $C_\ell$:
$$W'(1{:}N, \ell) = x_\ell \qquad\quad W'(N{+}1{:}N{+}K, \ell) = y_\ell$$
Another possibility would be to make an average over several $\{x_p, y_p\}$ belonging to the same class.
6. Train the output layer. For all training vectors $z_p$ find the winning neuron on the hidden layer, the k for which $W(k,:)\, z_p = \max_\ell W(\ell,:)\, z_p$. Update the winner's output weights:
$$W'_{\text{new}}(1{:}N, k) = W'_{\text{old}}(1{:}N, k) + E'\, [x_p - W'_{\text{old}}(1{:}N, k)]$$
$$W'_{\text{new}}(N{+}1{:}N{+}K, k) = W'_{\text{old}}(N{+}1{:}N{+}K, k) + E'\, [y_p - W'_{\text{old}}(N{+}1{:}N{+}K, k)]$$
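A compact, hedged sketch of the whole procedure above (one representative per class, no stuck-vector handling); X and Y hold the training pairs as rows, reps the indices of the representative pairs, and E stands for the learning constant E'.

    import numpy as np

    def cpn_train(X, Y, reps, c=0.1, E=0.1, epochs=1):
        Z = np.hstack([X, Y])
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # step 2: normalize
        W = Z[reps].copy()                    # step 3: hidden weights
        for _ in range(epochs):               # step 4: train the hidden layer
            for z in Z:
                k = np.argmax(W @ z)          # winner
                W[k] += c * (z - W[k])
        N = X.shape[1]
        Wp = np.hstack([X[reps], Y[reps]]).T.copy()   # step 5: output weights W'
        for _ in range(epochs):               # step 6: train the output layer
            for z, x, y in zip(Z, X, Y):
                k = np.argmax(W @ z)
                Wp[:N, k] += E * (x - Wp[:N, k])
                Wp[N:, k] += E * (y - Wp[N:, k])
        return W, Wp

    def cpn_run(W, Wp, x):
        # running procedure: the y part is null, the winner selects a W' column
        N = len(x)
        z = np.concatenate([x / np.linalg.norm(x), np.zeros(Wp.shape[0] - N)])
        k = np.argmax(W @ z)
        return Wp[N:, k]                      # retrieved <y>_k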
➧ 5.4 Applications
5.4.1 Letter classification
Figure 5.7: The representative set for letters A, B, C.
Figure 5.8: The training set for letters A, B, C.
Being given a set of letters as binary images in a 5 × 6 matrix, the network has to correctly associate the ASCII code to the image even if there are missing parts or noise in the image. The letters are uppercase A, B, C, so there are 3 classes and correspondingly 3 neurons on the hidden layer. The representative letters are in figure 5.7.
The x vectors are created by reading the graphical representation of the characters by rows; a "dot" gets a 1, its absence gets a 0:
$$x^T_{\text{rep.}}(A) = (\;00100 \ \ 01010 \ \ 10001 \ \ 11111 \ \ 10001 \ \ 10001\;)$$
$$x^T_{\text{rep.}}(B) = (\;11110 \ \ 10001 \ \ 11110 \ \ 10001 \ \ 10001 \ \ 11110\;)$$
$$x^T_{\text{rep.}}(C) = (\;01111 \ \ 10000 \ \ 10000 \ \ 10000 \ \ 10000 \ \ 01111\;)$$
such that they are 30-dimensional vectors.
The ASCII codes are 65 for A, 66 for B and 67 for C. They are converted to binary format, each giving an 8-dimensional y vector:
$$y^T(A) = (\;0\ 1\ 0\ 0\ 0\ 0\ 0\ 1\;) \qquad y^T(B) = (\;0\ 1\ 0\ 0\ 0\ 0\ 1\ 0\;)$$
$$y^T(C) = (\;0\ 1\ 0\ 0\ 0\ 0\ 1\ 1\;)$$
Figure 5.9: The testing set for letters A, B, C.
The training letters are depicted in figure 5.8: the first set has a "dot" missing (information missing), the second one contains a supplementary "dot" (noise added) while the third set contains both.
The test letters are depicted in figure 5.9: the first 3 sets are similar, but not identical, to the training sets and they were not used in training. The system is able to recognize correctly even the fourth set (labeled d), from which a large amount of information is missing.
✍ Remarks:
➥ The conversion of the training and test sets to binary x vectors is done in a similar way as for the representative set.
➥ The training was done just once over the training set (one epoch), with the following constants: $c = 0.1$ and $E' = 0.1$.
➥ At run-time the y vector becomes $\hat{0}$.
➥ If a large part of the information is missing then the system may misclassify the letters, due to the fact that there are "dots" in common positions (especially for letters B and C).
CHAPTER 6
Adaptive Resonance Theory (ART)
The ART networks are an example of an ANN composed of several subsystems, able to resume learning at a later stage without having to restart from scratch.
➧ 6.1 The ART1 Architecture
The ART1 network is made from 2 main layers of neurons, F1 and F2, a gain control unit (neuron) and a reset unit (neuron). See figure 6.1. The ART1 network works only with binary vectors.
The 2/3 rule. The neurons from the F1 layer receive inputs from 3 sources: the input, the gain control and F2. The neurons from the F2 layer also receive inputs from 3 sources: F1, the gain control and the reset unit. Both layers F1 and F2 are built such that they become active if and only if 2 out of 3 input sources are active.
✍ Remarks:
➥ The input is considered to be active when the input vector is non-zero, i.e. it has at least one non-zero component.
➥ The F2 layer is considered active when its output vector is non-zero, i.e. it has at least one non-zero component.
The propagation of signals through the network is done as follows:
➀ The input vector is distributed to the F1 layer, the gain control unit and the reset unit. Each component of the input vector is distributed to a different F1 neuron; F1 has the same dimension N as the input x.
6.1 See [FS92] pp. 293–298.
Figure 6.1: The ART1 network architecture: input vector, layer F1, gain control, reset unit and layer F2 (with lateral feedback); the F1 and F2 layers are fully interconnected, both ways. Inhibitory inputs are marked with a distinct symbol.
The output of F1 is sent as inhibitory signal to the reset unit. The design of the
network is such that the inhibitory signal from F1 cancels the input vector and the
reset unit remains inactive.
The gain control unit send a nonspeci c excitatory signal to F1 layer (an identical
signal to all neurons from F1 ).
F2 receives the output of F1 (all neurons between F1 and F2 are fully interconnected).
The F2 layer is of contrast enhancement type: only one neuron should trigger for a
given pattern (or, in a more generalized case only few neurons should \ re" | have
a nonzero output).
The output of F2 is sent back as excitatory signal to F1 and as inhibitory signal to
the gain control unit. The design of the network is such that if the gain control unit
receives an inhibitory signal from F2 it ceases activity.
Then F1 receives signals from F2 and input (the gain control unit have been deactivated). The output of the F1 layer changes such that it isn't anymore identical to
the rst one, because the overall input had changed: the gain control unit ceases
activity and | instead | the F2 sends its output to F1 . Also there is the 2/3 rule
which have to be taken into account: only those F1 neurons who receive input from
both input and F2 will trigger. Because the output of F1 had changed, the reset unit
becomes active.
The reset unit send a reset signal to the active neuron(s) from the F2 layer which forces
6.2. ART1 DYNAMICS
➇
87
it (them) to become inactive for a long period of time, i.e. they do not participate into
the next network pattern matching cycle(s). The inactive neurons are not a ected.
The output of F2 disappears due to the reset action and the whole cycle is repeated
until a match is found i.e. the output of F2 causes F1 to output a pattern which will
not trigger the reset unit, because is identical to the rst one, or | no match was
found, the output of F2 is zero | a learning process begins in F2 .
The action of the reset unit (see previous step) ensures that a neuron already used in
the \past" will not be used again for pattern matching.
✍ Remarks:
➥ In complex systems an ART network may be just a link in a bigger chain. The F2 layer may receive signals from some other networks/layers. This will make F2 send a signal to F1 and, consequently, F1 may receive a signal from F2 before receiving the input signal.
A premature signal from the F2 layer usually means an expectation. If the gain control system, and consequently the 2/3 rule, were not in place, then the expectation coming from F2 would trigger an action in the absence of the input signal.
With the gain unit present, F2 cannot trigger a process by itself, but it can precondition the F1 layer such that, when the input arrives, the process of pattern matching starts at a position closer to the final state and takes less time.

➧ 6.2 ART1 Dynamics

(See [FS92] pp. 298-310.)

The equation describing the activation (total input) of a neuron j from the F1,2 layers is of the form:
\frac{da_j}{dt} = -a_j + (1 - A a_j)\,(\text{excitatory input}) - (B + C a_j)\,(\text{inhibitory input})    (6.1)

where A, B, C are positive constants.

✍ Remarks:
➥ These equations do not describe the actual output of the neurons, which is obtained from the activation by applying a "digitizing" function that transforms the activation into a binary vector.

6.2.1 The F1 layer
The neuron on the F1 layer receives the input x, the input z' from F2 and the input from the gain control unit as excitatory input. The inhibitory input is set to 1. See (6.1).

Let a'_k be the activation of neuron k from the F2 layer and f_2(a'_k) its output (f_2 being the activation function on layer F2, with f_2(a') = y), and let w_{jk} be the weight when entering neuron j on layer F1. Then the total input received by the F1 neuron from F2 is W(j,:)\,f_2(a').

The F2 layer is of competitive ("winner-takes-all") type: there is only one winning neuron, which has a non-zero output; all the others have null output. The output activation function for the F2 neurons is a binary function (see also the F2 section, below):

f_2(a'_k) = \begin{cases} 1 & \text{if the winner is } k \\ 0 & \text{otherwise} \end{cases}

The gain control unit is set such that if the input vector x \neq \widehat{0} and the vector from F2 is f_2(a') = \widehat{0} then its output is 1, otherwise it is 0:

g = \begin{cases} 1 & \text{if } x \neq \widehat{0} \text{ and } f_2(a') = \widehat{0} \\ 0 & \text{otherwise} \end{cases}
Finally the dynamic equation for a neuron j of the F1 layer becomes (from (6.1)):

\frac{da_j}{dt} = -a_j + (1 - A_1 a_j)\left[x_j + D_1 W(j,:)\,f_2(a') + B_1 g\right] - (B_1 + C_1 a_j)    (6.2)

where the constants A, B, C and D have been given the subscript 1 to denote that they refer to the F1 layer. Obviously here D_1 controls the amplitude of the W weights; it should be chosen such that all weights are w_{j\ell} \in [0, 1].
The following cases may be considered:

➀ The input is inactive (x = \widehat{0}) and F2 is inactive (f_2(a') = \widehat{0}). Then g = 0 and (6.2) becomes:

\frac{da_j}{dt} = -a_j - (B_1 + C_1 a_j)

At equilibrium \frac{da_j}{dt} = 0 and a_j = -\frac{B_1}{1 + C_1}, i.e. inactive F1 neurons have a negative activation.

➁ The input is active (x \neq \widehat{0}) but F2 is still inactive (f_2(a') = \widehat{0}): there was no time for the signal to travel from F1 to F2 and back (and to deactivate the gain control unit on the way back). The gain control unit is activated, g = 1, and (6.2) becomes:

\frac{da_j}{dt} = -a_j + (1 - A_1 a_j)(x_j + B_1) - (B_1 + C_1 a_j)

At equilibrium \frac{da_j}{dt} = 0 and

a_j = \frac{x_j}{1 + A_1 (x_j + B_1) + C_1}    (6.3)

i.e. neurons which received a non-zero input (x_j \neq 0) have a positive activation (a_j > 0) and neurons which received a zero input have their activation raised to zero.
➂ The input is active (x \neq \widehat{0}) and F2 is also active (f_2(a') \neq \widehat{0}). Then the gain control unit is deactivated (g = 0) and (6.2) becomes:

\frac{da_j}{dt} = -a_j + (1 - A_1 a_j)\left[x_j + D_1 W(j,:)\,f_2(a')\right] - (B_1 + C_1 a_j)

At equilibrium \frac{da_j}{dt} = 0 and

a_j = \frac{x_j + D_1 W(j,:)\,f_2(a') - B_1}{1 + A_1 \left[x_j + D_1 W(j,:)\,f_2(a')\right] + C_1}    (6.4)
The following cases may be discussed here:

(a) The input is maximum, x_j = 1, and the input from F2 is minimum, a' \to \widehat{0}. Because the gain control unit has been deactivated and the activity of the F2 layer is dropping to \widehat{0}, then, according to the 2/3 rule, the neuron has to switch to the inactive state and consequently a_j has to switch to a negative value. From (6.4):

\lim_{a' \to \widehat{0}} a_j = \lim_{a' \to \widehat{0}} \frac{x_j + D_1 W(j,:)\,f_2(a') - B_1}{1 + A_1 \left[x_j + D_1 W(j,:)\,f_2(a')\right] + C_1} < 0 \quad\Rightarrow\quad B_1 > 1    (6.5)

(as all the constants A_1, B_1, C_1, D_1 are positive definite).
(b) The input is maximum, x_j = 1, and the input from F2 is non-zero. The F2 layer is of contrast enhancement type ("winner takes all") and it has at most one winner; let k be that one, i.e. W(j,:)\,f_2(a') = w_{jk} f_2(a'_k). Then, according to the 2/3 rule, the neuron is active, the activation value should be a_j > 0, and (6.4) becomes:

1 + D_1 w_{jk} f_2(a'_k) - B_1 > 0 \quad\Rightarrow\quad w_{jk} f_2(a'_k) > \frac{B_1 - 1}{D_1}    (6.6)

There is a discontinuity between this condition and the preceding one: from (6.6), if w_{jk} f_2(a'_k) \to 0 then B_1 - 1 < 0, which seems to be in contradiction with the previous condition (6.5). Consequently this condition will be imposed on the W weights and not on the constants B_1 and D_1.
(c) The input is maximum, x_j = 1, and the input from F2 is maximum, i.e. w_{jk} f_2(a'_k) = 1 (see above, k is the F2 winning neuron). Then (6.4) gives:

1 + D_1 - B_1 > 0 \quad\Rightarrow\quad B_1 < D_1 + 1    (6.7)

As f_2(a'_k) = 1 (maximum) and, because of the choice of the D_1 constant, w_{jk} \in [0, 1], then w_{jk} = 1.

(d) The input is minimum, x \to \widehat{0}, and the input from F2 is maximum. Similarly to the first case above (and (6.4)):

\lim_{x \to \widehat{0}} a_j = \lim_{x \to \widehat{0}} \frac{x_j + D_1 W(j,:)\,f_2(a') - B_1}{1 + A_1 \left[x_j + D_1 W(j,:)\,f_2(a')\right] + C_1} < 0 \quad\Rightarrow\quad D_1 < B_1    (6.8)

(because of the 2/3 rule, at the limit the F1 neuron has to switch to the inactive state and subsequently has a negative activation).
(e) The input is minimum (x \to \widehat{0}) and the input from F2 is also minimum (a' \to \widehat{0}). Similarly to the above cases, the F1 neuron turns to the inactive state, so it will have a negative activation, and (6.4) (on similar premises as above) gives:

\lim_{x \to \widehat{0},\ a' \to \widehat{0}} a_j < 0 \quad\Rightarrow\quad -B_1 < 0

which is useless because the B_1 constant is positive definite anyway.

Combining all of the above requirements (6.5), (6.7) and (6.8) into one gives:

\max(1, D_1) < B_1 < D_1 + 1

which represents one of the conditions to be put on the F1 constants such that the 2/3 rule will operate.

The output value of the j-th F1 neuron is obtained by applying the following activation function:

f_1(a_j) = \begin{cases} 1 & \text{if the activation } a_j > 0 \\ 0 & \text{if the activation } a_j \leqslant 0 \end{cases}    (6.9)
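These constraints translate directly into a couple of Scilab helpers; a minimal sketch (the function names are ours, not the book's):

    // Check the F1 constants against the 2/3-rule condition
    // max(1, D1) < B1 < D1 + 1 derived above.
    function ok = art1_check_constants(A1, B1, C1, D1)
        ok = (A1 > 0) & (C1 > 0) & (D1 > 0) & (max(1, D1) < B1) & (B1 < D1 + 1);
    endfunction

    // The digitizing function f1 of (6.9), applied element-wise.
    function y = f1(a)
        y = bool2s(a > 0);    // 1 where the activation is positive, 0 otherwise
    endfunction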
6.2.2 The F2 layer

➀ Initially the network is started in a "0" state. There are no signals traveling internally and there is no input (x = \widehat{0}), so the output of the gain control unit is 0. Even if the F2 layer receives a direct input from the outside environment (another network, etc.), the 2/3 rule stops F2 from sending any output.
➁ Once an input has arrived at F1, the output of the gain control unit switches to 1, because the output of F2 (z') is still 0.

Now the output of F2 is allowed. There are two cases:
(a) If there is already an input from outside, then F2 will output immediately, without waiting for the input from F1 (see the remarks about expectation in section 6.1).
(b) If there is no external input then the F2 layer has to wait for the output of F1 before being able to send an output.

The conclusion is that at the F2 level the gain control unit is used just to turn on/off the right of F2 to send an output. Because the output of the gain control unit (i.e. 1) is sent uniformly to all neurons, it doesn't play any other active role and can be left out from the equations describing the behavior of the F2 units.

✍ Remarks:
➥ In fact the equation describing the activation of the F2 neuron is of the form:
In fact the equation describing the activation of the F2 neuron is of the form:
8 0
da
>
>
< dtk = ; a0k + (1 ; A2a0k ) excitatory input
output
:
; (B2 + C2 a0k ) inhibitory input
equation >
> a0 = 0
:
k
if g = 1
othewise
6.2. ART1 DYNAMICS
91
where ak is the neuron activation and A2 , B2 and C2 are positive constants (see
(6.1)). The rst part is analyzed below.
0
The neuron k on the F2 layer receives an excitatory input from F1: W'(k,:)\,f_1(a) (where W' is the weight matrix of the connections from F1 to F2) and from itself: h(a'_k), h being the feedback function:

\text{excitatory input} = W'(k,:)\,f_1(a) + h(a'_k)

The same neuron k receives a direct inhibitory input from all other F2 neurons, \sum_{\ell=1,\,\ell\neq k}^{K} h(a'_\ell), and an indirect inhibitory input, \sum_{\ell=1,\,\ell\neq k}^{K} W'(\ell,:)\,f_1(a), where K is the number of neurons on F2. The latter term represents the indirect inhibition (feedback) due to the fact that the other neurons will have a positive output (because of their input), while the former is due to the direct inter-connections (lateral feedback) between the neurons of the F2 layer:

\text{inhibitory input} = \sum_{\ell=1,\,\ell\neq k}^{K} h(a'_\ell) + \sum_{\ell=1,\,\ell\neq k}^{K} W'(\ell,:)\,f_1(a)
Eventually, from (6.1):

\frac{da'_k}{dt} = -a'_k + (1 - A_2 a'_k)\left[D_2 W'(k,:)\,f_1(a) + h(a'_k)\right] - (B_2 + C_2 a'_k) \sum_{\ell=1,\,\ell\neq k}^{K} \left[h(a'_\ell) + W'(\ell,:)\,f_1(a)\right]

where D_2 is a multiplying positive constant.

Let B_2 = 0, C_2 = A_2 and D_2 = 1. Then:

\frac{da'_k}{dt} = -a'_k + h(a'_k) + W'(k,:)\,f_1(a) - A_2 a'_k \sum_{\ell=1}^{K} \left[h(a'_\ell) + W'(\ell,:)\,f_1(a)\right]    (6.10)

or, in matrix notation:

\frac{da'}{dt} = -a' + h(a') + W' f_1(a) - A_2 \left\{\widehat{1}^T \left[h(a') + W' f_1(a)\right]\right\} a'

For a feedback function of the form h(a'_k) = a'^m_k, where m > 1, the above equations define a competitive layer.

Proof. First the following change of variables is performed:

\widetilde{a}_k = \frac{a'_k}{\sum_{\ell=1}^{K} a'_\ell} \quad\text{and}\quad a'_{\text{tot}} = \sum_{\ell=1}^{K} a'_\ell \quad\Rightarrow\quad a'_k = \widetilde{a}_k\,a'_{\text{tot}} \quad\text{and}\quad \frac{da'_k}{dt} = \frac{d\widetilde{a}_k}{dt}\,a'_{\text{tot}} + \widetilde{a}_k\,\frac{da'_{\text{tot}}}{dt}

By doing a summation over all k in (6.10) and using the change of variables just introduced:

\frac{da'_{\text{tot}}}{dt} = -a'_{\text{tot}} + \sum_{\ell=1}^{K} h(\widetilde{a}_\ell a'_{\text{tot}}) + \sum_{\ell=1}^{K} W'(\ell,:)\,f_1(a) - A_2 a'_{\text{tot}} \sum_{\ell=1}^{K} \left[h(\widetilde{a}_\ell a'_{\text{tot}}) + W'(\ell,:)\,f_1(a)\right]    (6.11)

Then, from the substitution introduced and from (6.10) and (6.11):

\frac{d\widetilde{a}_k}{dt}\,a'_{\text{tot}} = \frac{da'_k}{dt} - \widetilde{a}_k\,\frac{da'_{\text{tot}}}{dt} = h(\widetilde{a}_k a'_{\text{tot}}) + W'(k,:)\,f_1(a) - \widetilde{a}_k \sum_{\ell=1}^{K} \left[h(\widetilde{a}_\ell a'_{\text{tot}}) + W'(\ell,:)\,f_1(a)\right]

As \sum_{\ell=1}^{K} \widetilde{a}_\ell = 1, the above equation may be rewritten as:

\frac{d\widetilde{a}_k}{dt}\,a'_{\text{tot}} = \sum_{\ell=1}^{K} \widetilde{a}_\ell\,h(\widetilde{a}_k a'_{\text{tot}}) - \sum_{\ell=1}^{K} \widetilde{a}_k\,h(\widetilde{a}_\ell a'_{\text{tot}}) + W'(k,:)\,f_1(a) - \widetilde{a}_k \sum_{\ell=1}^{K} W'(\ell,:)\,f_1(a)
 = \widetilde{a}_k\,a'_{\text{tot}} \sum_{\ell=1}^{K} \widetilde{a}_\ell \left[\frac{h(\widetilde{a}_k a'_{\text{tot}})}{\widetilde{a}_k a'_{\text{tot}}} - \frac{h(\widetilde{a}_\ell a'_{\text{tot}})}{\widetilde{a}_\ell a'_{\text{tot}}}\right] + W'(k,:)\,f_1(a) - \widetilde{a}_k \sum_{\ell=1}^{K} W'(\ell,:)\,f_1(a)

and on this formula the following cases may be considered:

Identity function, h(\widetilde{a}_k a'_{\text{tot}}) = \widetilde{a}_k a'_{\text{tot}}; then:

\frac{d\widetilde{a}_k}{dt}\,a'_{\text{tot}} = W'(k,:)\,f_1(a) - \widetilde{a}_k \sum_{\ell=1}^{K} W'(\ell,:)\,f_1(a)

and the stable value (obtained from \frac{d\widetilde{a}_k}{dt} = 0) is:

\widetilde{a}_k = \frac{W'(k,:)\,f_1(a)}{\sum_{\ell=1}^{K} W'(\ell,:)\,f_1(a)} \quad\Rightarrow\quad a'_k \propto W'(k,:)\,f_1(a)

i.e. the output is proportional to the weighted sum of the inputs.

Square function, h(\widetilde{a}_k a'_{\text{tot}}) = (\widetilde{a}_k a'_{\text{tot}})^2; then \frac{h(\widetilde{a}_k a'_{\text{tot}})}{\widetilde{a}_k a'_{\text{tot}}} - \frac{h(\widetilde{a}_\ell a'_{\text{tot}})}{\widetilde{a}_\ell a'_{\text{tot}}} reduces to a'_{\text{tot}}(\widetilde{a}_k - \widetilde{a}_\ell), which for \widetilde{a}_k > \widetilde{a}_\ell represents an amplification while for \widetilde{a}_k < \widetilde{a}_\ell represents an inhibition. The W'(k,:)\,f_1(a) term is constant. The F2 layer acts as a "winner-takes-all" network (there is a strong similarity with the functionality of the hidden layer of a counterpropagation network); the distance between neurons with large output and those with small output widens and eventually only one neuron will have a non-zero output. The same discussion goes for h(\widetilde{a}_k a'_{\text{tot}}) = (\widetilde{a}_k a'_{\text{tot}})^m with m > 1.
The winning F2 neuron sends a value of 1 to the F1 layer; all the others send 0. Let f_2(a') be the output (activation) function whose value is sent to F1:

f_2(a'_k) = \begin{cases} 1 & \text{if } a'_k = \max\limits_{\ell=1,K} a'_\ell \\ 0 & \text{otherwise} \end{cases}    (6.12)

✍ Remarks:
➥ It seems, at first sight, that the F2 neuron has two outputs: one sent to the F2 layer (the h function) and one sent to the F1 layer (the f2 function). This runs counter to the definition of a neuron: it should have just one output. However this contradiction may be overcome if the neuron is replaced with an ensemble of three neurons: the main one, which calculates the activation a'_k and sends the result (it has the identity function as activation function) to two others which receive its output (they have one input, with weight 1), apply the h, respectively f2, functions and send their output wherever it is required. See figure 6.2.

Figure 6.2: The F2 neuron structure.
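A minimal Scilab sketch of the winner-takes-all output (6.12), assuming ties are broken in favor of the first maximal component (names ours):

    // Winner-takes-all output of the F2 layer, eq. (6.12): 1 for the
    // neuron with the largest activation, 0 for all the others.
    function y = f2(aprime)
        [m, k] = max(aprime);    // index of the winning neuron
        y = zeros(aprime);       // zero vector of the same size
        y(k) = 1;
    endfunction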
6.2.3 Learning on F1: The W weights

The differential equations describing the F1 learning process (i.e. the adaptation of the W weights) are:

\frac{dw_{jk}}{dt} = \left[-w_{jk} + f_1(a_j)\right] f_2(a'_k) \ ,\quad j = 1,N \ ,\ k = 1,K    (6.13)

There is just one (at most) "winner" on F2, let k be that one, for which f_2(a'_k) \neq 0; for all the others f_2(a'_\ell) = 0, i.e. only the weights related to the winning F2 neuron are adapted on the F1 layer, all the others remain unchanged (for a given input).

✍ Remarks:
➥ During the learning process only one (at most) component of the weight vector W(j,:) changes for each F1 neuron, i.e. only column k of W changes, k being the F2 winner.
Because of the definition of the f1 and f2 functions (see (6.9) and (6.12)), the following cases may be considered:

➀ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron active (f_1(a_j) = 1); then:

\frac{dw_{jk}}{dt} = -w_{jk} + 1 \quad\Rightarrow\quad w_{jk} = 1 - e^{-t}

(solution found by first searching for the solution of the homogeneous equation and then making the "constant" time-dependent to find the general solution). The weight asymptotically approaches 1 for t \to \infty.
➁ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron non-active (f_1(a_j) = 0); then:

\frac{dw_{jk}}{dt} = -w_{jk} \quad\Rightarrow\quad w_{jk} = w_{jk}(0)\,e^{-t}

where w_{jk}(0) is the initial value at t = 0. The weight asymptotically decreases to 0 for t \to \infty.

➂ F2:k neuron non-winner (f_2(a'_k) = 0); then:

\frac{dw_{jk}}{dt} = 0 \quad\Rightarrow\quad w_{jk} = \text{const.}

i.e. the weights do not change.
A supplementary condition on the W weights is required in order for the 2/3 rule to function, see (6.6):

w_{jk} > \frac{B_1 - 1}{D_1}    (6.14)

i.e. all weights have to be initialized to a value greater than \frac{B_1 - 1}{D_1}. Otherwise the F1:j neuron is kept in an inactive state and the weights decrease to 0 (or do not change).
Fast learning: If the F2:k and F1:j neurons are both active then the weight w_{jk} \to 1, otherwise it decays towards 0 or remains unchanged. A fast way to achieve the learning is to set the weights to their asymptotic values as soon as possible, i.e. knowing the neuronal activities:

w_{jk} = \begin{cases} 1 & \text{if the } j, k \text{ neurons are active} \\ \text{no change} & \text{if the } k \text{ neuron is non-active} \\ 0 & \text{otherwise} \end{cases}    (6.15)

or, in matrix notation:

W(:,k)_{\text{new}} = f_1(a) \quad\text{and}\quad W(:,\ell)_{\text{new}} = W(:,\ell)_{\text{old}} \ \text{for}\ \ell \neq k

Proof. Only column k of W is to be changed (the weights related to the F2 winner); f_1(a) is 1 for an active neuron, 0 otherwise, see (6.9).
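In Scilab the fast-learning update for W is a one-liner; a hedged sketch (names ours):

    // Fast learning for the F1 weights, eq. (6.15): only column k
    // (the F2 winner) is replaced by the binary F1 output f1(a).
    function W = art1_update_W(W, f1a, k)
        W(:, k) = f1a;
    endfunction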
6.2.4 Learning on F2: The W' weights

The differential equations describing the F2 learning process are:

\frac{dw'_{kj}}{dt} = E\left[F(1 - w'_{kj})\,f_1(a_j) - w'_{kj} \sum_{\ell=1,\,\ell\neq j}^{N} f_1(a_\ell)\right] f_2(a'_k)

where E, F = const. and, because \sum_{\ell\neq j} f_1(a_\ell) = \sum_{\ell=1}^{N} f_1(a_\ell) - f_1(a_j), the equations may be rewritten as:

\frac{dw'_{kj}}{dt} = E\left[F(1 - w'_{kj})\,f_1(a_j) - w'_{kj}\left(\sum_{\ell=1}^{N} f_1(a_\ell) - f_1(a_j)\right)\right] f_2(a'_k)
For all the neurons \ell on F2, except the winner, f_2(a'_\ell) = 0, so only the winner's weights are adapted; all the others remain unchanged (for a given input). Analogously to the previous W weights (see also (6.9) and (6.12)), the following cases are discussed:
➀ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron active (f_1(a_j) = 1); then:

\frac{dw'_{kj}}{dt} = E\left[F - w'_{kj}\left(F - 1 + \sum_{\ell=1}^{N} f_1(a_\ell)\right)\right]

and the solution is of the form:

w'_{kj} = \frac{F}{F - 1 + \sum_{\ell=1}^{N} f_1(a_\ell)} \left[1 - \exp\left(-Et\left(F - 1 + \sum_{\ell=1}^{N} f_1(a_\ell)\right)\right)\right]

(found analogously to W). The weight asymptotically approaches \frac{F}{F - 1 + \sum_{\ell=1}^{N} f_1(a_\ell)} for t \to \infty. In the extreme case it may be possible that \sum_{\ell=1}^{N} f_1(a_\ell) = 0, such that the condition F > 1 has to be imposed to keep the weights positive.
➁ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron non-active (f_1(a_j) = 0); then:

\frac{dw'_{kj}}{dt} = -E\,w'_{kj} \sum_{\ell=1}^{N} f_1(a_\ell) \quad\Rightarrow\quad w'_{kj} = w'_{kj}(0) \exp\left(-Et \sum_{\ell=1}^{N} f_1(a_\ell)\right)

where w'_{kj}(0) is the initial value at t = 0. The weight asymptotically decreases to 0 for t \to \infty.

➂ F2:k neuron non-winner (f_2(a'_k) = 0); then:

\frac{dw'_{kj}}{dt} = 0 \quad\Rightarrow\quad w'_{kj} = \text{const.}

i.e. the weights do not change.
Fast learning: If the F2:k and F1:j neurons are both active then the weight w'_{kj} tends to its asymptotic value \frac{F}{F - 1 + \sum_{\ell} f_1(a_\ell)}, otherwise it decays towards 0 or remains unchanged. A fast way to achieve the learning is to set the weights to their asymptotic values as soon as possible, i.e. knowing the neuronal activities:

w'_{kj} = \begin{cases} \frac{F}{F - 1 + \sum_{\ell=1}^{N} f_1(a_\ell)} & \text{if the } k, j \text{ neurons are active} \\ \text{no change} & \text{if } k \text{ is non-active} \\ 0 & \text{otherwise} \end{cases}    (6.16)

or, in matrix notation:

W'(k,:)_{\text{new}} = \frac{F}{F - 1 + \widehat{1}^T f_1(a)}\, f_1(a)^T \quad\text{and}\quad W'(\ell,:)_{\text{new}} = W'(\ell,:)_{\text{old}} \ \text{for}\ \ell \neq k    (6.17)

Proof. Only row k of W' is to be changed (the weights related to the F2 winner); f_1(a) is 1 for an active neuron, 0 otherwise, see (6.9).
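The matching Scilab sketch for (6.17), under the same naming assumptions as the W update above:

    // Fast learning for the F2 weights, eq. (6.17): only row k
    // (the F2 winner) is replaced.
    function Wp = art1_update_Wprime(Wp, f1a, k, F)
        Wp(k, :) = (F / (F - 1 + sum(f1a))) * f1a';
    endfunction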
6.2.5 Subpatterns

The input vector is binary: x_i = 0 or 1. All patterns are subpatterns of the unit input vector \widehat{1}. Also it is possible that a pattern x' is a subpattern of another input vector x'', written x' \subseteq x'', i.e. either x'_i = x''_i or x'_i = 0. The ART1 network ensures that the proper F2 neuron will win when such a case occurs, i.e. the winning neurons must be different for x' and x''; let k' and k'' be those ones. k' \neq k'' if x' and x'' correspond to different classes.
Proof. When first presented with an input vector, the F1 layer outputs the same vector and distributes it to the F2 layer, see (6.3) and (6.9) (the later change in the F1 activation pattern is used just to reset the F2 layer). So at this stage f_1(a) = x.

Assuming that the k' neuron has learned x', its weights should be (from (6.17)):

W'(k',:) = \frac{F}{F - 1 + \widehat{1}^T x'}\, x'^T    (6.18)

and for the k'' neuron, which has learned x'', the weights should be:

W'(k'',:) = \frac{F}{F - 1 + \widehat{1}^T x''}\, x''^T

When x' is presented as input, the total input to the k' neuron is (the output of F1 being x'):

a'_{k'} = W'(k',:)\, x' = \frac{F}{F - 1 + \widehat{1}^T x'}\, x'^T x'

while the total input to the k'' neuron is:

a'_{k''} = W'(k'',:)\, x' = \frac{F}{F - 1 + \widehat{1}^T x''}\, x''^T x'

Because x' \subseteq x'', then x''^T x' = x'^T x' but \widehat{1}^T x'' > \widehat{1}^T x', and then a'_{k''} < a'_{k'} and k' wins, as it should.

Similarly, when x'' is presented as input, the total input to the k' neuron is:

a'_{k'} = W'(k',:)\, x'' = \frac{F}{F - 1 + \widehat{1}^T x'}\, x'^T x''

while the total input to the k'' neuron is:

a'_{k''} = W'(k'',:)\, x'' = \frac{F}{F - 1 + \widehat{1}^T x''}\, x''^T x''

As x' \subseteq x'' and both are binary vectors, then x'^T x'' = x'^T x' = \widehat{1}^T x', x''^T x'' = \widehat{1}^T x'' and \widehat{1}^T x' < \widehat{1}^T x'', such that:

a'_{k'} = \frac{F\,\widehat{1}^T x'}{F - 1 + \widehat{1}^T x'} < \frac{F\,\widehat{1}^T x''}{F - 1 + \widehat{1}^T x''} = a'_{k''}

and the neuron k'' wins, as it should.

✍ Remarks:
➥ The input patterns are assumed non-zero; otherwise x = \widehat{0} means no activation.
➥ All inputs are subpatterns of the unit vector \widehat{1}, and the neuron which has learned the unit vector has the smallest weights:

W'(k,:) = \frac{F}{F - 1 + N}\, \widehat{1}^T

and the smallest output (when the unit vector is presented as input):

a'_k = \frac{F N}{F - 1 + N}

➥ The F2 neurons which aren't used yet should not win over neurons which were already committed to an input vector. Then the weights of the unused neurons have to be initialized such that they do not win in the worst case, i.e. when \widehat{1} has already been committed to a neuron. Uncommitted neuronal weights have to be initialized with values:

w'_{kj} \in \left(0, \frac{F}{F - 1 + N}\right)

(the 0 has to be avoided because it would give a 0 output). Also, the values by which they are initialized should be random, such that when a new class of inputs is presented at the input and none of the previously committed neurons wins, only one of the uncommitted neurons wins.
6.2.6 The Reset Unit

The reset neuron is set to detect mismatches between the input vector and the output of the F1 layer.

At the start, when an input is present, the output of F1 is identical to the input and the reset unit should not activate.

Also, the reset unit should not activate if the difference between the input and the F1 output is below some specified value. Differences between the input and the stored pattern appear due to noise, missing data or small differences between vectors belonging to the same class.

All inputs are of equal importance, so they receive the same weight; the same happens with the F1 outputs, but they come as inhibitory input to the reset unit. See figure 6.1. Let Q be the weight(s) for the inputs and -S the weight(s) for the F1 connections (Q, S > 0, Q, S = const.). The total input to the reset unit is a_R:

a_R = Q \sum_{i=1}^{N} x_i - S \sum_{i=1}^{N} f_1(a_i)

The reset unit activates if its net input is positive:

Q \sum_{i=1}^{N} x_i - S \sum_{i=1}^{N} f_1(a_i) > 0 \quad\Leftrightarrow\quad \frac{\sum_{i=1}^{N} f_1(a_i)}{\sum_{i=1}^{N} x_i} < \frac{Q}{S} = \rho

where \rho is called the vigilance parameter. For \frac{\sum_{i=1}^{N} f_1(a_i)}{\sum_{i=1}^{N} x_i} \geqslant \rho the reset unit does not trigger.

Because at the beginning (before F2 activates) f_1(a) = x, the vigilance parameter should be \rho \leqslant 1, i.e. Q \leqslant S (otherwise the reset would always trigger).
Noise in data

The vigilance parameter self-scales the difference between the actual input pattern and the stored/learned one. Depending upon the input pattern, the difference (noise, distance, etc.) between the two may or may not trigger the reset unit.

For the same difference (same set of ones in the input vector) the ratio between noise and information varies depending upon the number of ones in the input vector: assuming that the noise vector is the smallest one, x = (0 ... 1 ... 0)^T (just one "1"), and the stored vector is similar (it also has just one "1"), then for an input with noise the ratio between noise and data is at least 1:1; for an input having two ones the ratio drops to half, 0.5:1, and so on.
New pattern learning

If the reset unit is activated then it deactivates the winning F2 neuron for a sufficiently long time such that all committed F2 neurons have a chance to win (and to see if the input pattern is "theirs") or a new uncommitted neuron is set to learn a new class of inputs.

If none of the already used neurons was able to establish a "resonance" between F1 and F2, then an unused (so far) neuron k (from F2) wins. The activities of the F1 neurons are:

a_j = \frac{x_j + D_1 W(j,:)\,f_2(a') - B_1}{1 + A_1\left[x_j + D_1 W(j,:)\,f_2(a')\right] + C_1} = \frac{x_j + D_1 w_{jk} - B_1}{1 + A_1 (x_j + D_1 w_{jk}) + C_1}

(see (6.4); because f_2(a') is 1 just for the winner k and zero for the rest, W(j,:)\,f_2(a') = w_{jk}). For newly committed F2 neurons the weights (from F2 to F1) are initialized to a value w_{jk} > \frac{B_1 - 1}{D_1} (see (6.14)) and then:

x_j + D_1 w_{jk} - B_1 > x_j - 1

which means that for x_j = 1 the activity a_j is positive and f_1(a_j) = 1, while for x_j = 0, because of the 2/3 rule, the activity is negative and f_1(a_j) = 0.

Conclusion: when a new F2 neuron is committed to learn the input, the output of the F1 layer is identical to the input, the reset neuron does not activate, and the learning of the new pattern begins.
✍ Remarks:
➥ The F2 layer should have enough neurons for all the classes to be learned, otherwise an overloading of neurons, and consequently instabilities, may occur.
➥ Learning of new patterns may be stopped and resumed at any time, by allowing or denying the weight adaptation.
Missing data

Let us assume that an input vector which is similar to a stored/learned one is presented to the network, and consider the case where some data is missing, i.e. some components of the input are 0 where the stored pattern has 1's. Assuming that the input is "far" enough from the other stored vectors, only the designated F2 neuron will win, i.e. the one which previously learned the complete pattern (there is no reset).

Assuming that a reset does not occur, the vector sent back by the winning F2 neuron to the F1 layer will have more 1's than the input pattern (after one transmission cycle between the F1,2 layers). The corresponding weights (see (6.15): j non-active because of the 2/3 rule, k active) are set to 0, and eventually the F2 winner learns the new input, assuming that learning was allowed, i.e. the weights are free to adapt.

If the original, complete, input vector is applied again, the original F2 neuron may learn again the same class of input vectors, or otherwise a new unassigned F2 neuron may be committed to learn it.

This kind of behavior may lead to a continuous change in the class of vectors represented by the F2 neurons, if learning is always allowed.
➧ 6.3 The ART1 Algorithm

Initialization

The size N of the F1 layer is determined by the dimension of the input vectors.

1. The dimension K of the F2 layer is based on the desired number of classes to be learned, now and later. Note also that in special cases some classes may require to be divided into "subclasses" with different assigned F2 winning neurons.
2. Select the constants: A_1 > 0, C_1 > 0, D_1 > 0, \max(1, D_1) < B_1 < D_1 + 1, F > 1 and \rho \in (0, 1].
3. Initialize the W weights with random values such that:

w_{jk} > \frac{B_1 - 1}{D_1} \ ,\quad j = 1,N \ ,\ k = 1,K

4. Initialize the W' weights with random values such that:

w'_{kj} \in \left(0, \frac{F}{F - 1 + N}\right) \ ,\quad k = 1,K \ ,\ j = 1,N
Network running and learning

The algorithm uses the fast learning method (asymptotic values for the weights).

1. Apply an input vector x. The F1 output becomes f_1(a) = x (at the first run; x is a binary vector).
2. Calculate the activities of the F2 neurons and find the winner. The neuron with the biggest input from F1 wins (and all the others will have zero output). For k, the F2 winner, it is true that:

W'(k,:)\,f_1(a) = \max_{\ell} W'(\ell,:)\,f_1(a)

3. Calculate the new activities of the F1 neurons caused by the inputs from F2. The F2 output is a vector which has all its components 0, with one exception for the winner k; multiplying W by such a vector means that column W(:,k) is selected to become the activity of F1, and the new F1 output becomes:

f_1(a)_{\text{new}} = \operatorname{sign}(W(:,k))

4. Calculate the "degree of match" between the input and the new output of the F1 layer:

\text{degree of match} = \frac{\sum_{j=1}^{N} f_1(a_j)_{\text{new}}}{\sum_{j=1}^{N} x_j} = \frac{\widehat{1}^T f_1(a)_{\text{new}}}{\widehat{1}^T x}

5. Compare the "degree of match" computed previously with the vigilance parameter \rho. If the vigilance parameter is bigger than the "degree of match" then:
(a) Mark the F2:k neuron as inactive for the rest of the cycle, while working with the same input x.
(b) Restart the procedure with the same input vector.
Otherwise continue: the input vector was positively identified (assigned, if the winning neuron is a previously unused one) as being of class k.
6. Update the weights (if learning is enabled; it has to be always enabled for new classes). See (6.15) and (6.16):

W(:,k)_{\text{new}} = f_1(a)_{\text{new}} \qquad\quad W'(k,:)_{\text{new}} = \frac{F}{F - 1 + \widehat{1}^T f_1(a)_{\text{new}}}\, f_1(a)_{\text{new}}^T

only the weights related to the winning F2 neuron being updated.
7. The information returned by the network is the classification of the input vector, given by the winning F2 neuron (in the one-of-k encoding scheme).
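The whole run-and-learn cycle above fits in a short Scilab function. This is a hedged sketch rather than the book's own program: all names are ours, fast learning is assumed, the input x is a non-zero binary column vector, and step 3 applies the 2/3 rule explicitly (the new F1 output is the intersection of the input with the stored pattern).

    // ART1 classification with fast learning (a sketch of section 6.3).
    // x: binary column vector (N x 1); W: N x K; Wp: K x N; rho: vigilance.
    function [k, W, Wp] = art1_classify(x, W, Wp, rho, F)
        K = size(Wp, 'r');
        active = ones(K, 1);                 // F2 neurons not yet reset
        while %t
            a = (Wp * x) .* active;          // step 2: F2 activities (f1(a) = x)
            [m, k] = max(a);                 // winner-takes-all
            f1new = sign(W(:, k)) .* x;      // step 3, with the 2/3 rule
            if sum(f1new) / sum(x) >= rho then
                break;                       // steps 4-5: resonance reached
            end
            active(k) = 0;                   // reset: disable winner, retry
        end
        W(:, k) = f1new;                                   // step 6, (6.15)
        Wp(k, :) = (F / (F - 1 + sum(f1new))) * f1new';    // step 6, (6.16)
    endfunction

Repeated calls with learning enabled reproduce the stop-and-resume behavior discussed earlier: committed neurons keep their classes while free neurons pick up new ones.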
➧ 6.4 The ART2 Architecture

(See [FS92] pp. 316-318.)

Unlike ART1, the ART2 network is designed to work with analog positive inputs. There is a broad similarity with the ART1 architecture: there is an F1 layer which sends its output to an F2 layer and to a reset layer. The F2 layer is of "winner-takes-all" type and the reset unit has the same role as in ART1. However, the F1 layer is made of 6 sublayers, labeled w, s, u, v, p and q. See figure 6.3. Each of the sublayers has the same number of neurons as the number of components of the input vector.

➀ The input vector is sent to the w sublayer.
➁ The output of the w sublayer is sent to the s sublayer.
➂ The output of the s sublayer is sent to the v sublayer.
➃ The output of the v sublayer is sent to the u sublayer.
➄ The output of the u sublayer is sent to the p sublayer, to the reset layer r and back to the w sublayer.
➅ The output of the p sublayer is sent to the q sublayer and to the reset layer. The output of the p sublayer also represents the output of the F1 layer and is sent to F2.

Figure 6.3: The ART2 network architecture. Thin arrows represent neuron-to-neuron connections between (sub)layers; thick arrows represent full inter-layer connections (from all neurons to all neurons). The G units are gain-control neurons which send an inhibitory signal; the R unit is the reset neuron.

✍ Remarks:
➥ Between sublayers there is a one-to-one neuronal connection (neuron j from one layer to the corresponding neuron j from the other layer).
➥ All (sub)layers receive input from 2 sources; a supplementary gain-control neuron has been added where necessary, such that the layers comply with the 2/3 rule.
➥ The gain-control unit has the role of normalizing the output of the corresponding layers (see also the (6.20) equations); note that all layers have 2 sources of input, either from 2 layers or from a layer and a gain-control unit. The gain-control neuron receives input from all the neurons of the corresponding sublayer and sends the sum of its inputs as inhibitory input (see also (6.21) and table 6.1), while the other layers send an excitatory input.
➧ 6.5 ART2 Dynamics

(See [FS92] pp. 318-324 and [CG87].)

6.5.1 The F1 layer

The differential equations governing the behavior of the F1 sublayers are:

\frac{dw_j}{dt} = -w_j + x_j + a u_j \qquad\qquad \frac{dv_j}{dt} = -v_j + f(s_j) + b f(q_j)
\frac{ds_j}{dt} = -e s_j + w_j - s_j \|w\| \qquad\quad \frac{dp_j}{dt} = -p_j + u_j + W(j,:)\,f_2(a')
\frac{du_j}{dt} = -e u_j + v_j - u_j \|v\| \qquad\quad \frac{dq_j}{dt} = -e q_j + p_j - q_j \|p\|

where a, b, e = const. The f function determines the contrast enhancement which takes place inside the F1 layer; a possible definition would be:

f(x) = \begin{cases} 0 & \text{for } x < \theta \\ x & \text{otherwise} \end{cases}    (6.19)

where \theta \in (0, 1), \theta = const. The norm of a vector x is here defined as the Euclidean one. When applied to a vector, the contrast enhancement function may be written in matrix format as:

f(x) = x \odot \operatorname{sign}(\operatorname{sign}(x - \theta\widehat{1}) + \widehat{1})

(\odot denoting the element-wise product).
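In Scilab the matrix form of (6.19) is immediate; a minimal sketch (the theta argument stands for the θ above; the function name is ours):

    // Contrast enhancement (6.19), applied element-wise to a vector:
    // components below the threshold theta are suppressed.
    function y = f_enh(x, theta)
        y = x .* sign(sign(x - theta) + 1);
    endfunction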
The equilibrium values (from \frac{d(\cdot)}{dt} = 0) are:

w_j = x_j + a u_j \qquad\qquad v_j = f(s_j) + b f(q_j)
s_j = \frac{w_j}{e + \|w\|} \qquad\qquad p_j = u_j + W(j,:)\,f_2(a')
u_j = \frac{v_j}{e + \|v\|} \qquad\qquad q_j = \frac{p_j}{e + \|p\|}    (6.20)

These results may be described by means of one single equation, with different parameters for the different sublayers (see table 6.1):

\frac{d(\text{neuron output})}{dt} = -C_1\,(\text{neuron output}) + (\text{excitatory input}) - C_2\,(\text{neuron output})\,(\text{inhibitory input})    (6.21)

✍ Remarks:
➥ The same (6.21) is applicable to the reset layer, with the parameters in table 6.1 (c = const.).
➥ The purpose of the e constant is to limit the output of the s, q, u and r (sub)layers when their input is 0; consequently e should be chosen e \gtrsim 0 and may be neglected when real data is presented to the network.
Layer   Neuron output   C1   C2   Excitatory input           Inhibitory input
  w         w_j          1    1   x_j + a u_j                0
  s         s_j          e    1   w_j                        \|w\|
  u         u_j          e    1   v_j                        \|v\|
  v         v_j          1    1   f(s_j) + b f(q_j)          0
  p         p_j          1    1   u_j + W(j,:) f_2(a')       0
  q         q_j          e    1   p_j                        \|p\|
  r         r_j          e    1   u_j + c p_j                \|u\| + c\|p\|

Table 6.1: The parameters for the general ART2 differential equation (6.21).

6.5.2 The F2 layer
The F2 layer of the ART2 network is identical to the F2 layer of the ART1 network. The total input into neuron k is a'_k = W'(k,:)\,f_1(a) and the output is:

f_2(a'_k) = \begin{cases} d & \text{if } a'_k = \max\limits_{\ell} a'_\ell \\ 0 & \text{otherwise} \end{cases}    (6.22)

where d = const., d \in (0, 1).

Then the output of the p sublayer becomes:

p_j = \begin{cases} u_j & \text{if } F_2 \text{ is inactive} \\ u_j + d w_{jk} & \text{for } k \text{ the winning neuron on } F_2 \end{cases}    (6.23)
6.5.3 The Reset Layer

The differential equation defining the running of the reset layer is of the form given by (6.21), with the parameters defined in table 6.1:

\frac{dr_j}{dt} = -e r_j + u_j + c p_j - r_j \|u\| - c r_j \|p\|

with the inhibitory input given by the 2 gain-control neurons.

The equilibrium value is:

r_j = \frac{u_j + c p_j}{e + \|u\| + c\|p\|} \simeq \frac{u_j + c p_j}{\|u\| + c\|p\|} \ ,\qquad r = \frac{u + c p}{\|u\| + c\|p\|}    (6.24)

(see also the remarks regarding the value of e).

By definition, considering the vigilance parameter \rho, the reset occurs when:

\|r\| < \rho    (6.25)
The reset should not activate before an output from the F2 layer has arrived (this condition is used in the ART2 algorithm); and indeed, from the (6.20) equations, if f_2(a') = \widehat{0} then p = u and (6.24) gives \|r\| = 1.
6.5.4 Learning and Initialization

The differential equations governing the weights adaptation are defined as:

\frac{dw_{jk}}{dt} = f_2(a'_k)(p_j - w_{jk}) \qquad\text{and}\qquad \frac{dw'_{kj}}{dt} = f_2(a'_k)(p_j - w'_{kj})

and, considering (6.22) and (6.23):

\frac{dw_{jk}}{dt} = \begin{cases} d(u_j + d w_{jk} - w_{jk}) & \text{for } k \text{ winner on } F_2 \\ 0 & \text{otherwise} \end{cases}

\frac{dw'_{kj}}{dt} = \begin{cases} d(u_j + d w_{jk} - w'_{kj}) & \text{for } k \text{ winner on } F_2 \\ 0 & \text{otherwise} \end{cases}
Fast Learning

The weights related to the winning F2 neuron are updated; all the others remain unchanged. The equilibrium values are obtained from the \frac{d(\cdot)}{dt} = 0 condition. Assuming that k is the winner:

u_j + d w_{jk} - w_{jk} = 0 \qquad\text{and}\qquad u_j + d w_{jk} - w'_{kj} = 0

and then:

w_{jk} = \frac{u_j}{1 - d} \quad\text{and}\quad w'_{kj} = \frac{u_j}{1 - d} \quad\Rightarrow\quad \begin{cases} W(:,k) = \frac{u}{1 - d} \\ W'(k,:) = \frac{u^T}{1 - d} = W(:,k)^T \end{cases}    (6.26)

(this is why the condition 0 < d < 1 is necessary). Eventually the W' weight matrix becomes the transpose of the W matrix, when all the F2 neurons have been used to learn new data.
New pattern learning

The reset unit should not activate when a learning process takes place.

Using the fact that u_j = \frac{v_j}{\|v\|} (e \simeq 0), and then \|u\| = 1, and also \|r\| = \sqrt{r^T r}, from (6.24):

\|r\| = \frac{\sqrt{1 + 2c\|p\| \cos(\widehat{u, p}) + c^2 \|p\|^2}}{1 + c\|p\|}    (6.27)

where \widehat{u, p} is the angle between the u and p vectors; if p \parallel u then a reset does not occur, because \rho < 1 and the reset condition (6.25) is not met (\|r\| = 1).

If the W weights are initialized to \widetilde{0} then the output of F2, at the beginning of the learning process, is zero and p = u (see (6.23)), such that the reset does not occur.

During the learning process the weight vector W(:,k), associated with the connection from the F2 winner to F1 (the p layer), becomes parallel to u (see (6.26)) and then, from (6.23), p moves (during the learning process) towards becoming parallel with u, and again a reset does not occur.

Conclusion: the W weights have to be initialized to \widetilde{0}.
Initialization of the W' weights

Let us assume that a neuron k from F2 has learned an input vector which, after some time, is presented again to the network. The same k neuron should win, and not one of the (yet) uncommitted F2 neurons. This means that the output of the k neuron, i.e. a'_k = W'(k,:)\,p, should be bigger than a'_\ell = W'(\ell,:)\,p for all unused F2 neurons \ell:

W'(k,:)\,p > W'(\ell,:)\,p \quad\Rightarrow\quad \|W'(k,:)\|\,\|p\| > \|W'(\ell,:)\|\,\|p\| \cos(\widehat{W'(\ell,:), p})

because p \parallel u \parallel W'(k,:), see (6.23) and (6.26) (the learning process, i.e. the weight adaptation, has already been done previously for the k neuron).

The worst possible case is when W'(\ell,:) \parallel p, such that \cos(\widehat{W'(\ell,:), p}) = 1. To ensure that no other neuron but k wins, the condition:

\|W'(\ell,:)\| < \|W'(k,:)\| = \frac{1}{1 - d}    (6.28)

has to be imposed for the unused \ell neurons (for a committed neuron W'(k,:) = \frac{u^T}{1 - d} and \|u\| = 1, as u_j \simeq \frac{v_j}{\|v\|}, e \simeq 0).

To maximize the input a'_\ell of the unused neurons, such that the network will be more sensitive to new patterns, the weights of the (unused) neurons have to be uniformly initialized with the maximum values allowed by condition (6.28), i.e. w'_{\ell j} \lesssim \frac{1}{\sqrt{K}(1 - d)}.

Conclusion: the W' weights have to be initialized with w'_{kj} \lesssim \frac{1}{\sqrt{K}(1 - d)}.
Constants Restraints

As p = u + d W(:,k) and \|u\| = 1, then:

u^T p = \|p\| \cos(\widehat{u, p}) = 1 + d\|W(:,k)\| \cos(\widehat{u, W(:,k)})

\|p\| = \sqrt{p^T p} = \sqrt{1 + 2d\|W(:,k)\| \cos(\widehat{u, W(:,k)}) + d^2 \|W(:,k)\|^2}

(k being the F2 winner). Replacing \|p\| \cos(\widehat{u, p}) and \|p\| into (6.27) gives:

\|r\| = \frac{\sqrt{(1 + c)^2 + 2(1 + c)\,cd\,\|W(:,k)\| \cos(\widehat{u, W(:,k)}) + c^2 d^2 \|W(:,k)\|^2}}{1 + \sqrt{c^2 + 2c\,cd\,\|W(:,k)\| \cos(\widehat{u, W(:,k)}) + c^2 d^2 \|W(:,k)\|^2}}

Figure 6.4 shows the dependency of \|r\| as a function of cd\|W(:,k)\|, \cos(\widehat{u, W(:,k)}) and c; note that \|r\| decreases for cd\|W(:,k)\| < 1.
Discussion:

The learning process should increase the mismatch sensitivity between the F1 pattern sent to F2 and the classification received from F2, i.e. at the end of the learning process the network should be able to discern better between different classes of input. This means that, while the network learns a new input class and W(:,k) grows from the initial value \widehat{0} to the final value with \|W(:,k)\| = \frac{1}{1 - d}, the value of \|r\| (defining the network sensitivity) has to decrease (such that the reset condition (6.25) becomes easier to meet). In order to achieve this, the following condition has to be imposed:

\frac{cd}{1 - d} \leqslant 1

(at the end of the learning \|W(:,k)\| = \frac{1}{1 - d}).

Or, in other words: when there is a perfect fit, \|r\| reaches its maximal value 1; when presented with a slightly different input vector, the same F2 neuron should win and adapt to the new value. During this process the value of \|r\| will first decrease before increasing back to 1. While decreasing, it may happen that \|r\| < \rho and a reset occurs. This means that the input vector does not belong to the class represented by the current F2 winner.

For \frac{cd}{1 - d} \lesssim 1 the network is more sensitive than for \frac{cd}{1 - d} \ll 1: the \|r\| value will drop more for the same input vector (slightly different from the stored/learned one). See figure 6.4-a.

For c \ll 1 the network is more sensitive than for c \lesssim 1 (same reasoning as above). See figure 6.4-b.

Figure 6.4: \|r\| as a function of cd\|W(:,k)\|. Panel (a) shows the dependency for various angles \widehat{u, W(:,k)}, from \pi/2 to 0 in \pi/20 steps, with c = 0.1 constant. Panel (b) shows the dependency for various c values, from 0.1 to 1.9 in 0.2 steps, with \widehat{u, W(:,k)} = \pi/2 - \pi/20 constant.
What happens in fact in the F1 layer may be explained now:

- s, u and q just normalize the w, v and p outputs before sending them further.
- There are connections p \to v (via q) and w \to v (via s) and also back v \to w, p (via u). See figure 6.5. Obviously the v layer acts as a mediator between the input x, received via w, and the output of p, activated by F2. During this negotiation u and v (u being the normalization of v) move away from W(:,k) (\|r\| drops). If they move too far then a reset occurs (\|r\| becomes smaller than \rho) and the process starts over with another F2 neuron and a new W(:,k). Note that u eventually represents a normalized combination (filtered through f) of x and p (see (6.20)).

Figure 6.5: The F1 dynamics: data communication between sublayers.
➧ 6.6 The ART2 Algorithm

Initialization

The dimensions of the w, s, u, v, p, q and r layers equal N (the dimension of the input vector). The norm used is the Euclidean one: \|x\| = \sqrt{x^T x}.

1. Select the network learning constants such that:

a, b > 0 \ ,\quad \theta \in (0, 1) \ ,\quad d \in (0, 1) \ ,\quad c \in (0, 1) \ ,\quad \frac{cd}{1 - d} \leqslant 1

and the size of F2 (similarly to the ART1 algorithm).
2. Choose a contrast enhancement function, e.g. (6.19).
3. Initialize the weights:

W = \widetilde{0} \qquad\text{and}\qquad w'_{kj} \lesssim \frac{1}{\sqrt{K}(1 - d)}

W' is to be initialized with random values such that the above condition is met.
Network running and learning

1. Pick up an input vector x.
2. First initialize u = \widehat{0} and q = \widehat{0}, and then iterate the following steps till the output values of the F1 sublayers stabilize:

w = x + a u \ \to\ s = \frac{w}{\|w\|} \ \to\ v = f(s) + b f(q) \ \to\ u = \frac{v}{\|v\|} \ \to\
p = \begin{cases} u & \text{at the first iteration} \\ u + d W(:,k) & \text{on the next iterations} \end{cases} \ \to\ q = \frac{p}{\|p\|}

(F2 is inactive at the first iteration). Note that usually two iterations are enough.
3. Calculate the output of the r layer:

r = \frac{u + c p}{\|u\| + c\|p\|}

If there is a reset, i.e. \|r\| < \rho, then the F2 winner (there should be no reset at the first pass, as \|r\| = 1 > \rho) is made inactive for the current input vector; go back to step 2. If there is no reset (step 4 is always executed at least once) and a winner was found on F2, then the resonance was found: jump to step 6.
4. Find the winner on F2. First calculate the total inputs a' = W' p, then find the winner k for which a'_k = \max_\ell a'_\ell, and finally set

f_2(a') = \widehat{0} \quad\text{and afterwards}\quad f_2(a'_k) = d

as F2 is a contrast enhancement layer (see (6.22)).
5. Go back to step 2 (find the new output values of the F1 sublayers).
6. Update the weights (if learning is allowed):

W(:,k) = W'(k,:)^T = \frac{u}{1 - d}

7. The information returned by the network is the classification of the input vector, given by the winning F2 neuron in one-of-k encoding.
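A hedged Scilab sketch of one pass of the F1 stabilization loop from step 2 (e is neglected, as suggested earlier; f_enh is the contrast enhancement function sketched in section 6.5; all names are ours):

    // One pass of the ART2 F1 sublayer update (step 2 of the algorithm).
    // x: input vector; Wk: the column W(:,k) of the F2 winner, or a zero
    // vector while F2 is inactive; u, q: current sublayer outputs
    // (initialized to zero vectors before the first pass).
    function [u, p, q] = art2_f1_pass(x, u, q, Wk, a, b, d, theta)
        w = x + a * u;
        s = w / norm(w);
        v = f_enh(s, theta) + b * f_enh(q, theta);
        u = v / norm(v);
        p = u + d * Wk;        // Wk = 0 gives p = u (F2 inactive)
        q = p / norm(p);
    endfunction

Calling this function twice per cycle usually suffices, in line with the remark in step 2.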
Basic Principles

CHAPTER 7
Pattern Recognition

➧ 7.1 Patterns: The Statistical Approach

7.1.1 Patterns and Classification

(See [Bis95] pp. 1-6 and [Rip96] pp. 5, 24.)

In the most general way, intelligent behavior (living organisms, artificial intelligence machines) is represented by the ability to recognize/classify patterns, taken in their broadest definition. A pattern may represent a class of objects, a sequence of movements or even a mixture of feelings. The way the intelligent system reacts to the recognition of a pattern may also be considered a pattern.

A quantized pattern is represented by a vector x. Let C_k, k = 1,K, be the classes with respect to which the pattern x has to be classified.
✍ Remarks:
➥ Many classes of patterns overlap and, in many cases, it may be quite difficult to assign a pattern to a certain class. For this reason a statistical approach is taken, i.e. a class membership probability is attached to patterns.
➥ One of the things which may sometimes be hard is to quantize the pattern into a set of numbers such that processing can be done on them. An image may be quantized into a set of pixels; a sound may be quantized into a set of frequencies associated with pitch, volume and time duration, etc.

The pattern x has associated a certain probability of being of class C_k, different for each class. This probability is a function of the variables \{x_i\}_{i=1,N}. The main problem is to find/build these functions so as to be able to give reliable results on previously unseen patterns
(generalization), i.e. to build a statistical model. The probabilities may overlap; see figure 7.1.

Figure 7.1: Overlapping probability: considering only the x_i component, it is more probable that the x pattern is of class C_{k''}; however, it is possible for x to be of class C_{k'}.
In the ANN field, in most cases the probabilities are given by a vector y representing the network output; it is a function of the input pattern x and of some parameters named weights, collected together in a matrix W (or several matrices):

y = y(x, W)

Usually the mapping from the pattern space X to the output space Y is non-linear. ANN offer a very general framework for finding the mapping y : X \to Y (and are particularly efficient at building nonlinear models). The process of finding the adequate W weights is called learning.

The probabilities and the classes are usually determined from a learning data set already classified by a supervisor. The ANN learns to classify from the data set; this kind of learning is called supervised learning. At the end of the learning process the network should be able to correctly classify a previously unseen pattern, i.e. to generalize. The data set is called a sample or a training set, because the ANN trains/learns using it, and the supervisor is usually a human; however, there are also neural networks with unsupervised learning capabilities. There is also another type, called reinforced learning, where there is no desired output present in the data set but there is a positive or negative feedback depending on the output (desired/undesired).

If the output variables are part of a discrete, finite set then the process of neural network computing is called classification; for continuous output variables it is called regression (regression refers to functions defined as an average over a random quantity).
✍ Remarks:
➥ It is also possible to have a (more or less) special class containing all patterns x which were classified (with some confidence) as not belonging to any other class. These patterns are called outliers. The outliers usually appear due to insufficient data.
➥ In some cases there is a "cost" associated to a (mis)classification. If the cost is too high and the probability of misclassification is also (relatively) high, then the classifier may decline to classify the input pattern. These patterns are called rejects or doubts.
➥ Two more "classes" with special meaning may be considered: O for outliers and D for rejected patterns (doubts).

✍ Remarks:
➥ The performance of a network can be measured as the percentage of correct outputs (correct classifications, etc.) with regard to the total number of inputs, after the learning process has finished.

7.1.2 Feature Extraction

(See [Bis95] pp. 6-9.)
Usually the straight usage of a pattern (like taking all the pixels from an image and transforming them into a vector) may end up in a very large pattern vector. This may pose some problems to the neural network in the process of classification, because the training set is limited.

Let us assume that the pattern vector is unidimensional, x = (x1), and there are only 2 classes, such that if x1 is less than some threshold value x̃1 then x is of class C1, otherwise it is of class C2. See figure 7.2-a.

Let us now add a new feature/dimension to the pattern vector: x2, such that it becomes bidimensional, x = (x1, x2). There are 2 cases: either x2 is relevant to the classification or it is not.

If x2 is not relevant to the classification (the classification does not depend on its value), then it shouldn't have been added; it just increases the useless/useful data ratio, i.e. it just increases the noise (useless data is noise, and each xi component may bear some noise embedded in its value). Adding more irrelevant components may increase the noise to a level where it exceeds the useful information.

If it is relevant, then it must have a threshold value x̃2 such that if x2 is lower then the pattern is classified in one class, let C1 be that one (the number is irrelevant for the justification of the argument; classes may be just renumbered in a convenient way), otherwise it is of C2. Now, instead of just 2 cases to be considered (x1 less or greater than x̃1), there are four cases (for x1 and x2). See figure 7.2-b. The number of patterns in the training set being constant, the number of training patterns per case has halved (assuming a large number of training pattern vectors, spread evenly). A further addition of a new feature x3 increases the number of cases to 8. See figure 7.2-c.

In general the number of cases to be considered increases exponentially, i.e. as K^N. The training set spreads thinner in the pattern space. The performance of the neural network with respect to the dimension of the pattern space has a peak, and increasing the dimension further may decrease it. See figure 7.2-d. The phenomenon of performance decreasing as the dimensionality of the pattern space increases is known as the curse of dimensionality.
Figure 7.2: The curse of dimensionality: the increase of the pattern vector dimension n may cause a worsening of the neural network performance. Patterns are represented as dots in the pattern space. The same number of training patterns has to be spread "thinner" if there are more dimensions. (a) One dimension, 2 cases; (b) 2 dimensions, 4 cases; (c) 3 dimensions, 8 cases; (d) network performance versus n.

The process of extracting the useful information and translating the pattern into a vector is called feature extraction. This process may be manual, automatic or even done by another neural network, and takes place before entering the actual neural network.
7.1.3 Model Complexity

(See [Bis95] pp. 9-15.)

For supervised learning, the training set consists of pairs \{x_p, t_p\}_{p=1,P}, where t_p is the desired network target output vector given the input x_p.

The W weights are to be found by trying to minimize an error function E = E(x, t, W). The most widely used error function is the sum-of-squares, defined as:

E = \frac{1}{2} \sum_{p=1}^{P} \left[y(x_p, W) - t_p\right]^2
Figure 7.3: Model complexity: (a) medium complexity; (b) low complexity; (c) high complexity. The training set patterns are marked with circles, new patterns are marked with squares, exceptions are marked with a distinct symbol.

Another one is the root-mean-square error E_RMS:

E_{\text{RMS}} = \sqrt{\frac{1}{P} \sum_{p=1}^{P} \left[y(x_p, W) - t_p\right]^2}
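Both error measures are one-liners in Scilab; a hedged sketch (Y and T hold one pattern per column; names ours):

    // Sum-of-squares and root-mean-square errors over a batch of patterns.
    // Y: network outputs, T: targets, both of size (outputs x P).
    function e = sum_of_squares(Y, T)
        e = 0.5 * sum((Y - T).^2);
    endfunction

    function e = rms_error(Y, T)
        P = size(Y, 'c');                  // number of patterns
        e = sqrt(sum((Y - T).^2) / P);
    endfunction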
In solving this problem an arbitrarily complex model may be built. However, there is a trade-off between exception handling and generalization.

✍ Remarks:
➥ Exceptions are patterns which are more likely to be members of one class but in fact are from another. Misclassification usually happens due to overlapping of probabilities (see also figure 7.1), noise in data, or missing data.
➥ There may also be a fuzziness between different classes, i.e. they may not be well defined.

A reasonably complex model will be able to handle (recognize/classify) new patterns and consequently to generalize; see figure 7.3-a. A too simple model will have a low performance, with many patterns misclassified; see figure 7.3-b. A too complex model may well handle the exceptions present in the training set but may have poor generalization capabilities; see figure 7.3-c.

One widely used way to control the complexity of the model is to add a regularization term \Omega to the error function:

\widetilde{E} = E + \nu\,\Omega

where \Omega is high for complex models and low for simple models. The parameter \nu controls the weight by which \Omega influences E.
7.1.4 Classification: Making Decisions and Minimizing Risk

A neural network maps the pattern space X to the classes of patterns. The pattern space is divided into a number K of areas X_k (which may be of any possible form) such that if the pattern vector x \in X_k it is classified as being of class k. These areas are named decision regions and the boundaries between them are named decision boundaries.

The problem consists in finding the decision boundaries such that the errors of misclassification are minimized or the correctness of the classification is maximized.

Theorem 7.1.1. (Bayes rule). Given a pattern vector x to be classified into one of the classes \{C_k\}_{k=1,K}, the probability of misclassification is minimized if it is classified into the class C_k for which the posterior probability is maximal:

P(C_k|x) = \max_{\ell=1,K} P(C_\ell|x)
Proof. The decision boundaries are found by maximizing the correctness of the classification. Let us consider one finite decision region X_1 \subset X such that all pattern vectors belonging to it will be classified as C_1. The probability of making a correct classification if x \in X_1 is the joint probability associated with that class and decision region, P(C_1, X_1). Considering two decision regions and two classes, all patterns with their pattern vectors in X_1 and belonging to class C_1 will be classified correctly, and the same happens for X_2 and C_2, such that the probability of making a correct classification is the sum of the two joint probabilities, i.e. P(C_1, X_1) + P(C_2, X_2). In general, for K classes and (respectively) decision regions:

P_{\text{correct}} = \sum_{k=1}^{K} P(C_k, X_k)

The joint probability may be written as the product between the class-conditional and the prior probability (see the statistical appendix), P(C_k, X_\ell) = P(X_\ell|C_k)\,P(C_k), and also, as P(X_\ell|C_k) = \int_{X_\ell} p(x|C_k)\,dx, then:

P_{\text{correct}} = \sum_{k=1}^{K} P(X_k|C_k)\,P(C_k) = \sum_{k=1}^{K} \int_{X_k} p(x|C_k)\,P(C_k)\,dx

Each X_k should be chosen such that, inside it, the integrand p(x|C_k)\,P(C_k) is greater than any other integrand p(x|C_\ell)\,P(C_\ell), for \ell \neq k:

p(x|C_k)\,P(C_k) > p(x|C_\ell)\,P(C_\ell) \ ,\quad \forall \ell \in \{1, \ldots, K \,|\, \ell \neq k\}
\Leftrightarrow\ \frac{p(x|C_k)\,P(C_k)}{p(x)} > \frac{p(x|C_\ell)\,P(C_\ell)}{p(x)} \quad\text{(because } p(x) > 0\text{)}
\Leftrightarrow\ P(C_k|x) > P(C_\ell|x) \quad\text{(from the Bayes theorem)}
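Theorem 7.1.1 reduces classification to an argmax over posteriors; a hedged Scilab sketch (the posterior values are assumed to be already computed; names ours):

    // Bayes rule: assign x to the class with the largest posterior.
    // post: column vector with the values P(C_k|x), k = 1..K.
    function k = bayes_classify(post)
        [m, k] = max(post);
    endfunction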
✍ Remarks:
➥ The decision boundaries are placed at the points where the highest posterior probability P(C_k|x) becomes smaller than another P(C_\ell|x). See figure 7.4.

Definition 7.1.1. Given a mapping (classification procedure) M : X \to \{1, \ldots, K, D\}, the probability of misclassification of a vector x of class C_k is:

P_{\text{mc}}(k) = P(M(x) \neq C_k,\ M(x) \in \{C_1, \ldots, C_K\} \,|\, x \in C_k) = \sum_{\ell=1,\,\ell\neq k}^{K} P(X_\ell|C_k)
Figure 7.4: Decision boundary for a unidimensional pattern space and two classes.

The doubt probability P_d(k) is defined similarly:

P_d(k) = P(M(x) = D \,|\, x \in C_k) = \int_{X_D} p(x|C_k)\,dx

The total doubt probability is the probability for a pattern from any class to be unclassified (i.e. classified as doubt):

P(D|x) = \sum_{k=1}^{K} P_d(k)\,P(C_k)

More generally, the decision boundaries may be defined with the help of a set of discriminant functions \{y_k(x)\}_{k=1,K}, such that a pattern vector x is assigned to class C_k if:

y_k(x) = \max_{\ell=1,K} y_\ell(x)

and in the particular case y_k(x) = P(C_k|x) this reduces to theorem 7.1.1.
In particular cases (in practice, e.g. in the medical field) it may be necessary to increase the penalty of misclassification of a pattern from one class to another. Then a loss matrix defining penalties for misclassification is used: let L_{k\ell} be the penalty associated to the misclassification of a pattern belonging to class C_k as being of class C_\ell.

✍ Remarks:
➥ Penalty may be associated with risk: a high penalty means a high risk in case of misclassification.

The diagonal elements of the loss matrix should be L_{kk} = 0 because, in this case, there is no misclassification, so no penalty.

The penalty associated with the misclassification of a pattern x \in C_k into a particular class C_\ell is L_{k\ell} multiplied by the probability of misclassification P(X_\ell|C_k) (the probability that the pattern vector is in X_\ell but is of class C_k):

R_{k\ell} = L_{k\ell}\,P(X_\ell|C_k) = L_{k\ell} \int_{X_\ell} p(x|C_k)\,dx
✍ Remarks:
➥ The loss term $L_{k\ell}$ has the role of increasing the effect of the misclassification probability $P(X_\ell|C_k)$.

The total penalty associated with the misclassification of a pattern $x \in C_k$ into any other class is the sum, over all classes, of the penalties for misclassification of $x$ into another class:
\[ R_k = \sum_{\ell=1}^{K} R_{k\ell} = \sum_{\ell=1}^{K} L_{k\ell}\, P(X_\ell|C_k) = \sum_{\ell=1}^{K} L_{k\ell} \int_{X_\ell} p(x|C_k)\, dx \tag{7.1} \]
or, considering the matrix $P_{X|C} = \{P(X_\ell|C_k)\}_{\ell,k}$ ($\ell$ row, $k$ column index), then $R_k = L(k,:)\, P_{X|C}(:,k)$ (the $R_k$ are the diagonal terms of the $L P_{X|C}$ product).

The total penalty for misclassification is the sum of the penalties $R_k$, each multiplied by the probability that such a penalty may occur, i.e. $P(C_k)$. Defining the vector $P_C = \begin{pmatrix} P(C_1) & \dots & P(C_K) \end{pmatrix}^T$, then:
\[ R = \sum_{k=1}^{K} R_k\, P(C_k) = P_C^T \big[ (L \odot P_{X|C}^T)\, \hat{1} \big] = \sum_{\ell=1}^{K} \int_{X_\ell} \Big[ \sum_{k=1}^{K} L_{k\ell}\, p(x|C_k)\, P(C_k) \Big] dx \tag{7.2} \]
Proof. $R$ represents the multiplication between $P_C^T$ and the vector $\begin{pmatrix} R_1 & \dots & R_K \end{pmatrix}^T$. $L \odot P_{X|C}^T$ (the elementwise product) creates the elements of the sum appearing in (7.1), while the multiplication by $\hat{1}$ sums the $L \odot P_{X|C}^T$ matrix on rows.
The penalty is minimized when the $X_\ell$ areas are chosen such that the integrand in (7.2) is minimal:
\[ x \in X_\ell \ \Rightarrow\ \sum_{k=1}^{K} L_{k\ell}\, p(x|C_k)\, P(C_k) = \min_{m} \sum_{k=1}^{K} L_{km}\, p(x|C_k)\, P(C_k) \tag{7.3} \]

✍ Remarks:
➥ (7.3) is equivalent to theorem 7.1.1 if the penalty is 1 for any misclassification, i.e. $L_{k\ell} = 1 - \delta_{k\ell}$ ($\delta_{ij}$ being the Kronecker symbol).

Proof. Indeed, in this case (7.3) becomes:
\[ \sum_{\substack{k=1 \\ k \neq \ell}}^{K} p(x|C_k)\, P(C_k) < \sum_{\substack{k=1 \\ k \neq m}}^{K} p(x|C_k)\, P(C_k) \quad \text{for } x \in X_\ell\ ,\ \forall m \neq \ell \]
and, by subtracting this from the identity $\sum_{k=1}^{K} p(x|C_k)\, P(C_k) = \sum_{k=1}^{K} p(x|C_k)\, P(C_k)$, finally:
\[ p(x|C_\ell)\, P(C_\ell) > p(x|C_m)\, P(C_m)\ ,\quad \forall m \in \{1,\dots,K \,|\, m \neq \ell\} \]
Figure 7.5: Reject area around a decision boundary between two classes, in a unidimensional pattern space. (The posterior probabilities $P(C_1|x)$ and $P(C_2|x)$ are plotted over the regions $X_1$ and $X_2$; the reject area surrounds the decision boundary.)
which is equivalent to
\[ P(C_\ell|x) > P(C_m|x) \]
(see the proof of theorem 7.1.1).

In general, most misclassifications occur in the vicinity of the decision boundaries, where the difference between the top-most posterior probability and the next one is relatively low. If the penalty/risk of misclassification is very high then it is better to define a reject area around the decision boundary, such that all pattern vectors inside it are rejected from the classification process (to be analyzed by a higher instance, e.g. a human or a slower but more accurate ANN), i.e. are classified in class $D$. See figure 7.5.

Considering the doubt class $D$ as well, with the associated loss $L_{kD}$, the risk terms change to:
\[ R_k = \sum_{\ell=1}^{K} R_{k\ell} + L_{kD}\, P_d(k) \]
and the total penalty (7.2) also changes accordingly.

To accommodate the reject area and the loss matrix, the Bayes rule 7.1.1 is changed as in the proposition below.

Proposition 7.1.1. (Bayes rule with reject area and loss matrix). Let us consider a pattern vector $x$ to be classified into one of the classes $\{C_k\}_{k=1,K}$ or $D$. Let us consider the loss matrix $\{L_{k\ell}\}$ and also the loss $L_{kD} = d = \text{const.}$, $\forall k$, for the doubt class. Then:

1. Neglecting loss, the best classification is obtained when $x$ is classified as $C_k$ if
\[ P(C_k|x) = \max_{\ell=1,K} P(C_\ell|x) > P(D|x) \]
or $x$ is classified as $D$ otherwise, i.e. $P(C_\ell|x) < P(D|x)$, $\forall \ell = 1,K$.

2. Considering loss, the best classification is obtained when $x$ is classified as $C_k$ if
\[ R_k = \min_{\ell=1,K} R_\ell < R_D \]
or $x$ is classified as $D$ otherwise, i.e. $R_\ell > R_D$, $\forall \ell = 1,K$, where $R_D$ is the risk associated with a classification into the doubt category.
Figure 7.6: The graphical representation of the elements of the confusion matrix. The hatched area represents the element $e_{12}$. (The probability densities $p(C_1|x)$ and $p(C_2|x)$ are plotted over the decision regions $X_1$ and $X_2$.)
Proof. 1. This is simply an extension of theorem 7.1.1, considering a supplementary class $D$ with an associated decision area $X_D$.
2. This represents just the rule of classification according to the minimum risk.
Another useful tool for estimating the capabilities of a classifier is the confusion matrix, whose elements are defined as:
\[ e_{k\ell} = P(x \text{ classified as } C_\ell \,|\, x \in C_k) \]
As a simple geometrical representation, the element $e_{k\ell}$ represents the integral of $p(C_k|x)$ over the decision area of class $C_\ell$:
\[ e_{k\ell} = \int_{X_\ell} p(C_k|x)\, dx \]
See figure 7.6. Note that $e_{kk}$ represents the probability of a correct classification for class $C_k$.
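In practice the confusion matrix is estimated by counting, on a labelled test set, how often the patterns of each true class are assigned to each class. Below is a minimal Scilab sketch (in the spirit of the Scilab programs accompanying this book); the names labels, assigned and K are hypothetical placeholders for the true classes, the classifier decisions and the number of classes:

    // Estimate the confusion matrix e(k,l) ~ P(x classified as C_l | x in C_k)
    // labels:   1 x P vector of true classes (values 1..K)
    // assigned: 1 x P vector of classes chosen by the classifier
    function e = confusion(labels, assigned, K)
      e = zeros(K, K);
      for p = 1:length(labels)
        e(labels(p), assigned(p)) = e(labels(p), assigned(p)) + 1;
      end
      for k = 1:K                  // normalize each row by the number of
        nk = sum(e(k, :));         // patterns truly belonging to class C_k
        if nk > 0 then
          e(k, :) = e(k, :) / nk;
        end
      end
    endfunction

The diagonal elements then estimate the per-class probabilities of correct classification $e_{kk}$.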
➧ 7.2 Likelihood Function

7.2.1 The Discriminant Functions

Instead of the most obvious discriminant function $y_k(x) = P(C_k|x)$, its logarithm may be chosen:
\[ y_k(x) = \ln P(C_k|x) = \ln \frac{p(x|C_k)\, P(C_k)}{p(x)} = \ln p(x|C_k) + \ln P(C_k) + \text{const.} \]
(see the Bayes theorem and theorem 7.1.1). $p(x)$, being class independent (a normalization factor), contributes just an additive constant.
✍ Remarks:
➥ Let each class-conditional probability density $p(x|C_k)$ be an independent multidimensional Gaussian distribution³
\[ p(x|C_k) = \frac{1}{(2\pi)^{N/2} |\Sigma_k|^{1/2}} \exp\Big[ -\frac{(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}{2} \Big] \]
(each having its own $\mu_k$ and $\Sigma_k$ parameters); then
\[ y_k = -\frac{(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}{2} - \frac{1}{2} \ln |\Sigma_k| + \ln P(C_k) + \text{const.} \]
Let $\Sigma_k = \Sigma$, $\forall k \in \{1,\dots,K\}$. Then $\ln|\Sigma|$ is class independent (constant) and $x^T \Sigma^{-1} \mu_k = \mu_k^T \Sigma^{-1} x$. Eventually:
\[ y_k(x) = \mu_k^T \Sigma^{-1} x - \frac{\mu_k^T \Sigma^{-1} \mu_k}{2} + \ln P(C_k) - \frac{x^T \Sigma^{-1} x}{2} + \text{const.} \]
Because $x^T \Sigma^{-1} x$ is class independent, it may be dropped from the discriminant function $y_k(x)$ (being an additive term, equal for all discriminants). Eventually:
\[ y_k(x) = (\mu_k^T \Sigma^{-1})\, x - \frac{\mu_k^T \Sigma^{-1} \mu_k}{2} + \ln P(C_k) \]
Let us consider the matrix $M$ built using the $\mu_k$ as columns and the matrix $W$ defined as:
\[ W(1{:}K, 1{:}N) = M^T \Sigma^{-1} \quad \text{and} \quad w_{0k} = -\frac{\mu_k^T \Sigma^{-1} \mu_k}{2} + \ln P(C_k) \]
and $\widetilde{x}^T = \begin{pmatrix} 1 & x_1 & \dots & x_N \end{pmatrix}$; then the discriminant functions may be written simply as
\[ y = W \widetilde{x} \]
(i.e. the general sought form $y = y(W, x)$).
The above equation represents a linear form, such that the decision boundaries are hyperplanes. The equation describing the hyperplane decision boundary between classes $C_k$ and $C_\ell$ is found by formulating the condition $y_k(x) = y_\ell(x)$. See figure 7.7.
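The construction above is straightforward to carry out numerically. The following minimal Scilab sketch uses hypothetical variable names: mu is an $N \times K$ matrix of class means (one per column), Sigma the shared covariance matrix, Pc the vector of priors; it builds $W$ and $w_0$ and classifies a given pattern x:

    // Linear discriminant for Gaussian classes with a shared covariance.
    SI = inv(Sigma);
    K  = size(mu, 2);
    W  = mu' * SI;                 // K x N weight part, row k = mu_k' * SI
    w0 = zeros(K, 1);
    for k = 1:K
      w0(k) = -mu(:,k)' * SI * mu(:,k) / 2 + log(Pc(k));
    end
    y = W * x + w0;                // discriminant values y_k(x)
    [ymax, kwin] = max(y);         // x is assigned to class C_kwin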
7.2.2 Likelihood Function and Maximum Likelihood Procedure

The maximum likelihood method tries to find the best values for the parameters by maximizing a function named the likelihood, using the training set. The procedure below is repeated for each class $C_k$ in turn.

Let us consider a probability density function depending on $x$ and on a set of parameters $W$: $p = p(x; W)$. Let also the training set be $\{x_p\}_P = \{x_1,\dots,x_P\}$, all $x_p$ being taken from the same class.

³ See the statistical appendix.

Figure 7.7: The linear discriminant function for a bidimensional probability density and two classes. (The decision boundary is the line $y_1(x) = y_2(x)$; at the upper-left corner $p_1 > p_2$, at the lower-right corner $p_1 < p_2$.)
Considering the vectors from the training set as randomly selected (the training set being statistically significant), the joint probability density for the whole $\{x_p\}_P$ is:
\[ p(\{x_p\}_P | W) = \prod_{p=1}^{P} p(x_p|W) \equiv L(W) \tag{7.4} \]
where $L(W)$ is named the likelihood function. The method is to find the set of parameters $W$ for which $L(W)$ is maximum.
✍ Remarks:
➥ In the case of a Gaussian distribution the $W$ parameters are defined⁴ through $\mu$ and $\Sigma$: $\mu = \mathcal{E}\{x\}$ and $\Sigma = \mathcal{E}\{(x-\mu)(x-\mu)^T\}$. Then:
\[ \widetilde{\mu} = \frac{1}{P} \sum_{p=1}^{P} x_p \xrightarrow[P\to\infty]{} \mu \quad,\quad \widetilde{\Sigma} = \frac{1}{P} \sum_{p=1}^{P} (x_p - \widetilde{\mu})(x_p - \widetilde{\mu})^T \xrightarrow[P\to\infty]{} \Sigma \]
i.e. $\mu$, the mean of $x$, is replaced with $\widetilde{\mu}$, the mean of the training set (considered statistically representative); and the same happens for $\Sigma$.
➥ Assuming a unidimensional Gaussian distribution, then
\[ \widetilde{\mu} = \frac{1}{P} \sum_{p=1}^{P} x_p \quad \text{and} \quad \widetilde{\sigma}^2 = \frac{1}{P} \sum_{p=1}^{P} (x_p - \widetilde{\mu})^2 \]

⁴ See the statistical appendix.
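These estimates translate directly into a few lines of Scilab; the sketch below assumes a hypothetical matrix X ($N \times P$, one training vector per column):

    // Maximum likelihood estimates of mu and Sigma from a training set.
    P  = size(X, 2);
    mu = sum(X, 'c') / P;               // training set mean (N x 1)
    Sigma = zeros(size(X,1), size(X,1));
    for p = 1:P
      d = X(:,p) - mu;
      Sigma = Sigma + d * d';
    end
    Sigma = Sigma / P;                  // note the 1/P normalization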
Considering $\widetilde{\sigma}$ as above and $\sigma$ the true value of the standard deviation, the expectation of the assumed variance, compared to the true one, is:
\[ \mathcal{E}\{\widetilde{\sigma}^2\} = \frac{P-1}{P}\, \sigma^2 \]
Proof. $\mathcal{E}\{\widetilde{\sigma}^2\} = \frac{1}{P} \sum_{p=1}^{P} \mathcal{E}\{(x_p - \widetilde{\mu})^2\}$. The change of variable
\[ y_p = x_p - \widetilde{\mu} = \frac{P-1}{P}\, x_p - \frac{1}{P} \sum_{\substack{q=1 \\ q \neq p}}^{P} x_q \]
is done; $y_p$ is a Gaussian variable of zero mean and of variance $\sigma_y^2 = \big[ \big(\frac{P-1}{P}\big)^2 + \frac{P-1}{P^2} \big] \sigma^2 = \frac{P-1}{P}\, \sigma^2$, such that
\[ \mathcal{E}\{\widetilde{\sigma}^2\} = \frac{1}{\sqrt{2\pi}\,\sigma_y} \cdot \frac{1}{P} \sum_{p=1}^{P} \int_{-\infty}^{\infty} y_p^2 \exp\Big( -\frac{y_p^2}{2\sigma_y^2} \Big) dy_p = \sigma_y^2 = \frac{P-1}{P}\, \sigma^2 \]
(integral made by parts, similar to the calculus of $\mathcal{E}[(x-\mu)^2]$; see the unidimensional Gaussian distribution in the statistical appendix).
➥ The built probability distribution has a bias which tends to 0 for $P \to \infty$. In this case the Gaussian distribution is also suited for sequential parameter estimation. Assuming that not all patterns from the training set are known at once, new patterns may be added later as they become available, i.e. the $W$ parameters may be adapted to the new training set.
For the Gaussian distribution, adding a new pattern changes $\widetilde{\mu}_P = \frac{1}{P} \sum_{p=1}^{P} x_p$ to the new value:
\[ \widetilde{\mu}_{P+1} = \frac{1}{P+1} \sum_{p=1}^{P+1} x_p = \widetilde{\mu}_P + \frac{x_{P+1} - \widetilde{\mu}_P}{P+1} \]
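The update requires neither the storage of past patterns nor a re-summation; a minimal Scilab sketch (the names mu, P and x_new, holding the current estimate, the current count and the incoming pattern, are hypothetical):

    // Sequential (on-line) update of the mean when a new pattern arrives.
    P  = P + 1;
    mu = mu + (x_new - mu) / P;    // mean of the enlarged training set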
For multiple classes the likelihood function is defined as:
\[ L(W) = \prod_{p=1}^{P} p(x_p|W) = \prod_{k=1}^{K} \prod_{x_p \in C_k} p(x_p|C_k, W)\, P(C_k|W) \]
Maximizing the likelihood function is equivalent to minimizing its negative logarithm, which is the usual way in practice. Considering that in the training set there are $P_k$ vector patterns for each class $C_k$, then:
\[ E = -\ln L = -\sum_{k=1}^{K} \sum_{p=1}^{P_k} \ln p(x_{(k)p}|C_k, W) - \sum_{k=1}^{K} P_k \ln P(C_k|W) \tag{7.5} \]
where $x_{(k)p} \in C_k$.
For a given training set the above expression may be reduced to:
\[ E = -\sum_{k=1}^{K} \sum_{p=1}^{P_k} \ln p(x_{(k)p}|C_k, W) + \text{const.} \tag{7.6} \]
Proof. The expression is reduced by using the Lagrange multipliers method⁵, using as minimization condition the normalization of $P(C_k|W)$:
\[ \sum_{k=1}^{K} P(C_k|W) = 1 \tag{7.7} \]
which leads to the Lagrange function:
\[ \mathcal{L} = E + \lambda \Big( \sum_{k=1}^{K} P(C_k|W) - 1 \Big) \]
and then
\[ \frac{\partial \mathcal{L}}{\partial P(C_k|W)} = -\frac{P_k}{P(C_k|W)} + \lambda = 0 \ \Rightarrow\ P(C_k|W) = \frac{P_k}{\lambda} \quad,\quad k = 1,K \]
and replacing into (7.7) gives
\[ \lambda = \sum_{k=1}^{K} P_k = P \ \Rightarrow\ P(C_k|W) = \frac{P_k}{P} \]
As the training set is fixed, the $\{P_k\}$ are constant and the likelihood (7.5) becomes (7.6).
As can be seen, formula (7.5) contains a sum both over the classes and inside each class. If gathering data is easy but having it classified (by some supervisor) is difficult ("expensive"), then the function (7.6) may be replaced by:
\[ E = -\sum_{k=1}^{K} \sum_{p=1}^{P_k} \ln p(x_{(k)p}|C_k, W) - \sum_{p'=1}^{P'} \ln p(x_{p'}|W) \]
The unclassified training set $\{x_{p'}\}$ may still be very useful in finding the appropriate set of $W$ parameters.
✍ Remarks:
➥ The maximum likelihood method is based on finding the $W$ parameters for which the likelihood function $L(W)$ is maximum, i.e. where its derivative is 0; the $\widetilde{W}$ are the roots of the derivative, and the Robbins{Monro algorithm⁶ may be used to find them.
The maximum of the likelihood function $L(W)$ (see equation (7.4)) may be found from the condition:
\[ \nabla_W \Big[ \prod_{p=1}^{P} p(x_p|W) \Big] \Big|_{\widetilde{W}} = 0 \tag{7.8} \]
where $\nabla_W$ is the vector $\begin{pmatrix} \frac{\partial}{\partial w_1} & \dots & \frac{\partial}{\partial w_{N_W}} \end{pmatrix}^T$, $N_W$ being the dimension of the $W$ parameter space.
Because $\ln$ is a monotone function, $\ln L$ may be used instead of $L$ and the above condition becomes:
\[ \frac{1}{P}\, \nabla_W \Big[ \sum_{p=1}^{P} \ln p(x_p|W) \Big] \Big|_{\widetilde{W}} = 0 \]
(the factor $1/P$ is constant and does not affect the result). Considering the vectors from the training set randomly selected, then
\[ \lim_{P\to\infty} \frac{1}{P}\, \nabla_W \Big[ \sum_{p=1}^{P} \ln p(x_p|W) \Big] = \mathcal{E}\{\nabla_W \ln p(x|W)\} \]
and the condition (7.8) becomes:
\[ \mathcal{E}\{\nabla_W \ln p(x|W)\} = 0 \]
The roots $\widetilde{W}$ of this function are found using the Robbins{Monro algorithm.
➥ It is not possible to choose the $W$ parameters by an unconstrained maximization of the likelihood function $L(W) = \prod_{p=1}^{P} p(x_p|W)$ (see (7.4)) alone, because $L(W)$ may be increased indefinitely by overfitting the training set, such that the estimated probability density is reduced to a function similar to the Dirac function, having the value $\infty$ at the training set points and 0 elsewhere.

⁵ See the mathematical appendix.
⁶ See the statistical appendix.
➧ 7.3 Statistical Models

It is important to note that the statistical model built as $p(x|W)$ generally differs from the true probability density $p_{\mathrm{true}}(x)$, which is also independent of the $W$ parameters. The estimated probability density will give the best fit of $p_{\mathrm{true}}(x)$ for some parameters $W_0$. These parameters could be found given an infinite training set; in practice, as the learning set ($P$) is finite, only an estimate $\widetilde{W}$ of $W_0$ may be found.

It is possible to build a function which measures the "distance" between the estimated and the real probability densities. Then the $W$ parameters have to be found such that this function is minimal.

Let us consider the expected value of the minus logarithm of the likelihood function:
\[ \mathcal{E}\{-\ln L\} = -\lim_{P\to\infty} \frac{1}{P} \sum_{p=1}^{P} \ln p(x_p|W) = -\int_X \ln[p(x|W)]\, p_{\mathrm{true}}(x)\, dx \]
which, for $p(x|W) = p_{\mathrm{true}}(x)$, has the value $-\int_X p_{\mathrm{true}}(x) \ln p_{\mathrm{true}}(x)\, dx$. The following function, called the asymmetric divergence⁷ (also known as the Kullback{Leibler distance), is defined:

⁷ See [Rip96] pp. 32{34.
\[ \mathcal{L} = \mathcal{E}\{-\ln L\} + \int_X p_{\mathrm{true}}(x) \ln p_{\mathrm{true}}(x)\, dx = -\int_X \ln \frac{p(x|W)}{p_{\mathrm{true}}(x)}\, p_{\mathrm{true}}(x)\, dx \tag{7.9} \]
Proposition 7.3.1. The asymmetric divergence $\mathcal{L}$ is positive definite, i.e. $\mathcal{L} \geqslant 0$, the equality being for $p(x|W) = p_{\mathrm{true}}(x)$.

Proof. Let us consider the function $f(x) = -\ln x + x - 1$. Its derivative is $\frac{df}{dx} = -\frac{1}{x} + 1$ and is negative for $x < 1$ and positive for $x > 1$. It follows that the function $f(x)$ has a minimum at $x = 1$, where $f(1) = 0$. Because $\lim_{x\to\infty} f(x) = \infty$, the function $f(x)$ is positive definite, i.e. $f(x) \geqslant 0$, $\forall x$, the equality happening for $x = 1$.
Let us now consider the function:
\[ f\Big( \frac{p(x|W)}{p_{\mathrm{true}}(x)} \Big) = -\ln \frac{p(x|W)}{p_{\mathrm{true}}(x)} + \frac{p(x|W)}{p_{\mathrm{true}}(x)} - 1 \geqslant 0 \]
and because it is positive definite its expectation is positive as well:
\begin{align*} \mathcal{E}\Big\{ f\Big( \frac{p(x|W)}{p_{\mathrm{true}}(x)} \Big) \Big\} &= -\int_X \ln \frac{p(x|W)}{p_{\mathrm{true}}(x)}\, p_{\mathrm{true}}(x)\, dx + \int_X \frac{p(x|W)}{p_{\mathrm{true}}(x)}\, p_{\mathrm{true}}(x)\, dx - 1 \\ &= -\int_X \ln \frac{p(x|W)}{p_{\mathrm{true}}(x)}\, p_{\mathrm{true}}(x)\, dx \geqslant 0 \end{align*}
($p(x|W)$ is normalized such that $\int_X p(x|W)\, dx = 1$) and then $\mathcal{L} \geqslant 0$, being 0 when the probability distributions are equal.
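For distributions discretized into histograms the asymmetric divergence is a plain finite sum, which makes the positivity easy to check numerically. A minimal Scilab sketch, assuming two vectors of the same length, both normalized to sum 1 and with nonzero model entries:

    // Asymmetric divergence (Kullback-Leibler distance) between two
    // discretized probability distributions.
    function L = asym_div(p_true, p_model)
      i = find(p_true > 0);      // terms with p_true = 0 contribute 0
      L = sum(p_true(i) .* log(p_true(i) ./ p_model(i)));
    endfunction

asym_div(p, p) returns 0, while distinct distributions give a strictly positive value, in agreement with proposition 7.3.1.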
As previously discussed, usually the model chosen for the probability density $p(x|W)$ is not even from the same class of models as the "true" probability density $p_{\mathrm{true}}(x)$, i.e. they may have totally different functional expressions. However, there is a set of parameters $W_0$ for which the asymmetric divergence (7.9) is minimized:
\[ \min_W \Big\{ \int_X [\ln p_{\mathrm{true}}(x) - \ln p(x|W)]\, p_{\mathrm{true}}(x)\, dx \Big\} = \int_X \ln \frac{p_{\mathrm{true}}(x)}{p(x|W_0)}\, p_{\mathrm{true}}(x)\, dx \tag{7.10} \]
The minimization of the negative logarithm of the likelihood function involves finding a set of ("optimal") parameters $\widetilde{W}$. Due to the limitation of the training set, in general $\widetilde{W} \neq W_0$, but in the limit $P \to \infty$ ($P$ being the number of training patterns) $\widetilde{W} \xrightarrow[P\to\infty]{} W_0$.
Considering the nabla operator $\nabla$ and the negative logarithm of the likelihood $E$:
\[ \nabla^T = \begin{pmatrix} \frac{\partial}{\partial w_1} & \dots & \frac{\partial}{\partial w_{N_W}} \end{pmatrix} \quad,\quad E = -\ln L(W) = -\ln \prod_{p=1}^{P} p(x_p|W) \]
then the following matrices are defined:
\[ J = \mathcal{E}\{\nabla \nabla^T E\} = -\mathcal{E}\Big\{ \sum_{p=1}^{P} \nabla \nabla^T \ln p(x_p|W) \Big\} = -\sum_{p=1}^{P} \nabla \nabla^T \ln p(x_p|W) \Big|_{W_0} \]
and
\[ K = \mathcal{E}\{(\nabla E)(\nabla E)^T\} = \mathcal{E}\Big\{ \Big( \sum_{p=1}^{P} \nabla \ln p(x_p|W) \Big) \Big( \sum_{p=1}^{P} \nabla \ln p(x_p|W) \Big)^T \Big\} = \Big( \sum_{p=1}^{P} \nabla \ln p(x_p|W_0) \Big) \Big( \sum_{p=1}^{P} \nabla \ln p(x_p|W_0) \Big)^T \]
For $P$ sufficiently large it is possible to approximate the distribution of $\widetilde{W} - W_0$ by the (Gaussian) normal distribution $\mathcal{N}_{N_W}(\hat{0},\, J^{-1} K J^{-1})$ (here and below the $W$ are seen as vectors, i.e. column matrices).

Proof. $E$ is minimized with respect to $W$; then:
\[ \nabla E|_{\widetilde{W}} = \hat{0} \ \Rightarrow\ \sum_{p=1}^{P} \nabla \ln p(x_p|\widetilde{W}) = \hat{0} \]
Considering $P$ reasonably large, $\widetilde{W}$ and $W_0$ are sufficiently close and a Taylor series development may be done around $W_0$ for $\nabla E|_{\widetilde{W}}$:
\[ \hat{0} = \nabla E|_{\widetilde{W}} = \nabla E|_{W_0} + H|_{W_0} (\widetilde{W} - W_0) + \mathcal{O}\big( (\widetilde{W} - W_0)^2 \big) \]
where $H = \nabla \nabla^T E$ is named the Hessian. Finally:
\[ \widetilde{W} - W_0 \simeq -H^{-1}|_{W_0}\, \nabla E|_{W_0} \ \Rightarrow\ \mathcal{E}\{\widetilde{W} - W_0\} = -\lim_{P\to\infty} H^{-1}|_{W_0}\, \nabla E|_{W_0} = \hat{0} \]
as $\widetilde{W} \xrightarrow[P\to\infty]{} W_0$. Also:
\[ \mathcal{E}\{(\widetilde{W} - W_0)(\widetilde{W} - W_0)^T\} = \mathcal{E}\{H^{-1}\, \nabla E\, \nabla^T E\, H^{-1\,T}\} = J^{-1} K J^{-1} \tag{7.11} \]
(by using the matrix property $(AB)^T = B^T A^T$).
Definition 7.3.1. The deviance $D$ of a pattern vector is defined as being twice the expectancy of the log-likelihood of the best model minus the log-likelihood of the current model, the best model being the true model or an exact fit, also named a saturated model:
\[ D = 2\, \mathcal{E}\{\ln p_{\mathrm{true}}(x) - \ln p(x|\widetilde{W})\} \]
The deviance may be approximated by:
\[ D \simeq 2\mathcal{L} + \mathrm{Tr}(K J^{-1}) \]
Proof. Considering the Taylor series development of $\ln p(x|\widetilde{W})$ around $W_0$ (here $H = \nabla \nabla^T \ln p(x|W)$):
\[ \ln p(x|\widetilde{W}) \simeq \ln p(x|W_0) + \nabla^T \ln p(x|W)|_{W_0} (\widetilde{W} - W_0) + \frac{1}{2} (\widetilde{W} - W_0)^T H|_{W_0} (\widetilde{W} - W_0) \]
It is assumed that the gradient of the asymmetric divergence (7.10) (see also (7.9)) is zero at $W_0$:
\[ \nabla \mathcal{L} = \mathcal{E}\{\nabla \ln p(x|W_0)\} = \hat{0} \]
(as it hits a minimum). The following matrix relation is also true:
\[ (\widetilde{W} - W_0)^T H|_{W_0} (\widetilde{W} - W_0) = \mathrm{Tr}\big( H|_{W_0} (\widetilde{W} - W_0)(\widetilde{W} - W_0)^T \big) \]
(it may be proven by making $H|_{W_0}$ diagonal, using its eigenvectors, as it is a symmetric matrix; see the mathematical appendix) and then, from the definition, the deviance approximation is:
\[ D \simeq 2\mathcal{L} - \mathcal{E}\{\mathrm{Tr}(H|_{W_0} (\widetilde{W} - W_0)(\widetilde{W} - W_0)^T)\} \]
and finally, using also (7.11):
\[ D \simeq 2\mathcal{L} + \mathrm{Tr}\big[ J\, \mathcal{E}\{(\widetilde{W} - W_0)(\widetilde{W} - W_0)^T\} \big] = 2\mathcal{L} + \mathrm{Tr}(K J^{-1}) \]

Considering a sum over the training examples, instead of an integration over the whole $X$, the deviance $D_N$ for a given training set may be approximated as
\[ D_N \simeq 2 \sum_{p=1}^{P} \ln \frac{p_{\mathrm{true}}(x_p)}{p(x_p|\widetilde{W})} + \mathrm{Tr}(K J^{-1}) \]
where the left term of the above equation is named the information criterion.
CHAPTER 9... 

➧ wait
130
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
x2
( )=0
y w
w
C1
; kw0k
w
Figure 8.1:
x1
C2
Linear discriminant in a two dimensional pattern space
with two classes.
de ned by y(x) = 0. Then:
distance =
❖
N
T
0
kwk = ; kwk
w x
w
such that the bias de nes the shift of the hyperplane from origin.
Considering N to be the dimension of the pattern space, the whole classi cation problem
may be transferred to a N + 1 dimensional pattern space by considering the translations
x ! x = (1; x)
and w ! w = (w0 ; w)
e
e
such that the discriminant equation becomes:
T
y (x) = w x
ee
de ning a hyperplane in the N + 1 dimension space, passing trough the origin (bias is 0
now).
The whole process may be expressed in terms of one neuron which have N + 1 inputs and
e
one output being the weighted sum of its inputs y(x) = 1 w0 +
on the facing page.
Multiple Classes Case
Let consider several classes fC g
k
k
N
i
=1
wi xi
. See gure 8.2
and one linear discriminant function for each class:
=1;K
( ) = wT x +
yk x
P
k
wk0
;
k
=1
(8.2)
;K
such that a pattern x is assigned to class C if y (x) = max y (x), K being the dimension
k
k
`
❖
K
=1;K
`
of the output space.
The decision boundary between classes C and C is given by the equation y (x) = y (x):
( w ; w ) T x + (w 0 ; w 0 ) = 0
k
k
`
`
k
k
`
`
8.1. LINEAR SEPARABILITY
131
1
x1
w0
xN
w1
wN
e
y (x)
Figure 8.2:
A single, simple, neuron performs the weighted sum of its
inputs and may act as a linear classi er.
e
x0
=1
e
1
e
y1 (x)
Figure 8.3:
e
x1
xN
2
K
e
e
y2 (x)
yK (x)
A single layer of neurons may act as a linear classi er for
classes. The connection between input i and neuron k
e .
is weighted by we being the component i of vector w
K
ki
k
which is the equation of a hyperplane in the pattern space.
The distance from the hyperplane (k; `) to the origin is:
distance(
k;`)
=
; kw ;; w k
wk0
w`0
k
`
Similarly to the previous two classes case it is possible to move to the N + 1 space by the
transformation:
e = (1; x) and w ! w
e = (w 0 ; w ) ; k = 1 ; K
x ! x
k
k
k
k
and then the whole process may be represented by a neural network having one layer of K
neurons and N + 1 inputs. See gure 8.3.
e T as rows then network output is simply written
Note that if a matrix W is built using w
as:
k
e
e
y (x ) = W x
The training of the network consists in nding the adequate W . A new vector x is assigned
to the class C for which the corresponding neuron k have the biggest output y (x).
k
k
❖
W
132
linear
separability
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
De nition 8.1.1. Considering a set of pattern vectors, they are called linearly separable if
they may be separated by a set of hyperplanes as decision boundaries in the pattern space.
Proposition 8.1.1. The regions Xk de ned by linear discriminant functions yk (x) are simply connected and convex.
Let consider two vector patterns: x x 2 .
Proof.
Then
a;
yk
(x ) = max (x ) and
a
`=1;K
y`
a
yk
Xk
(x ) = max (x ). Any point on the line connecting
b
may be de ned by the vector:
xc
b
=
`=1;K
y`
b
xa + (1 ; t)xb where
t
t
2 [0 1]
;
(see also Jensen's inequality in mathematical appendix).
Also y (x ) = ty (x ) + (1 ; t)y (x ) and then y (x ) = max
k
c
k
a
k
b
k
c
xa and xb
`=1;K
y`
(x ), i.e. x 2
c
c
Xk
(because
y`
are
linear).
Then any line connecting two points, from the same X , is contained in the X domain , X is convex
and simple connected.
k
k
k
8.1.2 Neuronal Memory Capacity
general position
❖
F (P; N )
Let consider one neuron with N inputs and one output which have to learn P pattern
vectors. All input vectors belong to one of two classes, i.e. either C1 or C2 and the output of
neuron indicates to which class the input belongs (e.g. y(x) = 1 for x 2 C1 and y(x) = ;1
for x 2 C2 ) Considering that the input vectors are points in RN space the neuron may learn
only those cases where the inputs are linearly separable by a hyperplane. As the number
of linearly separable cases is limited so is the learning capacity/memory of a single neuron
(and of course the learning capacity of the network is limited as well).
Let consider that there are P xed points in RN , in general position, i.e. for N > 2 there
are not N or fewer points linearly dependent. Let consider that either of these points may
belong to class C1 or C2 , the total number of combinations is 2P (as each point brings up
P cases some are linearly separable and
2 cases, independently of the others). From the 2
some are not, let F (P; N ) be the number of linearly separable cases. Then the probability
of linear separability is:
Probability of linear separability =
F (P; N )
2
P
Proposition 8.1.2. The number of linearly separable cases is given by:
F (P; N ) = 2
N ;
X
P
i=0
1
i
(8.3)
(where 0! = 1 by de nition).
Proof.
It is proven by induction.
A hyperplan in R is de ned by the equation aT x + b = 0 (where x is a point contained in the hyperplan),
i.e. is de ned by N + 1 parameters.
N
8.1.2
See [Rip96] pp. 119{120.
8.1. LINEAR SEPARABILITY
Let consider rst the case:
6
133
. Then the set of equations:
8
>
<> 0 one side of hyperplan
aT xi + b = yi < 0 other side of hyperplan
>
:
= 0 contained in the hyperplan
de ne the hyperplan parameters. As there are N + 1 parameters and at most N + 1 equations and the
points are in general position then the system of equations with unknowns a and b (hyperplan parameters)
have always a solution, i.e. there is always a way to separate the two classes with a hyperplan (there may
be several solutions); then:
P for P 6 N + 1
F (P; N ) = 2
P
N
+1
Let now consider the case P > N + 1. A recurrent formula is attempted for F (P + 1; N ). Let consider that
P linearly separable points are already \in position", i.e. one case from F (P; N ), and a new one is added.
There are two cases to be considered:
The new point sides on one side only of any separating (the P set) hyperplane. Then the new set is
separable only if the new point is on the \correct" side, i.e. the same as for its class.
If the above it's not true then it is possible to choose the separating hyperplane as to pass trough the
new point. Then no matter to which class the new point is assigned, the new set is linearly separable.
Let considering again only the P points set and the hyperplane as chosen above. If all points are
projected into a (new) hyperplane perpendicular on the separating one, then the points in the new
hyperplane are linearly separate (by the hyperline given by the intersection of the two hyperplanes).
This means that the number of possibilities in this situation is F (P; N ; 1) and the number of
combinations is 2F (P; N ; 1), as the P + 1-th point may be assigned to either class.
Finally, the rst case analyzed above gives F (P; N ) minus the number of possibilities in the second case
(i.e. F (P; N ; 1)) and the second case gives 2F (P; N ; 1). Thus the wanted recurrent formula is:
F (P + 1; N ) = [F (P; N ) ; F (P; N ; 1)] + 2F (P; N ; 1) = F (P; N ) + F (P; N ; 1)
(8.4)
Induction: for F (P; N ; 1), from (8.3), the expression is:
N;
X1 P ; 1
F (P; N
; 1) = 2
=2
N P
X
; 1
;1
i
i=1 i
i=0
and then, using (8.4), (8.3) and the above equation, the expression for F (P + 1; N ) is:
N P
N P ; 1 P ; 1 X
X
F (P
+ 1; N ) =
F (P; N )
+ F (P; N
; 1) = 2 + 2
=2
+
i
i
i=0 i
i=1
;
;
;
i.e. is of the same form as (8.3) (the property P;i 1 + P;i 1 = Pi was used here1 ).
For P = 4 and N = 2 the total number of cases is 24 = 16 out of which 14 are linearly separable. One of
the two cases which may not be linearly separated is depicted in gure 8.4 on the following page, the other
one is its mirror image. So the formula (8.3) checks also for an initial case.
The probability of linear separability is then:
8
<1
Plinear
separability
=
and then
i=0
8
>
<> 0:5
Plinear
separability
1
: 2P1;1
See mathematical appendix.
= 0:5
>
:
<
N ;P ;1
P
0:5
i
for P
for P
for P
<
P
P
6
>
N
+1
N
+1
2(N + 1)
= 2(N + 1)
>
2(N + 1)
134
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
x2
C
C
2
(1; 1)
C
C
1
(0; 0)
Figure 8.4:
1
(0; 1)
2
(1; 0)
x1
The XOR problem. The vectors marked with black circles
are from one class; the vectors marked with white circles are from the other class. C1 and C2 are not linearly
separable.
i.e. the memory capacity of a single neuron is around 2(N + 1).
✍
Remarks:
➥
➥
As the points (pattern vectors) from the same class are usually (to some extent)
correlated then the memory capacity of a single neuron is much higher than the
above reasoning suggests.
A simple problem of classi cation where the pattern vectors are not linear separable is the exclusive-or (XOR), in the bidimensional space.
The vectors (0; 0) and (1; 1) are from one class (0 xor 0 = 0, 1 xor 1 = 0); while
the vectors (0; 1) and (1; 0) are from the other (1 xor 0 = 1, 0 xor 1 = 1). See
gure 8.4.
8.1.3 Logistic discrimination
The discriminant functions may be generalized by replacing the linear functions (8.2) with
a monotone function applied to the linear combination of w and x
k (x) = f (wkT x + wk0 )
y
activation
function
;
k
= 1; K
where f is named activation function.
From the Bayesian theorem:
p(
Ck jx) =
jCk ) (Ck ) =
p(x
p(xjCk )P (Ck )
K
P p(xjC`) P (C`)
`=1
P
p(x)
=
1
1
P p xjC P C 1 + ;a
K
1+
`=1 ; `=k
(
`)
p(xjC )P (C
6
k
k)
(
`)
e
=
(8.5)
8.1. LINEAR SEPARABILITY
135
f ( a)
1
2
;1
Figure 8.5:
;
0
1
The logistic signal activation function. The particular
case of step function is drawn with a dashed line. The
maximum value of the function is 1.
where:
a ln
p(xjCk )
K
P
`=1
`6=k
✍
a
+
p(xjC` )
+ ln
P (Ck )
K
P
`=1
`6=k
(8.6)
P (C` )
Remarks:
➥
For the Gaussian model with the same variance matrix for all classes:
(x ; k )T ;1 (x ; k )
exp
;
p(xjCk ) =
2
(2)n=2 jj
1p
;
k = 1; K
and two classes C1 and C2 then the expression of a becomes:
a = wT x + w0
where ( is symmetric):
w
= ;1 (1 ; 2 )
T ;1 1 ; T2 ;1 2
P (C1 )
+
ln
2
P (C2 )
w0 = ; 1
Then, by choosing the logistic sigmoid activation function as f (a) = 1+1e;a |
see gure 8.5 | the meaning of the neuron output becomes simply the posterior
probability P (Ck jx).
➥ The logistic sigmoid activation function have also the property of mapping the
interval (;1; 1) into [0; 1] and thus limiting the neuron output.
Another choice for the activation function of neuron is the threshold (step) function:
f (a) =
(
1 for a > 0
0 for a < 0
(8.7)
sigmoid
function
threshold
function
136
perceptron
adaline
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
and the neurons having it are called perceptrons or adaline.
✍
Remarks:
➥
8.1.4
Bernoulli
distribution
The step function is the particular case of the logistic signal activation function
1
f (a) =
1+e;ca when c ! 1. See gure 8.5 on the page before.
Binary pattern vectors
A binary pattern vector x have its components xi 2 f0; 1g.
Let Pki be the probability that the component i of vector x 2 Ck is xi = 1, respectively
(1 ; Pki ) is the probability of xi = 0. Then:
xi
1;x
p(xi jCk ) = P
ki (1 ; Pki ) i
also named Bernoulli distribution. Assuming that the components of the pattern vector x
are statistically independent then:
N
xi
1;x
P
p(xjCk ) =
ki (1 ; Pki ) i
i=1
Y
By taking the discriminant function in the form of:
k (x) = ln P (xjCk ) + ln P (Ck )
y
then it may be written as:
k (x) = wkT x + w0k
y
where:
ki = ln Pki ; ln(1 ; Pki ) ; i = 1; N and
N
ln(1 ; Pki ) + ln P (Ck )
w0k =
i=1
w
X
Similar as above, from the Bayesian theorem, see (8.5) and (8.6), | for two classes C1
and C2 | the posterior probability P (Ck jx) may be expressed as the output of the neuron
having the logistic sigmoidal activation function:
P(
C1 jx) = (wT x +
f
where:
w
w
0
1i
2i
P
i = ln P
=
XN
ln
w
0) =
1
1 + exp[
; ln 11 ;; 1i
P
P
1
1
;
2i
i
; 1i + ln (C1 )
; 2i
(C2 )
P
;(wT x +
= 1; N
and
P
P
i=1
and P (C2 jx) have a similar form (obtainable by swapping 1 $ 2).
P
w
0 )]
8.2. THE LEAST SQUARES TECHNIQUE
137
8.1.5 Generalized linear discriminants
Generalized linear discriminant functions are obtained by replacing x with a vectorial function
of it: ' : X ! X having the same dimension. The discriminant functions then becomes
y
k(
'
(8.8)
x) = wkT (x) + w0k
and, by switching to the N + 1 space:
y
e
e 'e
x) = wkT (x)
k(
where 'e0 (x) 1.
e associated with each output
By building the (N + 1) K matrix W having the weights w
neuron as rows:
0 we we 1
10
1
CA
W =B
@ ... . . .
we 0 we
k
❖
W
N
K
KN
then:
y(x) =
➧ 8.2
W 'e (x)
The Least Squares Technique
8.2.1 The Error Function
Let fx g =1 be the training set and ft g
the sum-of-squares error function is:
p
p
p
;P
E (W ) =
1
2
XX
P
p
the desired output of the network. Then
=1;P
p
K
y
[
=1 k=1
xp ; wk )
k(
;t
kp ]
E (W ) =
p
1
2
XX
P
p
K
=1 k=1
e 'e
wkT (xp )
;t
2 = 1 X[W 'e (x ) ; t ]T [W 'e (x ) ; t ]
P
kp
2
p
p
=1
p
sum-of-squares
error function
(8.9)
2
where t is the component k of desired vector t .
Considering the generalized linear discriminants of the form (8.8) then E (W ) is a quadratic
function of weights and its derivatives are linear function of weights and then the minimum
of the error function may be found exactly in closed form.
kp
❖ tp
p
p
❖
t
❖
~y
kp
(8.10)
e T on rows).
(here W contains w
The geometrical interpretation of error function
Let consider the P -dimensional vector ~y which components are the outputs of the same
k
8.2
See [Bis95] pp. 89{98.
k
138
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
neuron
k while the network is presented with the xp training vectors:
~ykT =
; T
e k e ( 1)
w 'x
we kT'e (xP )
:::
;
e (x) vectorial function: '
e (x)T = '
e0 (x) : : : '
eN (x) and
Using the components of the '
its value for the training set vectors xp , it is possible to build the vector:
;
'~ Ti = 'ei (x1 ) : : : 'ei (xP )
As the component p of vector ~yk is f~yk gp
=
N
P
i=0
weki 'ei (xp ) then ~yk may be written as a
linear combination of '~ i in the N + 1 space:
N
X
~yk = weki '~ i
i=0
(8.11)
e k ).
(weki being the component i of vector w
K
N P
K
P
P
weki '~ i and, again, is a linear combiThe sum of ~yk vectors is ~ytotal = ~yk =
i=0 k=1
k=1
nation of '~ i .
Similar, the vector ~tk may be build, using the target values for output neuron k given the
input vector xp as being tkp :
~tTk = ;tk1 : : : tkP
Finally, the sum{of{squares error function (8.9) may be written as:
E=
P K
1 XX
N
X
wek` 'e` (xp ) ; tkp
p=1 k=1 `=0
K
1X
=
k~yk ; ~tk k2
2
k=1
2
!2
K P
=
1 XX
2
k=1 p=1
f~yk gp ; tkp )2
(
(8.12)
Let make the reasonable assumption that the number of inputs is smaller that the number
of sample vectors in the training set, i.e. N + 1 6 P (what happens if this is not true is
discussed later). Let consider the space of dimension P : the set of N + 1 vectors f'~ i g
de ne a subspace in which all ~yk are contained, the vector ~tk being in general not included.
Then the sum{of{squares error function (8.12) represents simply the sum of all distances
between ~yk and ~tk . See gure 8.6 on the facing page.
The ~tk may be decomposed into two components:
~tkk 2 S ('~ 0 ; : : : ; '~ N ) and ~tk? ? S ('~ 0 ; : : : ; '~ N )
❖
S
where S ('~ 0 ; : : : ; '~ N ) is the sub-space de ned by the set of functions f'~ i g. See gure 8.6.
8.2. THE LEAST SQUARES TECHNIQUE
~tk
'~ 1
~tk?
~tkk
S ('~ 0 ; '~ 1 )
Figure 8.6:
139
~yk ; ~tk
~yk
'~ 0
The error vector ~yk ; ~tk for output neuron k. S ('~ 0 ; '~ 1 )
represents the subspace de ned by the set of functions
f'~ i g in a bidimensional case | one input neuron plus
bias. The ~yk and ~tkk are included in S ('~ 0 ; '~ 1 ) space.
The minimum of error function: E = 12
K
P
(~yk ; ~tk )T (~yk ; ~tk ) (see (8.11) and (8.12)) with
k=1
respect to weki is found from the condition that its derivatives are zero:
@E
(8.13)
= 0 , '~ Ti (~yk ; ~tk ) = 0 ; k = 1; K ; i = 0; N
@ weki
and because ~tk = ~tkk + ~tk? and '~ Ti ~tk? = 0 (by choice of ~tkk and ~tk? ) then the above
condition is equivalent with:
'~ Ti (~yk ; ~tkk ) = 0 , ~yk = ~tkk ; k = 1; K
(8.14)
i.e. fweki g should be chosen such that ~yk = ~tkk | see also gure 8.6.
Note that there is always a \residual" error due to the ~tk? .
Assuming that the network is optimized such that ~yk = ~tkk , 8k 2 f1; : : : ; K g then
~yk ; ~tk = ;~tk? and the error function (8.12) becomes:
K
K
1X
1X
~ 2
k
~yk ; ~tk k2 =
Emin =
2 k=1
2 k=1 ktk? k
8.2.2
The Pseudo{inverse solution
Let build the P (N + 1) matrix from '~ i used as columns:
0 'e ( ) 'e ( ) 1
0 1
N 1
CA
.
.
.
.
=B
@ .
.
'e0 ( P ) 'eN ( P )
x
x
x
x
❖
140
❖T
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
also the P K matrix T using ~tk vectors as columns:
0t
11
B
. ..
T = @ ..
.
tK 1
t1P
tKP
1
CA
From the above matrices and (8.11): ~yk = W (k; :)T . Then using the above notations,
the set of minima conditions (8.14) may be written in matrix form as:
(T )W T ; T T = e0
Assuming that the square (N + 1) (N + 1) matrix T is inversable then the solution
for the weights is:
WT
pseudo-inverse
matrix
;
= T ;1 T T = y T
(8.15)
where y is the pseudo-inverse matrix of (which generally is not square) and is de ned
as:
;
y = T ;1 T
If the T is not inversable then taking an
de ned as:
"
2 R the pseudo-inverse matrix may be
; T
;1 T
y = "lim
!0 + "I
✍
Remarks:
➥
➥
➥
As the matrix is built from the '~ i set of vectors, if two of them are parallel (or
nearly parallel) then the T is singular (or nearly singular) | the rank of the
matrix will be lower.
The case of nearly singularity also leads to large weights necessary to represent
the solution ~yk = ~tkk . See gure 8.7 on the facing page.
In case of two parallel vectors '~ i k '~ ` the one of them is proportional with
another: '~ i / '~ ` and then they may be combined together in the error function
and thus reducing the number of dimensions of S space.
By writing explicitly the minima conditions (8.14) for biases wk0 :
@E
@wk0
=
N
P
X
X
p=1
`=1
e
wk` '` (xp ) + wk0
!
; tkp = 0
('e0 (xp ) = 1, by construction of 'e0 ) the bias may be written as:
wk 0
❖ t k , '`
where:
= tk ;
N
X
`=1
wk` '`
8.2. THE LEAST SQUARES TECHNIQUE
141
~yk = ~tkk
wk1 '~ 1
'~ 1
'~ 0
S ('~ ; '~ )
0
1
wk0 '~ 0
a) '~ 0 and '~ 1 almost perpendicular
~yk = ~tkk
wk1 '~ 1
'~ 1
'~ 0
S ('~ ; '~ )
0
wk0 '~ 0
1
b) '~ 0 and '~ 1 almost parallel
Figure 8.7:
The solution ~yk = ~tkk in a bidimensional space S ('~ 0 ; '~ 1 ).
Figure a presents the case of nearly orthogonal set, gure
b presents the case of a nearly parallel set of f'~ i g functions. ~tkk and '~ 0 were kept the same in both cases
tk =
➥
P
1X
t
P
p=1
kp
and '` =
P
1X
'e` (xp )
P
p=1
i.e. the bias compensate the di erence between the mean of targeted output tk
and mean of the actual output | over the training set.
If the number of vectors in the training set P is equal with the number of inputs
N + 1 then the matrix is square and it have an inverse. By multiplying with
; T ;1
to the left in (8.15): W T = T ) W T = ;1 T . Geometrically
speaking ~tk 2 S and ~tk? = 0 | see gure 8.6, i.e. the network is capable to
learn perfectly the target and the error function is zero (after training).
If P 6 N + 1 then the ~tk vectors are included into a subspace of S and to
minimize the error to zero it us enough to make the projection of ~yk (into that
subspace) equal with ~tk . (a situation similar | mutatis mutandis | to that
represented in gure 8.6 on page 139 but with ~yk and ~tk swapping places).
This means that just a part of weights are to be adapted (found); the other ones
do not count (there are an in nity of solutions).
142
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
➥
8.2.3
To have P 6 N + 1 is not normally a good idea because the network acts more
as a memory rather that as a generalizer (the network have a strong tendency to
overadapt).
The solution developed in this section does not work if the neuron does not have
a linear activation, e.g. sigmoid activation function.
The Gradient Descent Solution
The neuronal activation function is supposed to be di erentiable and the error function may
be expressed as function of weights E = E (W ). Then an initial value for fwki g parameters
is chosen (usually weights are initialized randomly) and the parameters are modi ed by
small values in the direction of decrease of E , i.e. in the direction of ;rE (with respect to
weights), in small steps:
wki = w s
( +1)ki
❖ t,
learning rate
@E
; w s ki = ; @w
( )
;
ki W(s)
= const. 2 R
+
being the step of iteration (discrete time). is a positive constant called learning rate
and governs the speed by which the fwki g parameters are changed.
Obviously ;rE may be represented as a matrix of the same dimensions as W and then
the above equation may be written simply as:
W = W(t+1) ; W(t) = ;rE
(8.16)
t
and is known as the delta rule.
Usually the error function is expressed as a sum over the training set of a P error terms:
❖ Ep
E
P
= P Ep (W ), then the weight adjustment may be done in steps, for each training vector
p=1
in turn:
wki = w t
( +1)ki
✍
@Ep
; w t ki = ; @w
( )
ki W(t)
;
p
= 1; P
Remarks:
The above procedure is especially useful if the training set is not available from
start but rather the vectors are arriving as a time series.
➥ The learning parameter may be chosen to decrease in time, e.g. = t .
This procedure is very similar to the Robbins{Monro algorithm for nding the
root of derivative of E , i.e. the minima of E .
Assuming general linear discriminant (8.8) then considering the sum-of-squares error function (8.10):
➥
0
P X
K
1X
E (W ) =
2p k
=1 =1
"
N
X
i=0
w
eki '
ei
(xp ) ; tkp
#2
;
K
1X
E p (W ) =
2k
=1
"
N
X
i=0
w
eki '
ei
(xp ) ; tkp
#2
8.2. THE LEAST SQUARES TECHNIQUE
then:
@Ep
@wk`
=
"N
X
i=0
( ) ; tkp
143
#
or, in matrix notation:
( ) = [yk (xp ) ; tkp ] 'e` (xp )
'
e` xp
w
eki 'ei xp
rEp = [y(xp ) ; tp ] 'e T (xp )
where y(xp ) = W 'e (xp ). Then the delta rule2 (8.16) becomes:
delta rule
So far only networks with identity activation function were discussed. In general neurons
have a di erentiable activation function f (the perceptron being an exception) then total
input to neuron k is akp = W (k; :) 'e (xp ) and:
❖f
W = ;[W 'e (xp ) ; tp ] 'e T(xp )
ap
= W 'e (xp )
y(xp ) = f (ap )
;
The sum-of-squares error (8.9), for each training pattern p is:
( ) = 21 [f (ap ) ; tp ]T[f (ap ) ; tp]
Ep W
and then:
rEp = f[f (ap ) ; tp ]
( )g'e T (xp )
f 0 ap
where f is the total derivative of f .
❖ f0
0
Proof.
From the expression of Ep :
Ep (W ) =
and then:
1
2
K
X
f akp ) ; tkp
[ (
k=1
1
2
] =
2
K " X
N
X
f
k=1
i=1
!
wki 'ei (xp ) ; tkp
#2
@Ep
= [f (akp ) ; tkp ]f (akp ) '
e` (xp )
@wk`
which leads directly to the matrix formula above.
0
✍
Remarks:
➥
In the case of sigmoid function f (x) = 1+1e;x the derivative is:
df
( ) = dx
= f (x)[1 ; f (x)]
f0 x
In this case writing f in terms of f speeds up the calculation and save some
memory on digital simulations.
It is easily seen that the derivatives are \local", i.e. depend only on parameters
linked to the particular neuron in focus and do not depend on the values linked
to other neurons.
0
➥
2 This equation is also known as the least-mean-square (LMS) rule, the adaline rule and the Widrow-Ho
rule.
144
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
➥
The total derivative over whole training set may be found easily from:
P
@E = X
@Ep
@wk` p=1 @wk`
➧ 8.3
The Perceptron
The perceptron (or adaline 3) represents a single layer neural network with threshold activation function (see (8.7)). Usually the activation function is chosen as being odd:
(
for a > 0
f (a) = ;+11 for
a<0
8.3.1
The Error Function
e T'e (x)). Because the output of the
Considering just one neuron, its output is y(x) = f (w
e T'e > 0
single neuron is either +1 or ;1 then it may classify just a set of two classes: if w
e T 'e < 0, the output is ;1 and x 2 C2.
then the output is +1 and x 2 C1 , else w
e T 'e > 0, 8x; where t is the target value given the input
Then, for a correct classi cation, tw
e T 'e > 0 while t = ;1 or vice-versa, i.e.
vector x. For a misclassi ed input vector either w
;twe T 'e < 0.
A good choice for the error function will be:
E (w) = ;
❖
M
X
xp2M
twe T 'e (xp )
(8.17)
where M is the set of misclassi ed vectors xp .
✍
Remarks:
➥
8.3
3
e T'e (xp ) is proportional to
From the discussion in section 8.1.1 it follows that w
the distance from the misclassi ed vector 'e (xp ) to the decision boundary.
The process of minimizing the function (8.17) is equivalent to shifting the decision
boundary such that misclassi cation becomes minimum.
During the shifting process M changes as some previously misclassi ed vectors
becomes correctly classi ed and vice-versa.
See [Bis95] pp. 98{105.
From ADAptive LINear Element.
8.3. THE PERCEPTRON
145
e
C
'1
e
1
;t
=1
e
t'
w(1)
e
w(0)
e
'0
'e
C
Figure 8.8:
8.3.2
2
;t
= ;1
The learning process for perceptron. White circles are
from one class, black ones are from the other. Initially
e (0) and the pattern shown by 'e is misthe parameter is w
e (1) = we (0) ; 'e ; was chosen 1 and
classi ed. Then w
t = ;1. The decision boundary is always perpendicular
to w vector, see section 8.1.1. The other case of a misclassi ed xp 2 C1 is similar.
The Learning Procedure
The gradient descent solution (section 8.2.3) is used to nd the weight vector:
(
= ;0 t'e ( ) ifif 2is M
correctly classi
and then rE = ;t'e ( ) if 2 M or rE = b otherwise.
@Ep
@wi
xp
xp
xp
p
i xp
xp
p
ed
0
The delta rule (8.16) becomes:
e = e
w
w(t+1)
(
'e (x ) if x 2 M
; we = t
b0
if x is correctly classi ed
(t)
p
p
p
(8.18)
i.e. all training vectors are tested: if the xp is correctly classi ed then w is left unchanged,
otherwise it is \adapted" and the process is repeated until all vectors from the training set
are classi ed correctly. See gure 8.8.
8.3.3
Convergence of Learning
The error function (8.17) decreases by using the learning rule (8.18).
Proof.
The terms from E (8.17), after one learning step using (8.18), are:
T
T
T
'e
'e ; t2 k'e k2 < ;we (t)
'e = ;twe (t)
;twe (t+1)
146
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
where k'e k2 = 'e T 'e ; then E(t+1) < E(t) , i.e. E decreases.
w
❖ b
b such that:
Let consider a linearly separable problem. Then it exists a solution w
tp
wb T 'e (xp )
>
0
;
p
= 1; P
The process of updating w, using the delta rule (8.18) is convergent.
e (0) = 0b and
Proof. Let consider that the initial vector | in the above learning procedure | is chosen as w
let the learning parameter be = 1 (without any loss of generality); then:
e (t+1) = we (t) + tp 'e (xp )
w
where xp is a misclassi ed training vector. (see (8.18)). Then the weight vector may be written as:
e (t+1) =
w
❖ `
X
`
where ` is the number of misclassi cation of x` vector | note that while w changes, the decision boundary
changes and a training pattern vector may move from being correctly classi ed to being misclassi ed and
back. The sum is done over the training cycle | the training set may be used several times in any order.
b T to the left:
By multiplying with w
b T we =
w
❖
e
` t` '(x` )
X
`
b e
` t` wT '(x` ) >
X
`
!
` t`
b T 'e (x` )
min
w
`
such that the product is limited from below by a function linear in
b is constant).
limited from below as well (w
= P k tk | and thus we (t+1) is
k
On the other hand:
kwe (t+1) k2 = kwe (t) k2 + t2` k'e (x` )k2 + 2t` we (Tt) 'e (x` ) 6 kwe (t) k2 + k'e (x` )k2
Therefore:
and then:
e (t+1)
i.e. kw
kwe k2 = kwe (t+1) k2 ; kwe (t) k2 6 max
k'e (x` )k2
`
kwe (t+1) k2 6 max
k'e (x` )k2
`
k is limited from above by a function linear in p .
Considering both limitations (below by and above by p ) it follows that no matter how large t is, i.e. no
p
matter how many update steps are taken, have to be limited (because from below grows faster than
from above, during training) and then it means that at some stage ` becomes stationary for all ` 2 f1; P g
b was presumed to exists) all training vectors becomes correctly classi ed.
| thus (because w
✍
Remarks:
➥
➥
The learning algorithm is good at generalization as long as the training set is
statistically signi cant.
The perceptron may be successfully used only for linearly separable classes.
8.4. FISHER LINEAR DISCRIMINANT
➧ 8.4
8.4.1
147
Fisher Linear Discriminant
Two Classes Case
A very simple way to reduce the dimensionality it to apply a linear projection, into a unidimensional space, of the form:
y = wT x
(8.19)
where w is the vector of parameters chosen such as to maximize separability.
Let consider two classes and a training set containing P1 vectors of class C1 and P2 vectors
of class C2 . The mean vectors of class distribution are:
m1 = P1
1
X
xp 2C1
x
m2 = P1
and
p
2
X
xp 2C2
❖ P 1 , P2 ,
m1, m2
x
p
Then a natural choice for w would be such that it will maximize the distance between the
unidimensional projection of means, i.e. wT (m1 ; m2) | on the other hand this distance
may be arbitrary increased by increasing w; to avoid this a constraint on the size of w
should be imposed, e.g. a normalization: kwk2 = wT w = 1.
The Lagrange multiplier method is applied (see mathematical appendix) the Lagrange function (using the normalization on w) is:
L(w; ) = wT (m1 ; m2) + (kwk2 ; 1)
and the required solution is found from the conditions:
@L
@w = m1 ; m2 + 2w = 0
i
i
i
@L = X w2 ; 1 = 0
@
and
i
i
i
which gives w / m1 ; m2. However this solution is not generally good because it considers
only the relative positions of the distributions, not their form, e.g. for Gaussian distribution
this means the matrix . See gure 8.9 on the next page.
One way to measure the within class scatter, of the uni-dimensional projection of the data,
is
s2 =
k
X
xp 2Ck
2
= wT 4
y(x ) ; wT m
p
X
xp 2Ck
k
2
=
X
xp 2Ck
T
wT x ; wT m x w ; mTw
p
k
p
See [Bis95] pp. 105{112.
s
k
k
3
(x ; m ) (x ; m )T 5 w
p
k
p
k
and the total scatter, for two classes, would be s2total = s21 + s22 . Then a criteria to search
for w would be to minimize the scattering.
The Fisher technique takes the approach of maximizing the inverse of total scattering. The
8.4
❖
Fisher criterion
148
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
x2
C1
w
m1
C2
m2
y
x1
Figure 8.9: The unidimensional reduction in y space using the La-
grange multiplier: Gaussian distribution, two classes, two
dimensions. While the separation is better than a projection on x1 axis, it is worse than that one in the x2 axis
because the parameter was not considered.
Fisher criterion is de ned as:
2
T
w
m
wT (m1 ; m2)(m1 ; m2)Tw
1 ; w m2
J (w) =
=
2
2
s1 + s2
s21 + s22
; T
and it may be expressed also as:
J (w ) =
❖ Sb
where Sb is named between-class covariance matrix:
Sb
❖ Sw
wT Sbw
wTSw w
= (m1 ; m2)(m1 ; m2)T
and Sw is named within-class covariance matrix:
Sw
=
X
xp 2C1
(xp ; m1) (xp ; m1)T +
X
xp 2C2
(xp ; m2) (xp ; m2)T
The gradient of J with respect to w is zero at the desired maximum:
T
T
rJ = (wSw w )S(wb wS ;w(Tw)S2 b w )Sw w = 0
w
(8.20)
8.4. FISHER LINEAR DISCRIMINANT
149
wSw wT )Sbw = (wSb wT )Sw w
)
(
(8.21)
From (8.20) it gives that:
Sb w = (m1 ; m2 )(m1 ; m2)T w / (m1 ; m2 )
Because only the direction of
(8.22) into (8.21) gives:
(8.22)
w matters then any scalar terms may be dropped; replacing
w / Sw;1(m1 ; m2)
known as the Fisher discriminant.
8.4.2
Fisher
discriminant
Connections With The Least Squares Technique
For the particular case of transformation (8.19), the sum-of-squares error function (8.9)
becomes:
E
=
P
1 X;
2
p=1
wTxp + w0 ; tp2
The target values are chosen as follows:
ti
8
<P
=
P1
; PP2
:
if xp 2 C1
if xp 2 C2
(8.23)
where P1 is the number of training patterns in C1 and similar for P2 , obviously P1 + P2 = P .
The minima of E with respect to w and w0 is found by zeroing its derivatives:
rE =
@E
@w0
=
P
X
p=1
P
X
p=1
wT xp + w0 ; tp)xp = 0
(8.24a)
wT xp + w0 ; tp) = 0
(8.24b)
(
(
❖ P 1 , P2
P
The sum in (8.24b) may be split on two sums following the membership of xp : xp2C1
P
and xp2C2 ; from the particular choice (8.23) for tp and because P1 + P2 = P then:
w0
=
;wT m
where
m = P1
P
X
p=1
xp = PP1 m1 + PP2 m2
(8.25)
i.e. m represents the mean of x over the whole training set.
The sum from (8.24a) may be split in 4 terms | separate summation over each class,
❖
m
150
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
replacing w0 from (8.25) and replacing tp values from (8.23):
X
rE =
wTxp ; PP1 wT m1 ; PP xp ; PP2 wT m2
xp 2C1
1
X
xp 2C1
xp
(8.26)
X
P
P
2
T
xp = 0
w xp ; P w m2 + P xp ; PP1 wT m1
+
2
xp 2C2
xp 2C2
X
T
This relation reduces to:
Using the de nition of m1 2 , (8.26) becomes:
Proof.
;
X
xp 2C1
T
x
wT x )x ;
(
p
X
+
As w
Sw + P1PP2 Sb w = P (m1 ; m2)
xp 2C2
p
p
X P1
xp 2C1 P
w T x )x ;
(
p
p
(
wT m1)x ;
p
X P2
(
xp 2C2 P
wT m2)x
p
X P
xp 2C1 P1
x ; PP2 (wT m2 )P1 m1
X P
+
p
xp 2C2 P2
x ; PP1 (wT m1)P2 m2 = 0
p
xT w and the same for other w products:
X
X P1
X
x (xT w) ;
x (mT1w) ;
P
=
p
xp 2C1
+
T
p
X
xp 2C2
p
xp 2C1
x (xT w) ;
p
p
P
xp ; P1PP2 m1 (mT2 w)
P
xp 2C1 1
p
X P2
xp 2C2 P
x (mT2 w) +
p
and using again the de nitions of m1 2 :
X P
xp 2C2 P2
x ; P1PP2 m2(mT1 w) = 0
p
;
X
2
xp 2C1
+
x (xT w) ; PP1 m1(mT1 w) ; P m1 ; P1PP2 m1(mT2 w)
p
X
p
x (xT w) ; PP2 m2 (mT2w) + P m2 ; P1PP2 m2 (mT1 w) = 0
2
p
xp 2C2
p
As matrix multiplication is associative, w is a common factor. A 1 2 m1 mT1 +
and then subtracted to help form S (it's moved to the right of equality):
P P
P
P
m2mT2 is added
b
"
X
xp 2C1
+
x xT ;
p
X
xp 2C2
p
xx
p
T
p
P12
P
X
xp 2C1
m1 mT1 ; P1PP2 m1mT1
2
; PP2 m2mT2 ; P1PP2 m2 mT2
Terms 2, 3, 5, 6 reduces (P
"
P1 P2
=
x x ; P1 m1 m
p
T
p
#
w = P (m1 ; m2) ; P1PP2 S w
b
P1 + P2 ) to give:
T+
1
X
xp 2C2
#
x x ; P2 m2 m w = P (m1 ; m2) ; P1PP2 S w
p
T
p
T
2
and expanding m1 2 shows that the square parenthesis is S .
;
w
b
Because Sb w / (m1 ; m2) (see (8.22)) and only the direction of w counts then, by
dropping the irrelevant constant factors, the Fisher discriminant is obtained:
w / Sw;1(m1 ; m2)
8.4. FISHER LINEAR DISCRIMINANT
8.4.3
151
Multiple Classes Case
It is assumed that the dimensionality of pattern vector space is greater than the number of
classes, i.e. N > K .
A set of K transformations is considered
T
;
j = 1; K
, y(x) = W x
(8.27)
yk (x) = wk x
where the matrix W is build using wkT as rows.
Let Pk the number of training vectors (from the whole set) being of class Ck , and mk be
the mean vector of that class:
P
K
mk = P1
xp ; k = 1; K and m = P1 xp = P1 Pk mk
k xp2Ck
p=1
k=1
❖
W
❖
P
❖
P
❖
S
w
The total covariance matrix St is
P
K
T
T
(xp ; m)(xp ; m)
St =
(xp ; m)(xp ; m) =
p=1
k=1 xp2Ck
❖
S
t
and may be written as:
❖
S
b
X
X
X
PK
Pk is the total number of training vectors, and m is the mean over the
k=1
whole training set.
The generalization of within-class covariance matrix (8.20) is easily performed as:
K
X
X (x ; m )(x ; m )T
Sw =
Swk
where Swk =
p
k p
k
k=1
xp 2Ck
where
P
=
X
XX
t = Sw + Sb
S
where
b=
S
K
X
k=1
k (mk ; m)(mk ; m)T
P
where St , Sw and Sb are de ned in X pattern space.
Proof.
St =
=
K 2 X
X
4
3
1
0
X T
X A T
T
@
xp + Pk mm 5
xp m ; m
xp xp ;
K 2 X
X
4
3
xp xp ; Pk mk m ; mPk mk + Pk mm 5
k=1
k=1
xp 2Ck
xp 2Ck
T
xp 2Ck
xp 2Ck
T
T
T
T
By adding, and then subtracting, a term of the form Pk mk mTk , the Sb is formed and then:
St =
K 2 X
X
4
k=1
xp 2Ck
3
xp xTp ; Pk mk mTk 5 + Sb = Sw + Sb
k , mk
,m
152
❖
k ,
CHAPTER 8. SINGLE LAYER NEURAL NETWORKS
Similar matrices may be expresses in the Y output space.
Let k and be the mean over class Ck and, respectively, over all training set of output
y(xp ):
P
K
y(xp ) ; k = 1; K and = P1 y(xp ) = P1 Pk k
k = P1
k xp2Ck
p=1
k=1
X
X
X
The covariance matrices in Y space are:
K
T
[y(xp ) ; k ][y(xp ) ; k ]
S(Y )w =
k=1 xp2Ck
K
T
S(Y )b =
Pk [k ; ][k ; ]
k=1
XX
X
One possibility4 for the Fisher criterion is:
J (W )
;1
Y w S(Y )b ) = Tr(W Sw W
= Tr(S( )
T
b
WS W
T
)
(considering (8.27)).
✍
Remarks:
➥
b is a sum of K matrices, each of rank 1 | because it represents a product of
vectors. Also there is a relation between all mk given by the de nition of m.
Then the rank of Sb is K ; 1 at most, and, consequently it have only K ; 1
eigenvectors/values.
By the means of Fisher criterion it is possible to nd only K ; 1 transformations.
S
2
4 There
are several choices.
CHAPTER
9
Multi Layer Neural Networks
➧ 9.1
Feed-Forward Networks
Feedforward networks do not contain feedback connections. Also between the input units x
and the output units y there are (usually) some hidden units z . Let N be the dimension
(number of neurons) of input layer, H the dimension of hidden layer and K the one of
output layer See gure 9.1 on the following page.
Assuming that the weights for the hidden layer are fw(1) g =1 (characterizing the coni
k
j
ji
nection to z from
neurons is:
j
x ) and the activation function is f
i
z
=
j
f
X
N
1
w
(1)ji
x
❖
N, H, K
❖
w
(1)ji
, f1
❖
w
(2)ki
, f2
;H
j
i=0;N
1
then the output of the hidden
!
i
i=0
where x0 = 1 at all times | w 0 being the bias (characteristic to hidden neuron j ). Note
that there are no connections from x to z0 , they would be irrelevant as z0 represents the
bias and its output is 1 at all times (regardless of its input).
On similar grounds, let fw(2) g =1 be the weights of the output layer, and f2 its actij
i
kj
k
;K
j =0;H
vation function. Then the output of the output neuron is:
0
X
y =f @ w
H
H
k
2
j =0
9.1
1 0
X
z A=f @ w
(2)kj
j
2
j =0
See [Bis95] pp. 116{121.
153
!1
X
w x A
f
N
(2)ki 1
(1)ji
i=0
i
154
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
x0
z0
Input
x1
zH
z1
y1
Output
Figure 9.1:
xN
yK
The feedforward network architecture. Between the input
units xi and output units yk there are some hidden units
zj . x0 and z0 represents the bias.
The above formula may be easily generalized to a network with multiple hidden layers.
✍
Remarks:
➥
➥
➧ 9.2
While the network depicted in gure 9.1 appears to have 3 layers, there are in
fact only 2 processing layers: hidden and output. So this network will be said to
have 2 layers.
The input layer x plays the role of distributing the inputs to all neurons in subsequent layer, i.e. it plays the role of a sensory layer.
In general a network is of feedforward type if there is a possibility to label all
neurons (input, hidden and output) such that any neuron will receive inputs only
from those with lower number label.
By the above de nition, more general neural networks may be build (than the
one from gure 9.1).
Threshold Neurons
A threshold neuron have the activation function of the form:
f (a) =
(
1
0
if a > 0
if a < 0
and the a = 0 value may be assigned to either case.
9.2
See [Bis95] pp. 121{126.
(9.1)
9.2. THRESHOLD NEURONS
9.2.1
155
Binary Vectors
Let consider the case of binary pattern vectors, i.e. xi 2 f0; 1g, 8i. On the other end,
let the outputs be also binary. One output neuron will be considered, the discussion being
easily generalisable to multiple output neurons.
Then the problem is to model a Boolean function fB : f0; 1gN ! f0; 1g. The total number
of possible inputs is 2N . The function fB is totally de ned once the output value is given
for all input combinations.
Then the network may be build as follows:
The number of hidden neurons is equal to the number of input patterns which should
result in an output equal to 1.
The activation function for hidden neurons is de ned as:
(
1
if a > 0
f1 (a) =
0
if a 6 0
Each hidden neuron is set to
( be activated just by one pattern: For that patternN the
P
1
if xi = 1
weights are set up: wji =
;1 if xi = 0 and wj0 = 1 ; nx ; where nx = i=1 xi ,
i.e. is equal with the number of \ones" into the x pattern vector.
Then the total input to a hidden neuron is 1 nx + 1 (1 ; nx) = 1 and then the
output of the hidden neuron is 1 when presented with the \learned" pattern vector.
The total input is at most (nx ; 1) + (1 ; nx ) = 0 when presented with another
pattern vector (one xi component changed from 1 to 0 or vice-versa); such that the
output will be 0.
The activation function for the output neuron is:
(
f2 (a)
=
1
0
if a > 0
if a < 0
The weights to the output neuron are set to 1. The bias w(2)0 is set to ;1 such that
when a pattern for which the output should be 1 is presented to the net, the total
input in y is 0; otherwise is ;1 and thus the correct output is ensured at all times.
A vectorial output function may be split into components and each component may be
assigned to a output neuron.
✍
9.2.2
Remarks:
➥
While not very useful by itself (it does not have a generalization capability) the
above architecture illustrate the possible importance of singular neuron \ ring"
| used extensively in CPN and ART neural networks.
Continuous Vectors
The two propositions below assume a 2 class problem (either x 2 C1 or x 2 C2 ) and thus one
output neuron is enough for classi cation. The solution is easily extensible to multi-class
❖
f
B
156
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
z1
z
H
z2
y
Figure 9.2:
= 1^
z
:::
^ zH
In a two layer network the hidden layer may represent the
hyperplane decision boundaries and then the output layer
may do a logical AND to establish if the input vector is
within the decision region. Thus any convex decision area
may be represented by a two layer network.
problems by assigning each output neuron to a class.
Proposition 9.2.1. A two-layer neural network may have arbitrary convex decision boundary.
Proof. A single layer neural network (with one output) have a decision boundary which is a hyperplane.
Let the hidden layer represent the hyperplanes (see chapter \Single Layer Neural Networks").
Then the output layer may perform an logical AND between the outputs of the hidden layer to decide if
the pattern vector is inside the decision region or not | each output neuron representing a class. See
gure 9.2.
Proposition 9.2.2. A 3-layer neural network may have arbitrary decision boundary (it may
be non-convex and/or disjoint).
The pattern space is divided into suciently small hypercubes such that the decision region may be
approximated using them (i.e. each hypercube will be included either in the C1 decision region or in that of
C2 's). The decision area may be approximated with arbitrary precision by making the hypercubes smaller.
The neural network is built as follows: The rst hidden layer contains a group of 2N neurons for each
hypercube of the same one class (2 hyperplanes for each dimension to de ne the hypercube). The second
hidden layer contains N neurons who receive the input from the corresponding group of 2N neurons from
the rst hidden layer. The output layer receive its inputs from all N neurons from the second hidden layer.
The architecture of the network is depicted in gure 9.3 on the next page.
By the same method as described in the previous proposition a neuron from the second layer may decide if
the input pattern vector is inside the hypercube it represents.
An logical OR on the output layer, between the outputs of the second hidden layer, decides if the pattern
vector belongs to any of hypercubes represented by the rst layer and thus to the class selected; if not then
the input vector belongs to the other class.
Proof.
✍
Remarks:
➥
The network architecture described in proof of proposition 9.2.2 have the disadvantage that require large hidden layers.
9.3. SIGMOIDAL NEURONS
157
xN
x1
rst
hidden layer
second
hidden layer
y
Figure 9.3: The 3 layer neural network architecture for arbitrary de-
cision areas.
➧ 9.3
Sigmoidal Neurons
The possible sigmoidal activation functions are:
y
=
f (a)
=
1
;
1 + e ca
and
;ca
ca + e;ca
e
ca
y
=
f (a)
= tanh(a) =
e
;
e
where c = const..
2)+1
1
= 1+e;ca then using the tanh function instead of the logistic one
Because tanh(a=
2
is equivalent to apply a linear transformation ea = a=2 before and (again a linear transformation) y = 12 ye + 1 after the processing on neural layer (this is equivalent in a linear
transformation of weights and biases).
The tanh function have the advantage of being symmetrical with respect to the origin. See
gure 9.4 on the following page.
✍
9.3
Remarks:
➥
The output of a logistic neuron is limited to the interval [0; 1]. However the
logistic function is easily inversable and then the inverse f ;1 (y) = 1c ln 1;y y .
➥
In the vicinity of the origin the logistic function is almost linear and thus it can
approximate a linear neuron.
See [Bis95] pp. 126{132.
158
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
tanh(x)
1:0
c
= 10
c
c
=1
=3
;1 0
;2 0
:
x
2:0
:
Figure 9.4: The graph of tanh function for various c constants.
9.3.1
Three Layer Networks
A three layer network may approximate with any accuracy a smooth
function (mapping) X ! Y .
Proposition 9.3.1.
Proof.
The logistic activation function is:
y=
1
1 + exp(
;wT x ; w0 )
and is represents the output of a rst-layer neuron. See gure 9.5{a .
By making linear combinations, i.e. when entering the second neuronal layer, it is possible to get a function
with an absolute maximum. See gure 9.5{b and 9.5{c.
By applying the sigmoidal function on the second layer, i.e. when exiting the second layer, it is possible to
get just a localized output, i.e. just the maximum of the linear combination. See gure 9.5{d.
The third layer may combine the localized outputs of the second layer to perform the approximation of any
smooth function | it should have a linear activation function.
9.3.2
Two Layer Networks
Proposition 9.3.2. A two layer neuronal network can approximate, arbitrary well, any function (mapping) X ! Y , provided that X and Y spaces are nite-dimensional and there are
enough hidden neurons.
Proof.
X ci
Any function may be decompose into a Fourier series:
y(x1 ; : : : ; xN ) =
=
i1
x ; : : : ; xN ) cos(i1 x1 ) = : : :
1( 2
X X Ci iN YN
i1
iN
1
`=1
i` x` )
cos(
(by developing in series all c parameters).
Any product of 2 cosines may be transformed into a sum, using the formula cos cos = 12 cos( + ) +
1
2 cos( ; ). Then by applying this procedure N ; 1 times, the product from the equation above may be
changed to a sum of cosines (the values of the angles and the constants don't have to be speci ed for the
proof).
9.3. SIGMOIDAL NEURONS
159
f1
1:0
1:0
0:5
0:5
0:0
10
0
a1
1:0
f2
;10
;10 0:0
0
10
a2
a)
10
a1
f3
0:0
10
0
;10
a1
Figure 9.5:
10
0
;10
a2
b)
0:5
0:0
10
0
;10
a1
c)
d)
Two dimensional space. Figure a) shows the sigmoidal
function f1 = 1+ ;1 1 ; 2 . Figure b) shows the linear combination f2 = 1+ ; 11 ; 2 ;5 ; 1+ ; 11 ; 2 +5 | they are displaced relatively one each other. Figure c) shows a linear
combination of 4 sigmoidal functions f3 = 1+ ;015; 2 ;5 ;
05
05
05
1+ ; 1 ; 2 +5 + 1+ ; 1 + 2 ;5 ; 1+ ; 1 + 2 +5 | the second
pair is rotated by =2. Figure d) shows the output after
applying the sigmoidal function again f4 = 1+ 141 3 +1 6 |
only the central maximum remains.
e
a
a
e
a
a
e
a
a
e
e
a
:
a
e
a
:
a
;10
a2
f4
1:0
0:5
0
10
0
;10
e
a
:
a
:
a
a
e
f
:
10
0
;10
a2
160
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
f
f (xi+1 )
f (xi )
x0 xi xi+1
Figure 9.6:
The approximation of a function by S steps.
Using the Heaviside step function:
H (x) =
any function may be approximated as:
f (x) ' f (x0 ) +
❖
S
xS x
S
X
i=1
(
1
0
if x > 0
if x < 0
f xi ) ; f (xi;1 )] H (x ; xi )
[ (
where S is the number of steps by which the function was approximated. See gure 9.6.
Then the network is build/performs as follows:
The rst layer have threshold activation functions and calculate the values of cosine functions.
The second layer performs the linear combination from the Fourier series development.
➧ 9.4
Weight-Space Symmetry
Considering the tanh activation function then by changing the sign on all weights and the
bias the output of neuron have reverted sign (tanh is symmetrical with respect to origin,
see also gure 9.4 on page 158).
If the network have two layers and the weights and biases on the second layer have also the
signs reverted then the nal output of network remains unchanged.
Then the two sets of weights and biases leads to same result, i.e. there is a symmetry of
the weights towards origin.
Assuming a 2-layer network with H hidden neurons then the total number of weights and
biases sets which gives the same nal result is 2H
Also, if all weights and the bias are interchanged between two neurons on one layer then
the output of the next layer remains unchanged (assuming that the layers are fully interconnected).
9.4
See [Bis95] pg. 133.
9.5. HIGHER-ORDER NEURONAL NETWORKS
161
There are H ! such possible combinations on the hidden layer.
Finally there are H !2H sets of weights witch gives the same output on a 2-layer network.
On more complex networks there may be even more symmetries.
The symmetry in weights leads directly to a symmetry in error function and then the error
will have several equivalent minima. Eventually this means that the minima point of error
may be much closer than it looks at rst sight, regardless of the starting point, usually
randomly selected.
➧ 9.5
Higher-Order Neuronal Networks
The neurons studied so far performed a linear combination of their inputs before applying
the activation function:
N
ftotal inputgj = aj = wji xi then foutputgj = yj = f (aj )
i=0
(or in matrix notation: a = W x and y = f (a)).
It is possible to design a neuron which performs a higher-degree combination of its inputs,
e.g. a second-order:
N N
N
wji` xi x`
ftotal inputgj = wj0 + wji xi +
i=1 `=1
i=1
X
X
✍
➧ 9.6
9.6.1
XX
Remarks:
➥
The main diculty in dealing with such neurons consists in the tremendous increase in the number of W parameters.
A rst-ordered neuron will have N + 1 parameters while a second order neuron
will have N 2 + N + 1 parameters and for N 1 ) N 2 N .
On the other hand higher-order neurons may be build to be invariant to some
transformations of the pattern space, e.g. translations, rotation and scaling. This
property may make them usable into the rst layer of network.
Backpropagation Algorithm
Error Backpropagation
It is assumed that the error function may be written as a sum over all training vector patterns
P
E = Ep
p=1
X
9.5
See [Bis95] pp. 133{135.
9.6
See [Bis95] pp. 140{148.
162
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
and also that the error function E = E (y) is di erentiable with respect to the output
variables. Then just one vector pattern is considered at a time.
The output of network is calculated by forward propagation from:
p
X
p
wji zi
and
zj = f (aj )
a = W (j; :) z
and
z = f (a)
aj =
(9.2)
i
❖ aj
where a is the weighted sum of all inputs entering neuron z , z being here any neuron,
hidden or output.
Then E also depends on w trough a and then the derivative of E with respect to w
is
j
j
p
ji
@Ep
@wji
=
j
p
@Ep @aj
=
@aj @wji
rE
❖ j ,
@Ep
@aj
raE
=
p
p
)
zi = j zi
zT
ji
(9.3)
zT
=
where jp is named error | it's a factor determinant in weight adjusting. (as shown
below); ra E , note that here rE represents just the part linked to W from the whole
error gradient
z for all layers is found trough a forward propagation trough the network. is found by
backpropagation (from output to input) as follows:
For the output layer: E = E (f (a)) and then:
@E
j
@a
p
p
p
out;k
p
@E
@a
p
@Ep df (ak )
=
@yk
k
out
❖ f0
i
=
dak
ry E
=
@Ep
@yk
(9.4)
f 0 (ak )
f 0 (a)
p
where f is the total derivative of activation function f .
For other layers: neuron z a ects the error E trough all other neurons to which it
sends its output:
0
j
j =
@Ep
@aj
=
X
`
p
@Ep @a`
@a` @aj
=
X
next;`
`
@a`
(9.5)
@aj
where the sum is done over all neurons to which z send connections (next `p )
and from the expression of a and de nition of nally the back-propagation formula:
@E
❖ next;` , next
`
j = f 0 (aj )
X
j
;`
@a
`
w`j next;`
(9.6)
`
by the means of which the derivatives may be calculated backwards | from the output
layer to the input layer. In matrix notation:
= f (a)
0
(W
T
next )
By knowing the derivatives of E with respect to the weights, the W parameters may be
adjusted in the direction of minimizing the error, using the delta rule.
9.6. BACKPROPAGATION ALGORITHM
163
9.6.2 Application: Sigmoidal Neurons and Sum-of-squares Error
Let consider a network having logistic activation function for its neurons and the sum-ofsquares error function.
Then the derivative of the activation function is:
( ) = 1 +1e;x )
df
f x
dx
= f (x)(1 ; f (x))
and the sum-of-squares error function for pattern vector xp is:
1 K [yk (xp ) ; tpk ]2 = 1 [y(xp ) ; tp]T [y(xp ) ; tp ]
Ep =
2 k=1
2
X
(9.7)
(9.8)
(it is assumed that yk and tpk are the components of the respective vectors y and tp , when
the network is presented with the pattern vector xp ).
From (9.8), (9.4) and (9.7), for output layer:
out
b
[b ; ]
= y(xp ) [1 ; y(xp )] [y(xp ) ; tp ]
and similar, for the hidden layers:
=z
1
z
(W T )
next
where the errors are calculated backwards starting with the hidden layer closest to the
output and ending with the rst hidden layer.
The error gradient (the part linked to W ) rEp is:
rEp = zT
To minimize the error, the weights have to be changed in direction contrary to that pointed
by the gradient vector, i.e. the amount will be:
W / ;rEp = ;rEp
where governs the overall speed of weight adaptation, i.e. the speed of learning, and thus
is named learning constant. i is a determining factor in weight change and thus is named
error . The above equation represents the delta rule.
✍
Remarks:
➥
The choice of order regarding training pattern vectors is optional:
Consider one at a time (including random selection for the next one).
Consider all together and then change the weights with the sum of all
individual weight adjustments
P
W = ; rEp
p=1
X
This represents the batch backpropagation.
learning constant
164
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
Any pattern vector may be selected multiple times for training/weight ad-
➥
➥
❖ W,
O
justing.
Any of the above in any order.
have to be suciently large to avoid the trap of local minima of E but on the
other hand it have to be suciently small such that the \absolute" minima will
not be jumped over.
In numerical simulation the most computational intensive terms are the matrix
operations found in (9.6). Then the computational time is proportional with
the number of weights W : O(W ) | where O is a linear function in W |
the number of neurons being usually much lower (unless there is a very sparse
network). Because the algorithm require in fact one forward and one backward
propagation trough the net the dependency is O(2W ).
On the other hand to explicitly calculate the derivatives of E would require
one pass trough the net for each one and then the computational time will be
proportional with O(W 2 ). The importance of backpropagation algorithm resides
in the fact that reduces computational time from O(W 2 ) to O(W ). However
the classical way of calculating rE by perturbing each w by a small amount
" & 0 is still good for checking the correctness of a digital implementation on a
particular system:
p
p
@Ep
@wji
' E (w
p
ji
ji
+ ")
; E (w ; ")
p
ji
2"
Another approach would be to calculate the derivatives:
@Ep
@aj
' E (a
p
j
+ ")
;E
; ")
aj
p(
2"
This approach still needs two steps for each neuron and then the computational
time is proportional with O(2M W ), where assuming M is the total number of
neurons.
Note that because the derivative is calculated with the aid of two values centered
around w , respectively a , the terms O(") are canceled (the bigger non-zero
terms neglected in the above approximations are O("2 )).
ji
➧ 9.7
j
Jacobian Matrix
The following matrix:
J
Jacobian
@y
k
@xi
=1;K
i=1;N
k
0
B
=@
@y1
@ x1
..
.
@y
K
@ x1
@x
@x
...
@ y1
..
.
N
K
N
@y
1
CA
is named the Jacobian matrix and it provides a measure of the local sensitivity of the network
9.7
See [Bis95] pp. 148{150.
9.7. JACOBIAN MATRIX
165
output to a change in the inputs, i.e. for small perturbation in input vector x x the
perturbation in the output vector is:
y ' J x
Considering the rst layer to which inputs send connections (i.e. the rst hidden layer):
Jki
k
=
= @y
@xi
X @yk @a` X @yk
@a` @xi
`
=
@a`
`
w`i
(9.9)
(the sum is made over all neurons to which inputs send connections). To use a matrix
notation, the following matrix is de ned:
ra y
@y
k
@a`
where a refers to a particular layer.
k
The derivatives @y
@a` are found by a procedure similar to the backpropagation algorithm.
A perturbation/change in a` is propagated trough all neurons to which neuron ` send
connections (its output), then:
@yk
@a`
and, because aq =
Pw
s
qs zs
=
X @yk @aq
q
@aq @a`
= P wqs f (as) then:
s
@yk
@a`
= f (a` )
0
X @yk
q
@aq
(9.10)
wq`
k
i.e. the derivatives @y
@a` may be evaluated in terms of the same derivatives of the next layer.
For the output layer:
@yk
@a`
= k` dfda(ak )
k
where a` here are those received by the output neurons (k` is the Kronecker symbol),
@yk
and the matrix ra y have just one non-zero
i.e. all partial derivatives are 0 except @a
k
element per each row/column.
For other layers: the derivatives are calculated backward using (9.10), then:
ra y = [1b f (aT )] [ra
0
next
yW]
When the rst layer is reached then the Jacobian is found from (9.9).
✍
Remarks:
➥
The same remarks as for backpropagation algorithm regarding the computing
time (see section 9.9) applies here.
❖
ra y
166
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
➧ 9.8
Hessian Tensor
The elements of the Hessian tensor are the second derivatives of the error function with
respect to weights:
H
✍
=
@2E
@wji @wk`
jikl
Remarks:
➥
➥
The Hessian is used in several non-linear learning algorithms by analyzing the
surface E = E (W ); the inverse of the Hessian is used to determine the least
signi cant weights in order to \prune" the network and it is also used to nd the
output uncertainties.
Considering that there are W weights then the Hessian have W 2 elements and
then the computational time is at least of O(W 2 ) order.
The error function is considered additive with respect to the set of training pattern vectors.
9.8.1 Diagonal Approximation
Here the Hessian is calculated by considering it diagonal, then only the diagonal elements
@2E
@w2 have to be found (all others are zero).
ji
The error function is considered
P a sum of elements over the training set as in the backpropagation algorithm: E = Ep .
p
From the expression of aj , in (9.2), the operator
@
@wji
=
@ @aj
@aj @wji
@
@wji
=
may be written as:
@
zi
@aj
and, because zi does not depend on wji then:
@2
2
@wji
=
@2 2
z
@a2j i
and the diagonal elements of the Hessian for pattern vector xp becomes
From (9.6) (and on similar grounds):
@ 2 Ep
@a2j
9.8
=
@Ep
d2 f (aj ) X
w`j
da2j
@a`
`
See [Bis95] pp. 150{160.
+
df (a ) 2 X X
j
daj
`
`0
w`j w`0 j
@ 2 Ep 2
@ 2 Ep
2 = @a2 zi .
@wji
j
@ 2 Ep
@a` @a`0
9.8. HESSIAN TENSOR
P
167
@ Ep @a`
p
0
(where @a@ j @E
@a` =
@a` @a` @aj ) the sum over ` and ` being done over all neurons to
`
which neuron i is sending connections. In matrix notation it may be written directly as:
2
0
0
0
ra ra )Ep = f 00 (a)
(W
(
T
next ) + [f 0 (a)
f 0 (a)]
By neglecting the non-diagonal terms:
@ 2 Ep
@a2j
@Ep
d2 f (aj ) X
w`j
2
daj
@a`
`
+
(W
df (aj )
daj
T
2 X
`
W T ranext )Ep
ra
next
@ 2 Ep
@ 2 a`
2
w`j
or in matrix notation:
ra ra )Ep = f 00 (a)
(W
(
0
T
next )
f 0 (a)]
+ [f (a)
W
2T
ra
(
next
ra
Ep
next )
and thus, the computational time is reduced from O(W 2 ) to O(W ) (by neglecting the o
2E
p
, ` 6= `0 ). Note however that in practice the Hessian is quite far from
diagonal terms @a@` @a
`
being diagonal.
0
9.8.2 Outer Product Approximation
Considering the sum-of-squares error function for pattern vector xp :
Ep
=
1
2
K
X
s=1
[ys (xp )
; tps ]2 = 21 [y(xp ) ; tp ]T [y(xp ) ; tp ]
then the Hessian is calculated immediately as:
@ 2 Ep
@wji @wk`
=
K
X
@ys @ys
s=1
@wji @wk`
+
K
X
s=1
(ys
2
ys
; ts ) @w@ @w
ji
k`
Considering a well trained network and the amount of noise small then the terms ys ; ts
have to be small and may be neglected; then:
@ 2 Ep
@wji @wk`
and
@ys
@wji
K
X
@ys @ys
s=1
@wji @wk`
may be found by backpropagation procedure.
9.8.3 Inverse Hessian
Let consider the Nabla vectorial operator with respect to the weights r =
the Hessian may be written as a square W W matrix
HP
=
P
X
p=1
r rT Ep
n
@
@wji
o
ji
then
168
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
(P being the number of vectors in the training set).
By adding a new training vector:
HP +1 = HP +
r rT EP +1
A similar recurrent formula may be developed for the inverse Hessian:
1 = H ;1
HP;+1
P
Proof.
;1 T EP +1 H ;1
P
; H1 P+ rrr
T H ;1 rE
P
P
+1
Considering 3 matrices A, B and C , A inversable, then it is true that:
;1 = A;1 ; A;1 B (I + CA;1 B );1 CA;1
(A + BC )
(the identity may be veri ed by multiplication by (A + BC ) to the left respectively to the right).
By putting HP = A, r = B and rT EP = C then rT HP;1 rEP +1 have dimension 1 (as the product
between a row and column matrix) and the required formula is obtained.
Using the above formula and starting with H0 / I the Hessian may be computed by just
one pass trough all training set.
9.8.4 Finite Di erences
Another possibility to nd the Hessian is by applying small perturbations to weights and
calculate the error; then:
1
@2E
=
[E (wji + "; wk` + ")
@wji @wk`
4"2
; E (wji ; "; wk` + ")
(9.11)
; E (wji + "; wk` ; ") + E (wji ; "; wk` ; ")] + O("2 )
' 41"2 [E (wji + "; wk` + ") ; E (wji ; "; wk` + ")
; E (wji + "; wk` ; ") + E (wji ; "; wk` ; ")]
where 0 . " 1. By choosing an interval centered around
cancels one each other.
Another approach is to use the gradient rE :
@2E
1
=
@wji @wk`
2"
' 2"
1
✍
Remarks:
➥
"
"
@E
@wji
@E
@wji
(wk` + ")
wji "
+
+
wk` "
;
@E
@wji
;
@E
@wji
(wij ; wk` )
#
(wk` + ")
wji "
;
+
the terms O(")
O("2 )
(9.12)
#
;
wk` "
(9.13)
The \brute force" attack used in (9.11) require O(4W ) computing time for each
Hessian element (one pass for each of four E terms), i.e. O(4W 3 ) computing
time for the whole tensor. However it represent a good way of checking other
algorithm implementations.
9.8. HESSIAN TENSOR
➥
9.8.5
169
The second approach, used in (9.11) require O(2W ) computing time to calculate
the gradient and thus O(2W 2 ) computing time for the whole tensor.
Exact Hessian
From (9.3):
@Ep
@wk` = k z`
@ 2 Ep
@wji @wk`
=
and then by di erentiating again:
@
@E
@aj
@aj
@wk`
@wji
p
=
@
@E
@aj
@wk`
p
zi = zi
@ (k z` )
@aj
(9.14)
(using also (9.2)).
Because z` = f (a` ) then:
@ 2 Ep
= zi k f (a` )h`j + zi z` bkj
0
@wji @wk`
where:
❖ h`j , bkj
h`j
@a`
@a
and
j
bkj
@k
@a
j
The h`j coecients
a`
depends over all as from neurons s which connects to neuron `, then:
h`j =
and (as a` =
P
s
w`s f (as ))
X @a` @as
s
@as @aj
(9.15)
then:
h`j =
X
s
w`s f 0 (as )hsj
i.e. the h`j coecients may be calculated in terms of the previous ones till the rst layer
which receive directly the input and for which the coecients do not have to be calculated.
✍
Remarks:
➥
➥
For the same neuron obviously h`` = 1.
For two di erent neurons, by continuing the development in (9.15): if a forward
propagation path can't be established from neuron ` to neuron j , then a` and aj
are independent and, consequently h`j = 0
The bkj coecients
The bkj are calculated using a backpropagation method.
The neuron k a ects the error trough all neurons s to which it sends connections:
k = f 0 (ak )
X
s
wsk s
170
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
(see also the backpropagation formula (9.6)). Then:
f 0 (ak )
@
bkj =
@aj
s
X
00
= f (ak )hkj
❖ f 00
X
s
!
(9.16)
wsk s
wsk s + f 0 (ak )
X
s
wsk bsj
where f 00 (a) = ddaf2 is the second derivative of the activation function.
2
✍
Remarks:
➥
The derivative @a@ proceed from the derivative @w@ , see (9.14). Its application is
correct only as long wji does not appear explicitly in the expression to be derived
in (9.16).
The weight wji represents the connection from neuron i to neuron j while the
sum in (9.16) is done over all neurons to which neuron j send connections to.
Thus | according to the feedforward network de nition | wji can't be among
the set of weights wsk and the formula is correct.
j
ji
For the s neuron on output layer:
s =
@Ep
@as
0
= f (as )
@Ep
@ys
(as ys = f (as )) and:
bss =
@s
@as
=
@
f 0 (as )
@as
00
@Ep
00
= f ( as )
@ys
@
@as
$
@
@ys
@ys
+ f (as )
= f ( as ) 1 + f
(because
00
@Ep
00
= f (as )
das @Ep
dys @ys
;1 0 (ys ) @Ep
@ys
@Ep
@ys
0
+ f (as )
0
+ f (as )
0
+ f (as )
@ @Ep
@ys @as
@ 2 Ep
@ys2
@ 2 Ep
@ys2
switch places, using the expression of s above and as = f ;1 (ys )).
For all other layers the bkj coecients may be found from (9.16) by backpropagation and
by considering the condition wjj = 0, i.e. into a feedforward network there is no connection
from a neuron to itself.
9.8.6
Multiplication with Hessian
Considering the Hessian as a matrix r rT Ep (see section 9.8.3) the problem is to nd an
easy way of multiplying it with a vector v having W components:
vT H
vT (rrT )Ep
9.8. HESSIAN TENSOR
171
Using the nite di erence method on a interval centered around the current set of weights W
(W is seen here as a vector ):
vT r(rT Ep ) =
1
2"
r Ep (W + "v) ; r Ep (W ; "v)
T
T
+
O(" )
2
where 0 . " 1.
Application.
Let consider a two layer feedforward neural network with the sum-of-squares error function.
Then:
a = W(1) xp ; z = f (a) and y = W(2) z
Let de ne the operator R() = vT r, then R(W ) = v (R is a scalar operator, when applied
to the components of W the results are the components of v as rW = 1b). Note that W
represents the total set of weights, i.e. W(1) [ W(2) ; also the v vector may be represented,
here, in the form of two matrices v(1) and v(2) , the same way W is represented by W(1)
and W(2) .
By applying R to a, z and y:
R(a) = v x
R(z) = f (a)R(a)
R(y) = W R(z) + v
❖ a, z, y
❖
R()
❖ v(1) , v(2)
(1)
0
(2)
(2)
z
From the sum-of-squares error function:
(2)
=
f (aout )
(1)
=
f 0 (a)
[y
0
❖ aout
; tp ] for output layer, aout = W
(2) )
=
f 00 (aout )
(1) )
=
f 00 (a)
R(aout )
R(a)
z
for hidden layer
W(2) (2)
(see (9.4), (9.6) and (9.8)).
By applying again the R operator:
R(
R(
(2)
(y
; tp ) + f (a
0
W(2) (2) + f 0 (a)
(2) )
R(y)
v(2) (2) + f (a)
0
Finally, from (9.3):
rW
rW
(2)
Ep = (2) zT
for output layer
(1)
Ep = (1) xT
for hidden layer
and then the components of vT H vector are found trough:
(2)
Ep =
R(
(2) )
zT + (2) R(zT ) for output layer
(2)
Ep =
R(
(1) )
xT
;
R rW
;
R rW
R
W(2) ( (2) )
for hidden layer
172
CHAPTER 9. MULTI LAYER NEURAL NETWORKS
✍
Remarks:
➥
The above
result may also be used to nd the Hessian, by making v of the form
;
T
v = 0 : : : 0 1 0 : : : 0 ; and then the expression vT R(Ep ) gives a
row of the Hessian \matrix".
CHAPTER
10
Radial Basis Function Networks
➧ 10.1
Exact Interpolation
For a unidimensional output network let consider a set of training vectors fx g =1 , the
corresponding set of targets ft g =1 and a function h : X ! Y which tries to map
exactly the training set to the targets such that:
(10.1)
h(x ) = t ; p = 1; P
p
p
p
p
;P
;P
p
p
It is assumed that the h(x) function may be found by the means of a linear combination of
a set of P basis functions of the form '(kx ; x k):
p
h(x) =
X
P
w '(kx ; x k)
p
p
=1
(10.2)
p
By building the symmetrical matrix:
0 '(kx ; x k) : : : '(kx
1
1
..
...
=B
@
.
'(kx1 ; x k) : : : '(kx
P
P
P
❖
1
; x1 k)
CA
..
.
; x k)
P
then from (10.1) and (10.2) it follows that:
;
~ T = w1 : : : w
where w
10.1
~T
w
P
and ~tT = ;t
1
= ~tT
::: t .
P
See [Bis95] pp. 164{167.
173
(10.3)
~ , ~t
❖w
174
CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS
~ set of parameters is found immediately if the square matrix is inversable (which
The w
usually is); the solution being:
~ = ~tT ;1
w
Examples of basis functions are:
Gaussian function: '(x) = exp ; 2x2
'(x) = (x2 + 2 ); , > 0
'(x) = x2 ln x
2
p
Multi-quadratic function: '(x) = x2 + 2
Cubic and linear functions: '(x) = x3 , '(x) = x
For the multidimensional network output the established relations are immediately extendible
as follows:
hk (xp ) = tkp ; p = 1; P ; k = 1; K
hk (x) =
❖
h
;
P
X
p=1
wkp '(kx ; xp k) ; k = 1; K
~ T as rows and
Let h h1 : : : hK then h(xp ) = tp and also by building W using all w
T using all ~tT again as rows then W = T and thus W = T ;1 (assuming inversable,
of course).
➧ 10.2
Radial Basis Function Networks
The radial basis function network is built by considering the basis function as an neuronal
activation function and the wkp parameters as weights.
✍
Remarks:
To perform a exact interpolation of the training set is not only unnecessary but
a bad thing as well | see the course of dimensionality problem. For this reason
some modi cations are made.
When using the basis functions in neural networks, the following changes are performed:
➥
The number H of basis functions do not need to be equal to the number P of training
vectors | usually is much less. So fwkp g ! fwkj g.
The basis functions do not have to be centered around the training vectors.
The basis functions may have themselves some tunable (during training) parameters.
There may be a bias parameter.
10.2
See [Bis95] pp. 167{169.
10.3. RELATION TO OTHER THEORIES
x1
175
x2
xN
'1
'0
y1
'H
y2
yK
Figure 10.1: The radial basis function network architecture. On the
hidden layer each neuron have a radial basis activation
function. The activation function of the output layer is
the identity. For bias
'0 = 1.
Then the output of the network looks like:
H
X
yk (x) = wkj 'j (x) + wk0
j=1
;
,
f'
e (x)
( )=W
y x
(10.4)
f holds both weights and bias.
where 'e T '0 : : : 'K and W
✍
f
e, W
❖'
Remarks:
➥
If the basis functions are of Gaussian type the they are of the form:
"
(x ; j )T ;j 1 (x ; j )
'j (x) = exp ;
2
#
where j is a covariant symmetrical matrix.
The model may be represented in the form of a two layer network where the hidden neurons
have the basis functions as activation function. Note that the weight matrix from input to
hidden layer is e1 and the output layer have the identity activation function. See gure 10.1.
The basis function associated with bias is the constant function '0 (x) = 1.
➧ 10.3
10.3.1
Relation to Other Theories
Relation to Regularization Theory
A way to control the smoothing of the model (in order to avoid large oscillations in the
output after a small variation of input) | and thus to control the complexity of the model
10.3
See [Bis95] pp. 171{179.
176
CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS
| would be trough an additional term in the error function which penalizes un-smooth
network mappings.
For a unidimensional output, the sum-of-squares error function will be written as:
1
E=
❖ ,
P
❖ D
Z
P
p
; t ]2 + 2 jP yj2dx
[y (xp )
(10.5)
p
=1
X
where is a constant named regularization parameter and P is a di erential operator such
that large curvatures of y(x) gives rise to large values of jP yj2 .
By replacing y(x ) with y(x ) (x ; x ) | where is the Dirac function | the error
function (10.5) may be expressed as a functional of y(x) (see the mathematical appendix).
The condition of stationarity for E is to set to 0 its derivative with respect to y(x):
p
p
E
y
Euler-Lagrange
equation
Green's functions
2
X
X
D
p
P
=
[y (xp )
=1
D
;t ]
D(
p
x
b y(x) = 0
; xp ) + PP
(10.6)
p
where Pb is the adjoint di erential operator of P (see also the mathematical appendix).
(10.6) is named the Euler-Lagrange equation.
The solutions to (10.6) are found in terms of the Green's functions G of the operator P
which are de ned as being the solutions to the equation:
b G(x; x ) =
PP
0
x
D(
;x )
0
and then the solutions y(x) are searched in the form:
X
P
y (x) =
(10.7)
wp G(x; xp )
=1
p
where fw g
p
p
=1;P
are parameters found by replacing solution (10.7) back into (10.6), giving:
X
P
p
=1
[y (xp )
;t
p ] D (x
X
P
;x
p) +
p
wp D (x
=1
;x
p)
and by integrating around a small enough vicinity of x (small enough such that it will not
contain any other x ), for all p = 1; P :
p
p0
y (xp )
;t
p
+ wp = 0
p = 1; P
(because of the way is de ned).
By replacing the solution (10.7) in the above equation and considering the G matrix and
~ and ~t vectors:
w
D
❖G
(10.8)
0 G(x ; x )
B 1... 1 . . .
G=@
G(x1 ; xP )
G(xP ; x1 )
..
.
1
CA
G(xP ; xP )
10.3. RELATION TO OTHER THEORIES
177
nally (10.8) becomes:
~ T (G + I ) = ~tT
w
By comparing (10.4) and (10.7) it becomes clear that the Green's functions plays the same
role as the basis functions (also the above equation may be compared to (10.3)).
✍
Remarks:
➥
If the P operator is invariant to translation and rotation then G functions are
dependent only on the norm of x: G(x; x ) = G(kx ; x k).
0
10.3.2
0
Relation to Interpolation Theory
Let consider a mapping h : X ! Y . A noise is added to input x. Let p(x) be the probability
density of input and p( ) the probability density of the noise.
Considering the sum-of-squares error function:
e
E=
1
2
ZZ
[y (x + )
X
; h(x)]2 p(x) pe( ) dxd
where h(x) is the desired output (target) for x + .
By making the change of variable x + = z :
E=
1
2
ZZ
[y ( z )
X
❖z
; h(x)]2 p(x) pe(z ; x) dxdz
and the y(x) is found by setting E functional derivative (see the mathematical appendix)
with respect to y(z ) to 0:
E
y
=
Z
X
[y (z )
; h(x)] p(x) pe(z ; x) dx = 0 )
R h x p x pe z ; x dx
X R
p x pe z ; x dx
X
( ) ( ) (
y (z ) =
( ) (
)
)
Considering a suciently large set of training vectors then the integrals may be approximated
by sums:
y (z ) =
XP h x pe x ; xp
p P
P pe x ; xq
p
(
=1
e
❖ p, p
)
(
q=1
)
(
)
and by comparing to (10.4) it becomes obvious that the above expression of y(x) represents
an expansion in basis functions form.
178
CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS
10.3.3
Relation to Kernel Based Method
Considering a set of training data fx ; t g =1 then the estimated probability density of
a pair fx; tg, assuming an exponential smoothing kernel function, is1 :
p
pe(x; t) =
1
P
P
X
p=1
p
p
;P
1
(2L2 )(N +K )=2
exp
; kx ; x k 2L+ kt ; t k
p
2
p
2
2
(10.9)
where L is the length of the hypercube side (N being the dimensionality of the input X
space and K the dimensionality of the Y output space).
The optimal mapping y(x) is given by the average over the desired output conditioned by
the input:
❖L
Z
y(x) =
R
j
t p(t x) dt = R
t p(x; t) dt
Y
Y
p(x; t) dt
Y
R
(by using also the Bayes theorem; p(x; t) dt = p(x)).
Y
By replacing the p(x; t) with the value from (10.9) and integrating (see also the mathematical appendix regarding Gaussian integrals):
P
P
y(x) =
p=1
tp exp
P
P
h
exp
p=1
Nadaraya-Watson
estimator
h
; kx;x2pk
2i
2L
; kx;x2pk2
i
2L
known also as Nadaraya-Watson estimator | a formula similar to (10.4).
➧ 10.4
Classification
For classi cation into a set of C classes the Bayes theorem gives:
k
P (Ck jx) =
p(xjCk )P (Ck )
p(x)
=
p(xjCk )P (Ck )
K
P
q =1
p(xjCq )P (Cq )
(10.10)
(because p(x) plays the role of a normalization factor).
Because the posterior probabilities P (C jx) is what the model should calculate, (10.10) may
be compared with (10.4), the basis functions being:
k
'k (x) =
K
P
q =1
1
10.4
p(xjCk )
p(xjCq )P (Cq )
See the non-parametric/kernel based method in the statistical appendix.
See [Bis95] pp.179{182.
10.4. CLASSIFICATION
x2
179
x2
C1
C1
C2
C2
x1
a)
Figure 10.2:
x1
b)
Bidimensional pattern vector space. The di erence between perceptron based classi ers and radial basis function classi cation: a) The perceptron represents the decision boundaries; b) the radial basis function networks
represents the classes trough the basis functions.
and the output layer having only one connection per neuron from the hidden layer, the
weight being P (Ck ) (in this case the hidden layer/number of basis function is also K ).
The interpretation of this method is that each basis function represents a class | while
the perceptron network represents the hyperplanes acting as decision boundaries. See gure 10.2.
It is possible to improve the model by considering a mixture of basis functions m = 1; M
(instead of one single basis function per class):
p(xjCk ) =
M
X
p
m=1
(xj
m) P (mjCk )
(i.e. the mixture model). The total probability density for x also changes to:
p(x) =
K
X
p
k=1
(xjCk )
P (Ck ) =
M
X
p
m=1
(xj
m) P (m)
where P (m) represents the prior probability of mixture component and it can be expressed
as:
P (m) =
K
X
P mjC
k=1
k ) P (Ck )
(
By replacing the above expressions into the Bayes theorem:
P (Ck jx) = p(xjCpk()xP) (Ck ) =
PM P mjCk p jm P Ck P m M
Pm
Xw '
m
km m
PM p jm P m
m
(
) (x
)
(
(
)
) ( )
=1
=
m=1
(x
)
(
)
=1
(x)
❖
P (m)
180
❖
wkm , 'm
CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS
)
( PP ((m
m)
=1
being added on purpose) where:
p(xjm) P (m) = P (mjx) and
'm (x) = P
M
p(xjn) P (n)
n=1
wkm = P (mPjC(km) P) (Ck ) = P (Ck jm)
(by using the Bayes theorem), i.e. the basis functions represents the posterior probability of
x being generated by the component m of the mixture model and the weights represent the
posterior probability of class Ck membership given a pattern vector generated by component
m of the mixture.
➧ 10.5
Network Learning
The learning procedure consists of two steps:
The basis functions are established and their parameters are found, usually by an
unsupervised learning procedure, i.e. only the inputs fxp g from the training set are
considered (the targets ftp g are ignored). The basis functions are usually de ned to
depend only over a distance between the pattern vector x and the training vectors
fxp g, i.e. kx ; xp k, such that they have a radial symmetry.
Then, having the basis functions properly established, the weights on output layer are
to be found.
10.5.1
Radial Basis Functions
The relation between radial basis function network and other statistical methods described
in section 10.3, suggest that the basis functions should represent the probability density of
input vectors. Then an unsupervised method may be envisaged to nd the parameters of
the basis functions, several are described below.
Subsets of data points
This procedure builds the basis functions as Gaussians:
'j (x) = exp
"
; kx ; j k
2
#
2 j2
The fj g vectors are chosen as a subset of the training set fxp g.
(10.11)
The fj g parameters are chosen all equal to a multiple of average distances between the
centers of radial functions (as de ned by vectors i ). They should be chosen such as to
allow for a small overlap between radial functions.
10.5
See [Bis95] pp. 170{171 and pp. 183{191.
10.5. NETWORK LEARNING
✍
181
Remarks:
➥
This ad hoc algorithm usually gives poor results but it may be useful as a starting
point for further optimization.
Clustering
The basis functions are built as Gaussian, as above.
The training set is divided into a number of subsets Sj , equal to the number of basis
functions, such that an error function associated with the clustering is minimized:
H
E=
kxp ; j k2
j=1 xp 2Sj
XX
At the beginning the learning set is divided into subsets at random. Then the mean j is
calculated inside each subset and each training pattern vector xp is reassigned to the subset
having the closest mean. The procedure is repeated until there are no more reassignments.
✍
Remarks:
➥
Alternatively the j vectors may be found by an on-line stochastic procedure.
First they are chosen at random from the training set. Then they are updated
according to:
j = (xp ; j )
for all xp , where represents a \learning constant". Note that this is similar
to nding the root of rj E , i.e. the minima of E , trough the Robbins-Monro
algorithm.
Gaussian mixture model
The radial basis functions are considered of the Gaussian form (10.11).
The probability density of the pattern vector is considered a mixture model of the form:
H
P (j ) 'j (x)
p(x) =
j=1
X
It is possible to use the EM (expectation-maximisation) algorithm to nd j , j and P (j )
at step s +1 from the values at step s, starting with some initial values 0j , 0j and P0 (j ):
P
P(s) (j jxp ) xp
(s+1)j = p=1P
P(s) (j jxp )
p=1
P
P
v
u
PP P s (jjxp) kxp ; s
u
u
u
p
j =u
u
t N PP P s (jjxp)
(s+1)
=1
( )
( +1)
p=1
( )
j k2
182
CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS
1 X P (j jx )
+1) (j ) =
( )
P
P
P(
s
p
s
p=1
where j = 1; H and:
P (j jx) =
P (j ) '(x)
p(x)
Supervised learning
It is possible to envisage also a supervised learning for the hidden layer. E.g. considering
the basis function of Gaussian form (10.11), y =
k
E=
1
2
P [y ( x ) ; t ]
P
p=1
k
p
kp
2
Pw
H
j =1
kj
' and the sum-of-squares error
j
which have to be minimized then the conditions are:
@E X X
=
w [y (x ) ; t ] exp
@
=1 =1
; kx ; k
XX
@E
=
w [y (x ) ; t ] exp
@
=1 =1
; kx ; k
P
K
kj
j
p
k
K
kj
p
kp
k
p
kp
2
j
2
p
!
2 2
j
2
kx ; k = 0
p
2
j
k
P
sj
p
p
2
!
j
; =0
x
sp
2
j
k
j
3
sj
j
However, to solve the above equations is very computational intensive and the solutions
have to be checked against the possibility of a local minima of E .
10.5.2
Output Layer Weights
f = fw g =1
Considering the matrix W
;'
0
:::'
H
then (10.4) gives:
kj
i
;K
j =0;H
y
(biases included) and the vectorial function 'e T =
f'e (x)
=W
(10.12)
and it may be considered as a single layer network. Then considering the sum-of-squares
error function E = 21
❖
, T
P [y(x ) ; t ] [y(x ) ; t ] the least squares technique is applied to
P
p
p=1
0
1
' (x1 )
H
y
T
p
p
nd the weights.
Considering that y is obtained by the form of a generalized linear discriminant, see (10.12),
then the following matrices are build:
0 ' (x )
=B
@ ... . . .
❖
p
'0 ( x )
..
.
P
' (x )
H
1
CA
and
0t
T =B
@ ... . . .
11
t
K1
P
t1
P
t
..
.
1
CA
KP
f = T which have the solution:
Then (according to the least squares technique) W
;
f = T y where y = T T
W
;
1
10.5. NETWORK LEARNING
183
(y being the pseudo-inverse of ).
Proof.
f = T
W
) Wf ;T = T T ) Wf = T y
.
CHAPTER
11
Error Functions
➧ 11.1
Generalities
Usually neural networks are used to generalize (not to memorize) from a set of training
data.
The most general way to describe the modeller (the neural network) is in terms of join
probability p(x; t), where t is the desired output (target) given x as input. The joined probability may be decomposed in the conditional probability of t, given x, and the probability
of x:
p(
❖
p(
❖
L
x; t) = p(tjx) p(x)
where the unconditional probability of x may be written also as:
p(
x) =
Z
p(
x; t) dt
Y
Most error functions may be expressed in terms of the maximum likelihood function L,
given the fx ; t g =1 set of training vectors:
p
p
p
;P
L=
Y
P
p
=1
Y
P
xp ; tp ) =
p(
p
=1
t jxp ) p(xp )
p( p
which represents the probability of observing the training set fx
p;
11.1
See [Bis95] pp. 194{195.
185
tp g.
x; t)
186
CHAPTER 11. ERROR FUNCTIONS
The ANN parameters are tuned such that L is maximized. It may be convenient instead to
minimize a derived function:
P
P
P
X
X
X
Ee = ; ln L = ;
ln p(tp jxp ) ;
ln p(xp )
) E = ; ln p(tp jxp ) (11.1)
p=1
p=1
p=1
error function
where, in E , the terms p(xp ) were dropped because they don't depend on network parameters. The E function is called error function.
➧ 11.2
Sum-of-Squares Error
It is assumed that the K components of the output vector are statistically independent, i.e.
K
Y
p(tjx) =
p(tk jx)
(11.2)
k=1
(tk being the k-th component of vector t).
It is assumed that the distribution of target data is Gaussian, i.e. it is composed from a
deterministic function h(x) and some Gaussian noise ":
2
"k
1
tk = hk (x) + "k where p("k ) = p
exp ;
2
2
2
2
(11.3)
Then "k = tk ; hk (x), hk (x) = yk (x; W ) because it's the model represented by the neural
network (W being the network parameters), and p("k ) = p(tk jx) (as hk (x) are purely
deterministic):
p(tk jx) =
p1
2 2
exp
) ; tk ]
; [yk (x; W
2 2
2
(11.4)
By using the above expression in (11.2) and then in (11.1), the error function becomes:
P K
1 XX
PK
2
E= 2
ln 2
[yk (xp ; W ) ; tkp ] + P K ln +
2
2
p=1 k=1
❖ tkp
RMS
where tkp is the component k of vector tp . By dropping the last two terms which are weight
(W ) independent, as well as the 1=2 from the rst term, the error function may be written
as:
P K
1 XX
E=
kyk (xp ; W ) ; tkp k2
(11.5)
2
p=1 k=1
✍
Remarks:
➥
11.2
It is sometimes convenient to use one error function for network training, e.g. sum-
See [Bis95] pp. 195{208.
11.2. SUM-OF-SQUARES ERROR
187
of-squares, and another to test over the performance achieved after training, e.g.
the root-mean-square (RMS) error:
PP ky ; t k
p p
P
1 X
p=1
ERMS =
where
h
ti =
tp
PP kt ; htik
P
p
=1
p
p=1
which have the advantage that is not growing with the number of tests P .
11.2.1
Linear Output Units
Generally, the output neuron k is applying the activation function f to the weighted sum
of zj | inputs from (last) hidden layer | in order to compute its own output:
H
X
f)
yk (x; W ) = f (ak )
where ak = wkj zj (x; W
j=0
f is the set of all weights except those involving the output layer, H is the number
where W
of neurons on hidden layer and wk0 is the bias corresponding to z0 = 1 (wkj being the
weight for connection from neuron j to neuron k).
Then, the derivative of error (11.5), with respect to the total input ak into the neuron k,
is:
P
X
df (ak )
@E
=
[yk (xp ; W ) ; tkp ]
@ ak
dak
p=1
Assuming a linear activation function f (x) = x then:
P
X
@E
@E
=
[yk (xp ; W ) ; tkp ] =
@ ak
@ yk
p=1
and the optimization of output layer weights becomes simple. Let:
P
P
X
X
hti = P1 tp and hzi = P1 z(xp )
p=1
p=1
W
w11
..
.
wK 1
1
0e e 1
0e e 1
11
1P
11
1P
C
B
B
.
.
.
.. C
.
.
...
e
e
.
.
.
.
.
=
=
@ . . ... CA
@. . . A
. A
eK1 eKP
KH
eH1 eHP
w1
H
z
;
w
t
z
Z
;
z
z
f,
W
H
, ak , zj
❖ hti, hzi
and the following Wout , Ze and Te matrices are de ned:
0
B
out = @
❖
❖
W
❖
z
out , Ze, Te
t
T
t
t
where Wout holds the weights associated with links between (last) hidden and output layers,
z
ejp = zj (xp ) ; hzj i and etkp = tkp ; htk i. Then the solution for weights matrix is:
eZey where Zey = ZeT(ZeZeT);1
Wout = T
(11.6)
ejp , ekp
t
188
CHAPTER 11. ERROR FUNCTIONS
e y , w0
❖Z
ey
Z
being the pseudo-inverse of Ze. The biases w0T =
w0
k=
y
H
X
j =1
hzi
K 0 are found trough:
(11.7)
kj j + k0
(11.8)
= hti ;
Proof. By writing explicitly the bias:
;
w
Wout
z
w10
:::
w
w
and putting the condition of minimum of E with respect to biases:
@E
@w
❖
h k i, h j i
t
z
k0
k
@y
@w
3
P 2X
H
X
4
=
wkj zj (xp ) + wk0 ; tkp 5 = 0
p=1 j=1
(zj (xp ) being the output of the hidden neuron j when the input of the network is xp ) and then:
k0 = h k i ;
w
bias
=
k
k0
@y
@E
t
H
X
j =1
kj h j i where h k i =
w
z
t
1
P
P
X
p=1
kp
t
;
h ji =
1
z
P
P
X
p=1
j (xp )
z
i.e. the bias compensate for the di erence between the average of target and the weighted average of (last)
hidden layer output.
By replacing the biases found, back into (11.5) (through (11.8)):
E
=
1
2
32
P X
K 2X
H
X
4 wkj zejp ; etkp 5
(11.9)
p=1 k=1 j =1
The minimum of E with respect to wkj is found from:
@E
kj
@w
=
P "X̀
X
p=1 s=1
#
e
wks z
esp ; tkp zejp = 0 ;
k
= 1; K
;
j
= 1; H
(11.10)
and the set of equations (11.10) may be condensed into one matrix equation:
e eT ; TeZeT = e0
(11.11)
Wout Z Z
which yields the desired result.
✍
Remarks:
➥
➥
outliers
➥
It was assumed that ZeZeT is invertible.
f weights xed. However if
The solution (11.6) was found by maintaining the W
they change, the optimal solution (11.6) changes as well.
The sum-of-squares error function is sensitive to training vectors with high error
| called outliers . It is also sensitive to misslabeled data because it leads to high
error.
11.2.2 Linear Sum Rules
❖ u,
u0
Let assume that for all targets (desired outputs) from the training set it is true that:
uT tp
+ 0 = 0 where
u
u
; u0
= const. 8
;
p
By summing over all training vectors it follows that u0 = ;uT hti and then:
uT tp
= ;uT hti
(11.12)
11.2. SUM-OF-SQUARES ERROR
189
Then it is also true that:
= uT hti
uT y
i.e. if the training set have the property (11.12) then any network output will have the same
property.
The network output is y = out z + w0 and also from (11.7): w0 = hti ; out hzi. Then:
uT y = uT ( out z + w0 ) = uT e e y (z ; hzi) + uT hti
Proof.
W
W
W
TZ
But the elements of the row matrix (vector) uT Te are:
fuT eT gp = uT e(:
T
T
; p)
; hti) = 0
= uT (tp
by virtue of (11.12).
11.2.3 Signi cance of Network Output
Let consider that the number of training data sets grows to in nity and the following error
function (using Euclidean norm):
1
= lim
!1 2
E
P
X
= 21
= 12
ky(x
p; W
P
P
=1
[ (x
);
[ (x
) ; ]2 ( jx) (x)
yk
=1 X;Y
p; W
yk
=1 X;Y
tkp
]2 (
p tk ;
x) dtk dx
k
ZZ
K
X
k
(11.13)
p
p
ZZ
K
X
k
) ; t k2
p; W
tk
p tk
p
x
dtk d
k
where Y is the unidimensional component of the output space Y related to t .
The following conditional averages are de ned:
k
❖
k
h jxi
Z
( jx)
tk p tk
tk
and ht2 jxi
dtk
Z
2
( jx)
tk p tk
k
❖
(11.14)
dtk
Yk
Yk
Then, by using the above de nitions, it's possible to write:
[y ; t ]2 = [y ; ht jxi + ht jxi ; t ]2
k
k
k
k
k
k
= [ ; h jxi]2 + 2( ; h jxi)(h jxi ; ) + [h jxi ; ]2
and by replacing into the expression of , the middle term cancels after integration over
(h jxi ; ! h jxi ; h jxi = 0)
yk
tk
yk
tk
tk
tk
tk
tk
E
tk
tk
tk
XZ
1
[ (x
=2
=1
K
E
yk
k
X
tk
tk
;W
XZ
1
) ; h jxi] (x) x + 2
[h 2 jxi ; h jxi2 ] (x)
=1
tk
2
K
p
d
tk
k
tk
(upon integration over t : the rst term is independent of p(t jx) and
k
p
d
x
(11.15)
X
R
k
Yk
( jx)
p tk
dtk
=1
as p(t jx) is assumed normated, while for the second term [ht jxi ; t ]2 ! ht jxi2 ;
2ht jxi2 + ht2 jxi).
k
k
k
k
k
k
Yk
h jxi, h 2 jxi
tk
t
k
190
CHAPTER 11. ERROR FUNCTIONS
j
p(t x)
y (x)
t
x
x
Figure 11.1:
The network output signi cance. Unidimensional input
and output space. Note that htjxi doesn't necessary coincide with the maximum of p(tjx).
In the expression (11.15) of the error function, the second term does not depend upon
network parameters W and may be dropped; the rst term is always positive and then
E > 0. The error becomes minimum (zero) when the integrand in the
rst term is zero, i.e.
yk (x; W
❖
W
)=
h jxi
(11.16)
tk
where W denotes the nal weights, after the learning process have been nished (i.e.
optimal W value).
The ANN output with sum-of-squares error function represents the average of target conditioned on input. See also gure 11.1.
To obtain the above result the following assumptions were made:
The training set is suciently large: P ! 1.
The number of weights (w parameters) is suciently large.
The absolute minimum of error function was found.
✍
Remarks:
➥
➥
The above result does not make any assumption over the network architecture
or even the existence of a neuronal model at all. It holds for any model who try
to minimize the sum-of-squares error function.
As p(t jx) is normated then each term in the sumRappearing in last expression
of E in (11.13) may be multiplied conveniently by p(t jx) dk0 = 1 such that
k
k0
E
Yk 0
may be written as:
E
=
1
ZZ
2
ky(x
p; W )
; tk2 (tjx)
p
p(x) dt dx
X;Y
(ft g were assumed statistically independent, p(tjx) =
k
Q
K
k
=1
j ) and then the
p(tk x)
11.2. SUM-OF-SQUARES ERROR
191
proof of (11.16):
hj i
y(x; W ) = t x
may be expressed similarly as above but directly in vectorial terms.
The same result may be obtained faster by considering the functional derivative of E with
respect to y (x) which is set to zero to nd its minima:
k
ZZ
E
yk (x)
[yk (x; W )
=
;t
k]
j
p(tk x) p(x) dtk dx = 0
X;Yk
The right term may be split in two integrals,
normated and then:
yk (x; W )
is independent of t ,
j
h ji
Z
k
j is
p(tk x)
tk p(tk x) dtk = tk x
yk (x; W ) =
Yk
R
(the integrand of have to be zero).
In general the result (11.16) represents the optimal solution: assuming that the data is
generated from a set of deterministic functions h (x) with superimposed zero-mean noise
then:
X
k
)
tk = hk (x) + "k
h ji h
ji
yk (x) = tk x = hk (x) + "k x = hk (x)
The variance of the target data, as function of x, is:
2
Z
[tk
k (x) =
; ht jxi]2 p(t jx) dt
k
k
k
h j i ; ht jxi2
2
= tk x
k
(11.17)
Yk
([t ; ht jxi]2 ! ht2 jxi ; 2ht jxi2 + ht jxi2 ) i.e. is exactly the residual term of the error
function (11.15) and it may be used to assign error bars to the network prediction.
k
k
✍
k
k
k
Remarks:
➥
Using the sum-of-squares error function, the network outputs are x-dependent
means of the distribution and the average variance is the residual value of the
error function at its minimum. Thus the sum-of-squares error function cannot
distinguish between the true distribution and a Gaussian distribution with the
same x-dependent mean and average variance.
11.2.4 Outer product approximation of Hessian
From the de nition (11.13) of error function, the Hessian terms are:
@2E
@ws @wq
=
K Z
X
@yk @yk
k
=1 X
@ws @wq
p(x) dx +
K Z
X
k
=1 X
@ 2 yk
@ws @wq
(yk
; ht jxi)p(x) dx
k
192
CHAPTER 11. ERROR FUNCTIONS
where ws , wq are some two weights, also theRintegration over tk was performed (yk is
independent of p(tk jx) which is normated, i.e. p(tk jx) dtk = 1).
Yk
The second term, after integration over x, cancels because of the result (11.16), such that
the Hessian becomes | after reverting back from the integral to the discrete sum:
P X
K
@2E
1X
@yk (xp ; W ) @yk (xp ; W )
=
@ws @wq P p=1 k=1
@ws
@wq
➧ 11.3
Minkowski Error
A more general Gaussian distribution for the noise ", see (11.3), is:
p("k ) =
R
1=R
2;E (1=R)
exp
;
; j"k jR
= p(tk jx) ; R = const.
(11.18)
where ;E is the Euler function1 .
By a similar procedure as used in section 11.2 (and using the likelihood function) the error
function becomes:
E=
Minkowski error
P X
K
X
p=1 k=1
jyk (xp ; W ) ; tkp jR
which is named Minkowski error.
The derivative of the error function, with respect to the weights, is
P
K
xp ; W )
@E X X
=
jy (x ; W ) ; tkp jR;1 sign(yk (xp ; W ) ; tkp ) @yk (@w
@ws p=1 k=1 k p
s
which may be evaluated by the means of backpropagation algorithm (ws being here some
weight).
✍
Remarks:
➥
The constant in front
of exp function in (11.18) ensures the normalization of
R
probability density: p("k ) d"k = 1.
Yk
➥
LR norm
11.3
1
Obviously, for R = 2 it reduces to the Gaussian distribution. For R = 1 the
distribution is called Laplacian and the corresponding error function city-block
metric.
More generally the distance jy ; tjR is named LR norm.
See [Bis95] pp. 208{210.
See the mathematical appendix.
11.4. INPUT-DEPENDENT VARIANCE
➥
193
The use of R < 2 reduces the sensitivity to outliers. E.g. by considering a network
with one output and R = 1 then:
X
P
E=
jy(x
p
p
=1
;W)
;t j
p
and, at minimum, the derivative is zero:
X
P
p
=1
;t )=0
sign(y (xp ; W )
p
condition which is satis ed if there are an equal number of points t
are t < y, irrespective of the value of di erence.
p
>y
as there
p
➧ 11.4
Input-dependent Variance
Considering the variance as depending on the input vector
distribution of the noise then, from (11.4):
p(tk
jx) = p
1
exp
2 k (x)
; [y
k(
k = k (x)
x; W ) ; t
2 2 (x)
k]
2
and a Gaussian
k
By the same means as used in section 11.2 for the sum-of-squares function (by using the
likelihood function), the error may be written as:
X X [y ( x
P
E=
K
k
p
=1 k=1
p
;W)
;t
2
kp ]
2k2 (xp )
+ ln k (xp )
Dividing by 1=P and considering the limit P ! 1 (in nite training set) then:
X ZZ [y (x; W ) ; t
E=
K
k
k
=1X;Y
k
k]
2k2 (x)
2
+ ln k (x) p(tk
jx) p(x) dt
k
dx
and, the condition of minima with respect to y is:
E
yk
= p(x)
Z
k
yk (x; W )
;t
k
k2 (x)
Yk
p(tk
jx) dt
k
=0
which means that the output of network, after the training was done, is:
h jxi
yk (x; W ) = tk
Similarly, the condition of minima for E with respect to is:
E
k (x)
11.4
= p(x)
See [Bis95] pp. 211{212.
Z
Yk
k
[y
;
(x)
1
k
x) ; t
3 (x)
k(
k]
k
2
p(tk
jx) dt
k
=0
194
CHAPTER 11. ERROR FUNCTIONS
and, after the training (E minimum), the variance is (see also (11.17)):
k2 (x) = h[tk ; htk jxi]2 jxi
The above result may be used to nd the variance as follows:
First a network is trained to minimize the sum-of-squares error function, using the
fxp ; tp gp=1;P training set.
The outputs of the above network are yk = htk jxi. These values are subtracted from
the target values tk and the result is squared. Together with the input vectors xp they
form a new training set fxp ; (tp ; htp jxi)2 gp=1;P .
The new set is used to train a new network with a sum-of-squares error function. The
outputs of this network are k2 = h[tk ; htk jxi]2 jxi.
➧ 11.5  Modeling Conditional Distributions

A very general framework for modeling conditional distributions is to build a model in two stages:
• The first stage uses the input vectors x_p to model, through an ANN, some parameters \theta(x).
• The parameters are used in a parametric model (non-ANN) to find the conditional probability density p(t|x).

✍ Remarks:
➥ The above approach may deal well with complex distributions. By comparison, a neural network with sum-of-squares error function may model just Gaussian distributions with a global variance parameter and an x-dependent mean.

As parametric model, a good choice is the mixture model. In this approach, the distribution p(x) is considered to be of the form p(x) = \sum_{m=1}^{M} p(x|m) P(m), where M is the number of mixture components m.                    ❖ M

On similar grounds, the probability distribution p(t|x) may be expressed as:

    p(t|x) = \sum_{m=1}^{M} \alpha_m(x)\, \varphi_m(t|x)                    (11.19)

where \alpha_m(x) are prior probabilities, conditioned on x, of the target vector t being generated by the m-th component of the mixture. Being probabilities, they have to satisfy the constraint of normality:                    ❖ \alpha_m

    \sum_{m=1}^{M} \alpha_m(x) = 1   and   \alpha_m(x) \in [0,1]

11.5 See [Bis95] pp. 212–222.
The kernel functions \varphi_m(t|x) represent the conditional density of the target vector t for the m-th mixture component. One possibility is to choose them to be of Gaussian type:

    \varphi_m(t|x) = \frac{1}{(2\pi)^{K/2}\, \sigma_m^K(x)} \exp\left( -\frac{\|t - \mu_m(x)\|^2}{2\sigma_m^2(x)} \right)                    (11.20)

and then the outputs of the neural network may be defined as follows:                    ❖ y_{\alpha m}, y_{\sigma m}, y_{\mu km}
• A set of outputs \{y_{\alpha m}\} for the \alpha_m parameters, which will be calculated through a softmax function:                    ❖ softmax function

    \alpha_m = \frac{\exp(y_{\alpha m})}{\sum_{n=1}^{M} \exp(y_{\alpha n})}                    (11.21)

• A set of outputs \{y_{\sigma m}\} for the \sigma_m parameters, which will be calculated through the exponential function:

    \sigma_m = \exp(y_{\sigma m})                    (11.22)

• A set of outputs \{y_{\mu km}\} for the \mu_{km} parameters, which will be represented directly by the network outputs:

    \mu_{km} = y_{\mu km}
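A minimal Scilab sketch of this output mapping (assuming the raw network outputs for one pattern are packed into vectors ya, ys and a K x M matrix ymu; the names are illustrative only):

    // Map raw network outputs to mixture parameters, eq. (11.21)-(11.22).
    function [alpha_, sigma_, mu_] = mixture_params(ya, ys, ymu)
        e = exp(ya - max(ya));        // subtracting max(ya) is a stability precaution
        alpha_ = e / sum(e);          // softmax: priors, sum to 1
        sigma_ = exp(ys);             // widths, always positive
        mu_ = ymu;                    // centres, taken directly
    endfunction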
The error function is built from the likelihood:

    E = -\ln L = -\ln \prod_{p=1}^{P} p(t_p|x_p) = -\sum_{p=1}^{P} \ln\left[ \sum_{m=1}^{M} \alpha_m(x_p)\, \varphi_m(t_p|x_p) \right]
The problem is to find the network weights in order to minimize the error function. This is done by finding an expression for the derivatives of E with respect to the network outputs y_{(\cdot)}, and then the weights are updated by using the backpropagation algorithm. The error function will be considered as a sum of P components E_p:                    ❖ E_p

    E_p = -\ln \sum_{m=1}^{M} \alpha_m(x_p)\, \varphi_m(t_p|x_p)                    (11.23)

The posterior probability that the pair (x, t) was generated by the component m of the mixture is:                    ❖ \pi_m

    \pi_m(x, t) = \frac{\alpha_m(x)\, \varphi_m(t|x)}{\sum_{n=1}^{M} \alpha_n(x)\, \varphi_n(t|x)}

and, obviously, it is normated (\sum_{m=1}^{M} \pi_m = 1).
The \partial E/\partial y_{\alpha m} derivatives. From (11.23): \partial E_p/\partial \alpha_n = -\pi_n/\alpha_n, and from (11.21):

    \frac{\partial \alpha_n}{\partial y_{\alpha m}} = \delta_{nm} \alpha_n - \alpha_n \alpha_m

(\delta_{nm} being the Kronecker symbol). E_p depends upon y_{\alpha m} through all the \alpha_n parameters, then:

    \frac{\partial E_p}{\partial y_{\alpha m}} = \sum_{n=1}^{M} \frac{\partial E_p}{\partial \alpha_n} \frac{\partial \alpha_n}{\partial y_{\alpha m}} = \alpha_m - \pi_m

(by using the property of normation for \pi_k as well).
The \partial E/\partial y_{\sigma m} derivatives. From (11.23) and (11.20):

    \frac{\partial E_p}{\partial \sigma_m} = -\pi_m \left( \frac{\|t - \mu_m\|^2}{\sigma_m^3} - \frac{K}{\sigma_m} \right)

Also, from (11.22), d\sigma_m/dy_{\sigma m} = \sigma_m and then:

    \frac{\partial E_p}{\partial y_{\sigma m}} = \frac{\partial E_p}{\partial \sigma_m} \frac{d\sigma_m}{dy_{\sigma m}} = -\pi_m \left( \frac{\|t - \mu_m\|^2}{\sigma_m^2} - K \right)

The \partial E/\partial y_{\mu km} derivatives. Since \mu_{km} = y_{\mu km}, then from (11.23) and (11.20) (Euclidean norm):

    \frac{\partial E_p}{\partial y_{\mu km}} = \frac{\partial E_p}{\partial \mu_{km}} = \pi_m \frac{\mu_{km} - t_k}{\sigma_m^2}
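Collecting the three results, a sketch of the per-pattern gradients (building on the hypothetical mixture_params above; phi implements the Gaussian kernel (11.20)):

    // Gradients of E_p with respect to the raw outputs, for one pattern (t: K x 1).
    function [ga, gs, gmu] = mdn_grads(alpha_, sigma_, mu_, t)
        [K, M] = size(mu_);
        phi = zeros(M, 1);
        for m = 1:M
            d2 = sum((t - mu_(:, m)).^2);
            phi(m) = exp(-d2 / (2*sigma_(m)^2)) / ((2*%pi)^(K/2) * sigma_(m)^K);
        end
        pi_ = (alpha_ .* phi) / sum(alpha_ .* phi);       // posteriors pi_m
        ga = alpha_ - pi_;                                // dE_p/dy_alpha
        gs = zeros(M, 1); gmu = zeros(K, M);
        for m = 1:M
            d2 = sum((t - mu_(:, m)).^2);
            gs(m) = -pi_(m) * (d2 / sigma_(m)^2 - K);     // dE_p/dy_sigma
            gmu(:, m) = pi_(m) * (mu_(:, m) - t) / sigma_(m)^2;  // dE_p/dy_mu
        end
    endfunction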
The conditional average of the target data is:

    \langle t|x \rangle = \int_Y t\, p(t|x)\, dt = \sum_{m=1}^{M} \alpha_m(x) \int_Y t\, \varphi_m(t|x)\, dt = \sum_{m=1}^{M} \alpha_m(x)\, \mu_m(x)                    (11.24)

(from (11.19) and (11.20); it was also assumed that Y \equiv R^K, see Gaussian integrals in the mathematical appendix).
The variance s(x) is:                    ❖ s

    s^2(x) = \langle \|t - \langle t|x \rangle\|^2 \,|\, x \rangle = \sum_{m=1}^{M} \alpha_m(x) \left[ \sigma_m^2(x) + \left\| \mu_m(x) - \sum_{n=1}^{M} \alpha_n(x)\, \mu_n(x) \right\|^2 \right]
Proof. From the definition of the average and using (11.19) and (11.20) (also: Euclidean norm and Y_k \equiv R):

    s^2(x) = \int_{R^K} \|t - \langle t|x \rangle\|^2\, p(t|x)\, dt = \sum_{k=1}^{K} \int_{R} (t_k - \langle t_k|x \rangle)^2\, p(t_k|x)\, dt_k
           = \sum_{k=1}^{K} \sum_{m=1}^{M} \alpha_m(x) \int_{R} (t_k - \langle t_k|x \rangle)^2\, \varphi_m(t_k|x)\, dt_k
           = \sum_{k=1}^{K} \sum_{m=1}^{M} \frac{\alpha_m(x)}{\sqrt{2\pi}\,\sigma_m(x)} \int_{R} (t_k - \langle t_k|x \rangle)^2 \exp\left( -\frac{[t_k - \mu_{km}(x)]^2}{2\sigma_m^2(x)} \right) dt_k

To calculate the Gaussian integral above, first make a change of variable \tilde{t}_k = t_k - \langle t_k|x \rangle and then a second one \hat{t}_k = \tilde{t}_k + \langle t_k|x \rangle - \mu_{km}(x), forcing also a squared \hat{t}_k. This leads to:                    ❖ \tilde{t}_k, \hat{t}_k

    \int_{R} (t_k - \langle t_k|x \rangle)^2 \exp\left( -\frac{[t_k - \mu_{km}(x)]^2}{2\sigma_m^2(x)} \right) dt_k
      = \int_{R} \tilde{t}_k^2 \exp\left( -\frac{(\tilde{t}_k + \langle t_k|x \rangle - \mu_{km}(x))^2}{2\sigma_m^2(x)} \right) d\tilde{t}_k
      = \int_{R} \hat{t}_k^2 \exp\left( -\frac{\hat{t}_k^2}{2\sigma_m^2(x)} \right) d\hat{t}_k
        + (\langle t_k|x \rangle - \mu_{km})^2 \int_{R} \exp\left( -\frac{\hat{t}_k^2}{2\sigma_m^2(x)} \right) d\hat{t}_k
        - 2(\langle t_k|x \rangle - \mu_{km}) \int_{R} \hat{t}_k \exp\left( -\frac{\hat{t}_k^2}{2\sigma_m^2(x)} \right) d\hat{t}_k

The first term, after an integration by parts, leads to a Gaussian (see the mathematical appendix) and equals \sqrt{2\pi}\,\sigma_m^3(x); the second one is directly a Gaussian, giving \sqrt{2\pi}\,\sigma_m(x)\,(\langle t_k|x \rangle - \mu_{km})^2; in the third term the integrand is an odd function, integrated over a domain symmetrical around the origin, so it is zero. The sought result is obtained after the (11.24) replacement.
➧ 11.6  Classification using Sum-of-Squares

If the sum-of-squares is used as error function then the outputs of the neurons represent the conditional average of the target data, y = \langle t|x \rangle, as proven in (11.16). In problems of classification the t vectors represent the class labels. Then the simplest way of labeling is to assign a network output to each class and to consider that an input vector x belongs to that class C_k represented by the (output) neuron k with the biggest output. This means that the target for a vector x \in C_q is \{t_k = \delta_{kq}\}_{k=1,K} (\delta_{kq} being the Kronecker symbol). This method is named one-of-K encoding. The p(t_k|x) probability may be expressed as:                    ❖ one-of-K

    p(t_k|x) = \sum_{q=1}^{K} \delta_D(t_k - \delta_{kq})\, P(C_q|x)

(\delta_D being the Dirac function and P(C_q|x) the probability of the x \in C_q event, for a given x).

From (11.16) and the above expression of p(t_k|x):

    y_k(x) = \sum_{q=1}^{K} \int_{Y_k} t_k\, \delta_D(t_k - \delta_{kq})\, P(C_q|x)\, dt_k = P(C_k|x)

i.e. when using the one-of-K encoding the network outputs represent the posterior probabilities.

11.6 See [Bis95] pp. 225–230.
198
CHAPTER 11. ERROR FUNCTIONS
The outputs of network being probabilities must sum to unity and be in the range
:
[0; 1]
By the way the targets are chosen (1-of- ) and the linear sum rules (see section 11.2.2)
k
the outputs of the network will sum to unity.
The range of the outputs may be ensured by the way the activation function is chosen
(e.g. logistic signal activation function).
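For completeness, a tiny Scilab helper that builds one-of-K targets from integer class labels (illustrative, not from the book's programs):

    // labels: P x 1 vector with entries in 1..K  ->  T: P x K one-of-K target matrix
    function T = one_of_K(labels, K)
        P = size(labels, '*');
        T = zeros(P, K);
        for p = 1:P
            T(p, labels(p)) = 1;
        end
    endfunction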
11.6.1  Hidden Neurons

The error function (11.9) may be written in a matrix form:                    ❖ S_B, S_T

    E = \frac{1}{2} \,\text{Tr}\left\{ (W_{\text{out}} \tilde{Z} - \tilde{T})^T (W_{\text{out}} \tilde{Z} - \tilde{T}) \right\}                    (11.25)

and, by replacing the solution (11.6):

    E = \frac{1}{2} \,\text{Tr}\left( \tilde{T}^T \tilde{T} - S_B S_T^{-1} \right)   where   S_B = \tilde{Z} \tilde{T}^T \tilde{T} \tilde{Z}^T ,\quad S_T = \tilde{Z} \tilde{Z}^T                    (11.26)
Proof. As \tilde{Z}^{\dagger} = \tilde{Z}^T (\tilde{Z}\tilde{Z}^T)^{-1} and, for two matrices, (AB)^T = B^T A^T, then:

    E = \frac{1}{2} \,\text{Tr}\left\{ ( \tilde{T}\tilde{Z}^{\dagger}\tilde{Z} - \tilde{T} )^T ( \tilde{T}\tilde{Z}^{\dagger}\tilde{Z} - \tilde{T} ) \right\}
      = \frac{1}{2} \,\text{Tr}\left\{ \tilde{Z}^T (\tilde{Z}\tilde{Z}^T)^{-1} \tilde{Z}\tilde{T}^T \tilde{T}\tilde{Z}^T (\tilde{Z}\tilde{Z}^T)^{-1} \tilde{Z} - \tilde{Z}^T (\tilde{Z}\tilde{Z}^T)^{-1} \tilde{Z}\tilde{T}^T \tilde{T} - \tilde{T}^T \tilde{T}\tilde{Z}^T (\tilde{Z}\tilde{Z}^T)^{-1} \tilde{Z} + \tilde{T}^T \tilde{T} \right\}

For two matrices A and B, such that B has the same dimensions as A^T, it is true that \text{Tr}(AB) = \text{Tr}(BA). This property is used in the first and third terms, by moving \tilde{Z} from left to right; thus they become identical and cancel out. As \text{Tr}(A^T) = \text{Tr}(A), the second term is transposed (again (AB)^T = B^T A^T) and then \tilde{Z} is moved from left to right:

    E = \frac{1}{2} \,\text{Tr}\left\{ -\tilde{Z}\tilde{T}^T \tilde{T}\tilde{Z}^T (\tilde{Z}\tilde{Z}^T)^{-1} + \tilde{T}^T \tilde{T} \right\}
Minimizing the error becomes equivalent to maximizing the expression:

    J = \frac{1}{2} \,\text{Tr}\left( S_B S_T^{-1} \right)                    (11.27)

The S_T matrix may be written as (see the definition of \tilde{Z}):

    S_T = \tilde{Z}\tilde{Z}^T = \sum_{p=1}^{P} [z(x_p) - \langle z \rangle][z(x_p) - \langle z \rangle]^T                    (11.28)

i.e. it represents the total covariance matrix of the (last) hidden layer with respect to the training set.

Let P_k be the number of x_p \in C_k and \langle z \rangle_{C_k} = \frac{1}{P_k} \sum_{x_p \in C_k} z(x_p) (i.e. the mean of the hidden neurons' output over the training vectors belonging to a single class). Then:                    ❖ P_k, \langle z \rangle_{C_k}

    S_B = \sum_{k=1}^{K} P_k^2\, (\langle z \rangle_{C_k} - \langle z \rangle)(\langle z \rangle_{C_k} - \langle z \rangle)^T                    (11.29)
Proof. S_B = \tilde{Z}\tilde{T}^T \tilde{T}\tilde{Z}^T = (\tilde{T}\tilde{Z}^T)^T \tilde{T}\tilde{Z}^T (as (AB)^T = B^T A^T). Considering the one-of-K encoding, then \langle t_k \rangle = \frac{1}{P} \sum_{p=1}^{P} t_{kp} = \frac{P_k}{P}, also \langle z_j \rangle = \frac{1}{P} \sum_{p=1}^{P} z_j(x_p), and then:

    \tilde{t}_{kp} = t_{kp} - \frac{P_k}{P}   and   \tilde{Z}^T(p,j) = z_j(x_p) - \frac{1}{P} \sum_{p'=1}^{P} z_j(x_{p'})

Each element of \tilde{T}\tilde{Z}^T is calculated directly; note that \sum_{p=1}^{P} t_{kp} z_j(x_p) = P_k \langle z_j \rangle_{C_k} due to the one-of-K encoding:

    \tilde{T}\tilde{Z}^T(k,j) = \sum_{p=1}^{P} \left( t_{kp} - \frac{P_k}{P} \right) \left( z_j(x_p) - \frac{1}{P} \sum_{p'=1}^{P} z_j(x_{p'}) \right) = P_k (\langle z_j \rangle_{C_k} - \langle z_j \rangle)

The final result is obtained by doing the (\tilde{T}\tilde{Z}^T)^T \tilde{T}\tilde{Z}^T multiplication.
The processing on the hidden neuron layer(s) may be seen as a non-linear transformation such that (11.27) is maximized. The (11.27) expression has a strong resemblance to the Fisher criterion. The output of the (last) hidden layer has the role of generating maximum discrimination between classes, optimal for a linear transformation (performed by the output layer).

✍ Remarks:
➥ Note that S_B contains the multiplicative factor P_k^2 under the sum, a fact which strongly raises the importance of classes well represented in the training set; see section 11.6.2.
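A small Scilab sketch of (11.28) and (11.29), computing S_T and S_B from a matrix of hidden activations (rows are patterns) and integer class labels; the names are illustrative:

    // Z: P x H hidden activations, labels: P x 1 with entries in 1..K
    function [ST, SB] = scatter_matrices(Z, labels, K)
        [P, H] = size(Z);
        zbar = mean(Z, 'r')';                 // overall mean <z>, H x 1
        ST = zeros(H, H); SB = zeros(H, H);
        for p = 1:P
            d = Z(p, :)' - zbar;
            ST = ST + d * d';                 // total covariance, eq. (11.28)
        end
        for k = 1:K
            idx = find(labels == k);
            Pk = length(idx);
            zk = mean(Z(idx, :), 'r')';       // class mean <z>_{C_k}
            SB = SB + Pk^2 * (zk - zbar) * (zk - zbar)';   // eq. (11.29)
        end
    endfunction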
11.6.2  Weighted Sum-of-Squares

For the sum-of-squares networks, the minimization of the error function results in a maximization of the J term (see (11.27)) which, for the one-of-K encoding, is dependent upon the observed prior probabilities \tilde{P}(C_k) = P_k/P of the training set. If these prior probabilities differ in the training set from the true P(C_k), then the (trained) classifier (model) built will be suboptimal.                    ❖ P(C_k), \tilde{P}(C_k)

To correct the above problem the error function may be modified by inserting a weighting factor \kappa_p for each pattern, the error becoming:                    ❖ \kappa_p

    E = \frac{1}{2} \sum_{p=1}^{P} \kappa_p \| y(x_p) - t_p \|^2   where   \kappa_p = \frac{P(C_k)}{\tilde{P}(C_k)}   for x_p \in C_k                    (11.30)

✍ Remarks:
➥ The prior probabilities P(C_k) are usually not known; then they may be assumed to be the probabilities related to some test patterns (patterns which are run through the net but not used to train it), or the adjusting coefficients \kappa_p may even be estimated by other means.
Considering the identity activation function f(a) = a, then:

    \frac{\partial E}{\partial a_k} = \frac{\partial E}{\partial y_k} \frac{df}{da_k} = \frac{\partial E}{\partial y_k} = \sum_{p=1}^{P} \kappa_p [y_k(x_p) - t_{kp}]

where (bias was considered):

    y_k = \sum_{j=1}^{H} w_{kj} z_j + w_{k0}                    (11.31)
Then the condition of minima of E with respect to w_{k0} is:

    \frac{\partial E}{\partial w_{k0}} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial w_{k0}} = \sum_{p=1}^{P} \kappa_p \left( \sum_{j=1}^{H} w_{kj} z_j(x_p) + w_{k0} - t_{kp} \right) = 0

the solution being:                    ❖ \langle t_k \rangle, \langle z_j \rangle

    w_{k0} = \langle t_k \rangle - \sum_{j=1}^{H} w_{kj} \langle z_j \rangle   where   \langle t_k \rangle = \frac{\sum_{p=1}^{P} \kappa_p t_{kp}}{\sum_{p=1}^{P} \kappa_p} ,\quad \langle z_j \rangle = \frac{\sum_{p=1}^{P} \kappa_p z_j(x_p)}{\sum_{p=1}^{P} \kappa_p}                    (11.32)
From (11.30), using (11.31) and (11.32), by the same reasoning as for (11.9) and with the same meaning for \tilde{z}_{jp} and \tilde{t}_{kp}:

    E = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} \kappa_p \left( \sum_{j=1}^{H} w_{kj} \tilde{z}_{jp} - \tilde{t}_{kp} \right)^2 = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} \left( \sum_{j=1}^{H} w_{kj} \tilde{z}_{jp} \sqrt{\kappa_p} - \tilde{t}_{kp} \sqrt{\kappa_p} \right)^2
The matrix of \kappa_p coefficients is built as:                    ❖ K

    K = \begin{pmatrix} \sqrt{\kappa_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sqrt{\kappa_P} \end{pmatrix}
    \quad\Rightarrow\quad
    K^T K = K K^T = \begin{pmatrix} \kappa_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \kappa_P \end{pmatrix}

and then, by the same reasoning as for (11.25):

    E = \frac{1}{2} \,\text{Tr}\left\{ (W_{\text{out}} \tilde{Z}' - \tilde{T}')^T (W_{\text{out}} \tilde{Z}' - \tilde{T}') \right\}                    (11.33)

where \tilde{Z}' = \tilde{Z} K and \tilde{T}' = \tilde{T} K.                    ❖ \tilde{Z}', \tilde{T}'
On the other hand, the condition of minimum for E, relative to the weights, is:

    \frac{\partial E}{\partial w_{kj}} = \sum_{p=1}^{P} \kappa_p \left( \sum_{q=1}^{H} w_{kq} \tilde{z}_{qp} - \tilde{t}_{kp} \right) \tilde{z}_{jp}
      = \sum_{p=1}^{P} \left( \sum_{q=1}^{H} w_{kq} \tilde{z}_{qp} \sqrt{\kappa_p} - \tilde{t}_{kp} \sqrt{\kappa_p} \right) \tilde{z}_{jp} \sqrt{\kappa_p} = 0

or, in matrix notation, similar to (11.11):

    W_{\text{out}} \tilde{Z}' \tilde{Z}'^T - \tilde{T}' \tilde{Z}'^T = 0                    (11.34)
which leads to the solution (see also (11.6) and related calculations; the results are similar, by making the substitutions \tilde{Z} \leftrightarrow \tilde{Z}' and \tilde{T} \leftrightarrow \tilde{T}'):

    W_{\text{out}} = \tilde{T}' \tilde{Z}'^{\dagger}   where   \tilde{Z}'^{\dagger} = \tilde{Z}'^T (\tilde{Z}' \tilde{Z}'^T)^{-1}

and the error function may be written as (see (11.26)):                    ❖ S_B', S_T'

    E = \frac{1}{2} \,\text{Tr}\left\{ \tilde{T}'^T \tilde{T}' - S_B' S_T'^{-1} \right\}   where   S_B' = \tilde{Z}' \tilde{T}'^T \tilde{T}' \tilde{Z}'^T ,\quad S_T' = \tilde{Z}' \tilde{Z}'^T
Similar to (11.28) and (11.29), considering the definition of \kappa_p and the one-of-K encoding, the S_T' and S_B' matrices may be written as:

    S_T' = \sum_{k=1}^{K} \frac{P(C_k)}{\tilde{P}(C_k)} \sum_{x_p \in C_k} [z(x_p) - \langle z \rangle][z(x_p) - \langle z \rangle]^T

    S_B' = \sum_{k=1}^{K} P^2 P^2(C_k)\, (\langle z \rangle_{C_k} - \langle z \rangle)(\langle z \rangle_{C_k} - \langle z \rangle)^T

Proof. The proof for S_B' is very similar to the proof of (11.29). Note that in the one-of-K encoding:

    \sum_{p=1}^{P} \kappa_p t_{kp} z_j(x_p) = \frac{P(C_k)}{\tilde{P}(C_k)} P_k \langle z_j \rangle_{C_k}

(of course, assuming that the training set is correctly classified) and so forth for the rest of the terms. Also \tilde{P}(C_k) = P_k/P.

11.6.3  Loss Matrix
Penalties for misclassification may be introduced in the one-of-K encoding by changing the targets to:

    t_{kp} = 1 - L_{k\ell}   for x_p \in C_\ell ,   where   L_{k\ell} = \begin{cases} 0 & \text{if } \ell = k \\ \in [0,1] & \text{otherwise} \end{cases}

L being the loss matrix. Note that for L_{k\ell} = 1 - \delta_{k\ell} the situation is reduced to the one-of-K case.

Considering the loss matrix, S_B becomes:

    S_B = \sum_{k=1}^{K} \left[ \sum_{\ell=1}^{K} P_\ell (1 - L_{k\ell}) (\langle z \rangle_{C_\ell} - \langle z \rangle) \right] \left[ \sum_{\ell'=1}^{K} P_{\ell'} (1 - L_{k\ell'}) (\langle z \rangle_{C_{\ell'}} - \langle z \rangle) \right]^T
Proof. Same as for the proof of (11.29):

    \sum_{p=1}^{P} t_{kp} z_j(x_p) = \sum_{\ell=1}^{K} \sum_{x_p \in C_\ell} (1 - L_{k\ell})\, z_j(x_p) = \sum_{\ell=1}^{K} P_\ell (1 - L_{k\ell}) \langle z_j \rangle_{C_\ell}

and so forth.
➧ 11.7  Cross Entropy

11.7.1  Two Classes Case

Considering a two-class problem, a one-output network is discussed, such that the target is either t = 1, for pattern vectors belonging to C_1, or otherwise t = 0 (x \in C_2), i.e. the network output represents the posterior probabilities P(C_1|x) = y(x) and P(C_2|x) = 1 - y(x). Then p(t|x) may be written as:

    p(t|x) = y^t(x_p)\, [1 - y(x_p)]^{1-t}                    (11.35)

and for the whole training set, giving the likelihood:

    L = \prod_{p=1}^{P} p(t_p|x_p) = \prod_{p=1}^{P} y^{t_p}(x_p)\, [1 - y(x_p)]^{1-t_p}

The error function may be taken as the negative logarithm of the likelihood function (as previously discussed):

    E = -\ln L = -\sum_{p=1}^{P} \left\{ t_p \ln y(x_p) + (1 - t_p) \ln[1 - y(x_p)] \right\}                    (11.36)

also known as cross-entropy.                    ❖ cross-entropy
The error minima, with respect to y(x_p), are found by zeroing its derivatives:

    \frac{\partial E}{\partial y(x_p)} = -\frac{t_p - y(x_p)}{y(x_p)[1 - y(x_p)]}

the minimum occurring for y(x_p) = t_p, \forall p \in \{1, \ldots, P\}; the value being:                    ❖ E_{\min}

    E_{\min} = -\sum_{p=1}^{P} \left[ t_p \ln t_p + (1 - t_p) \ln(1 - t_p) \right]                    (11.37)
✍ Remarks:
➥ Considering the logistic activation function y = f(a):

    f(a) = \frac{1}{1 + e^{-a}} ,\qquad \frac{df}{da} = f(a)[1 - f(a)]

then the derivative of the error with respect to a(x_p) (the total input to the output neuron when presented with x_p) is:                    ❖ a(x_p)

    \frac{\partial E}{\partial a(x_p)} = \frac{\partial E}{\partial y(x_p)} \frac{df(a(x_p))}{da(x_p)} = y(x_p) - t_p

11.7 See [Bis95] pp. 230–240.
For the one-of-K encoding, either t_p or 1 - t_p is 0 and then E_{\min} = 0. For the general case, the minimum value (11.37) may be subtracted from the general expression (11.36) such that E becomes:

    E = -\sum_{p=1}^{P} \left\{ t_p \ln \frac{y(x_p)}{t_p} + (1 - t_p) \ln \frac{1 - y(x_p)}{1 - t_p} \right\}                    (11.38)
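A minimal numeric sketch of the two-class cross-entropy (11.36) and of the remarkably simple output delta y - t obtained with the logistic activation (clipping y away from 0 and 1 is only an implementation precaution, not part of the theory):

    // Cross-entropy error and output deltas for a two-class, one-output network.
    function [E, delta] = cross_entropy2(y, t)
        yc = min(max(y, 1e-12), 1 - 1e-12);    // avoid log(0)
        E = -sum(t .* log(yc) + (1 - t) .* log(1 - yc));
        delta = y - t;                          // dE/da for a logistic output
    endfunction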
11.7.2  Sigmoidal Activation Functions

Let us assume that the probability density of the outputs z of the hidden neurons, conditioned on the classes C_k (k \in \{1,2\}), is of the general exponential form:

    p(z|C_k) = \exp\left[ A(\theta_k) + B(z, \varphi) + \theta_k^T z \right]                    (11.39)

where \theta_k and \varphi are some parameters defining the form of the distribution and A and B are some (fixed) functions.                    ❖ \theta_k, \varphi, A, B

From the Bayes theorem:                    ❖ a, w, w_0

    P(C_1|z) = \frac{p(z|C_1) P(C_1)}{p(z|C_1) P(C_1) + p(z|C_2) P(C_2)} = \frac{1}{1 + \exp(-a)}   where   a = \ln \frac{p(z|C_1) P(C_1)}{p(z|C_2) P(C_2)}

By replacing (11.39) in the expression of a (above):

    P(C_1|z) = \frac{1}{1 + \exp[-(w^T z + w_0)]}   where   w = \theta_1 - \theta_2 ,\quad w_0 = A(\theta_1) - A(\theta_2) + \ln \frac{P(C_1)}{P(C_2)}

i.e. the network output is represented by a logistic sigmoid activation function applied to the weighted sum of the hidden neurons' outputs (obviously only those connected to the network output).
11.7.3  Cross-Entropy Properties

Let \varepsilon_p = y(x_p) - t_p be the error for the input pattern x_p. Then the cross-entropy (11.38) becomes:                    ❖ \varepsilon_p

    E = -\sum_{p=1}^{P} \left[ t_p \ln\left( 1 + \frac{\varepsilon_p}{t_p} \right) + (1 - t_p) \ln\left( 1 - \frac{\varepsilon_p}{1 - t_p} \right) \right]                    (11.40)

and it can be seen that it depends on the relative errors (\varepsilon_p/t_p, \varepsilon_p/(1 - t_p)).

✍ Remarks:
➥ The cross-entropy error function will try to minimize the relative errors and thus gives equally good results for any kind of targets (large or small), compared with the sum-of-squares error function which tries to minimize the absolute errors (thus giving better results with large target values).
Let us consider binary outputs, i.e. t_p = 1 for x_p \in C_1 and t_p = 0 for x_p \in C_2. Then the (11.40) error becomes:

    E = -\sum_{x_p \in C_1} \ln(1 + \varepsilon_p) - \sum_{x_p \in C_2} \ln(1 - \varepsilon_p)

(by splitting the sum in (11.40) in two, separately for each class).

✍ Remarks:
➥ Considering the absolute error \varepsilon_p small, then \ln(1 \pm \varepsilon_p) \simeq \pm\varepsilon_p; also for x_p \in C_1 \Rightarrow t_p = 1 and y_p \in [0,1], then obviously \varepsilon_p < 0; similarly, for x_p \in C_2, \varepsilon_p > 0; then E \simeq \sum_{p=1}^{P} |\varepsilon_p|.
For an infinitely large training set the sum in (11.36) transforms into an integral:

    E = -\int_X \int_0^1 \{ t \ln y(x) + (1 - t) \ln[1 - y(x)] \}\, p(t|x)\, p(x)\, dt\, dx

y(x) is independent of t and then, by integration over t:                    ❖ \langle t|x \rangle

    E = -\int_X \left[ \langle t|x \rangle \ln y(x) + (1 - \langle t|x \rangle) \ln(1 - y(x)) \right] p(x)\, dx   where   \langle t|x \rangle = \int_0^1 t\, p(t|x)\, dt                    (11.41)

The value of y(x) for which E is minimum is found by zeroing its functional derivative:

    \frac{\delta E}{\delta y} = -\int_X \left[ \frac{\langle t|x \rangle}{y(x)} - \frac{1 - \langle t|x \rangle}{1 - y(x)} \right] p(x)\, dx = 0

and then y(x) = \langle t|x \rangle (\delta E/\delta y = 0 if and only if the integrand is 0), i.e. the output of the network represents the conditional average of the target data for the given input.

For the particular encoding scheme chosen for t, the posterior probabilities p(t|x) may be written as:

    p(t|x) = \delta_D(1 - t)\, P(C_1|x) + \delta_D(t)\, P(C_2|x)                    (11.42)

(\delta_D being the Dirac function). Substituting (11.42) in (11.41) and integrating gives y(x) = P(C_1|x), i.e. the output of the network is exactly what it was supposed to represent.
11.7.4  Multiple Independent Features

So far only one property of the input vectors x was present and studied in the network output, namely its membership to a particular class, a property which is mutually exclusive.

To watch out for multiple, non-exclusive, features a network with multiple outputs is required, and then y_k(x) will represent the probability that the k-th feature is present in the x input vector. Assuming independent features, then: p(t|x) = \prod_{k=1}^{K} p(t_k|x).

The presence or absence of the k-th feature may be used to classify x as belonging to one of 2 classes, e.g. C_1' if it does have it and C_2' if it doesn't. Then, by the way the meaning of y_k was chosen, the posterior probabilities p(t_k|x) may be expressed as in (11.35) and:

    p(t_k|x) = y_k^{t_k} (1 - y_k)^{1-t_k}
    \quad\Rightarrow\quad
    p(t|x) = \prod_{k=1}^{K} y_k^{t_k} (1 - y_k)^{1-t_k}
    \quad\Rightarrow\quad
    L = \prod_{p=1}^{P} p(t_p|x_p) = \prod_{p=1}^{P} \prod_{k=1}^{K} y_k^{t_{kp}}(x_p)\, [1 - y_k(x_p)]^{1-t_{kp}}
In the usual way, the error function is built as the negative logarithm of the likelihood function:

    E = -\ln L = -\sum_{p=1}^{P} \sum_{k=1}^{K} \left\{ t_{kp} \ln y_k(x_p) + (1 - t_{kp}) \ln[1 - y_k(x_p)] \right\}

Because the network outputs are independent, for each y_k the analysis for a one-output network applies. Considering this, the error function may be changed the same way as (11.38):

    E = -\sum_{p=1}^{P} \sum_{k=1}^{K} \left[ t_{kp} \ln \frac{y_k(x_p)}{t_{kp}} + (1 - t_{kp}) \ln \frac{1 - y_k(x_p)}{1 - t_{kp}} \right]
11.7.5  Multiple Classes Case

Let us consider the problem of classification with a set of mutually exclusive classes \{C_k\}_{k=1,K}, and the one-of-K encoding scheme, i.e. t_{kp} = \delta_{k\ell} for the input vector x_p \in C_\ell.

The output of (output) neuron k represents the posterior probability of C_k: P(C_k|x) = y_k(x); thus the posterior probability density of t_p is p(t_p|x_p) = \prod_{k=1}^{K} y_k^{t_{kp}}(x_p) (similarly to the two-class case, see (11.35)). Then:

    L = \prod_{p=1}^{P} p(t_p|x_p) = \prod_{p=1}^{P} \prod_{k=1}^{K} y_k^{t_{kp}}(x_p)
    \quad\Rightarrow\quad
    E = -\ln L = -\sum_{p=1}^{P} \sum_{k=1}^{K} t_{kp} \ln y_k(x_p)

The error function has a minimum when all the output values y_k(x_p) coincide with the targets t_{kp}:

    E_{\min} = -\sum_{p=1}^{P} \sum_{k=1}^{K} t_{kp} \ln t_{kp}
and, as in (11.38), the minimum may be subtracted from the general expression, the error becoming:                    ❖ E_p

    E = -\sum_{p=1}^{P} \sum_{k=1}^{K} t_{kp} \ln \frac{y_k(x_p)}{t_{kp}} = \sum_{p=1}^{P} E_p                    (11.43)

where E_p = -\sum_{k=1}^{K} t_{kp} \ln \frac{y_k(x_p)}{t_{kp}}.

To represent the posterior probabilities P(C_k|x), the network outputs should be y_k \in [0,1] and sum to unity, \sum_{k=1}^{K} y_k = 1. To ensure this, the softmax function may be used:

    y_k = \frac{\exp(a_k)}{\sum_{\ell=1}^{K} \exp(a_\ell)} = \frac{1}{1 + \exp(-A_k)}   where   A_k = a_k - \ln \sum_{\ell=1, \ell \neq k}^{K} \exp(a_\ell)                    (11.44)
and it may be seen that the activation of the output neurons is the logistic function.

Let us assume that the probability density of the outputs of the hidden neurons is of the same general exponential form as (11.39):

    p(z|C_k) = \exp\left[ A(\theta_k) + B(z, \varphi) + \theta_k^T z \right]                    (11.45)

where A, B, \theta_k and \varphi have the same significance as in section 11.7.2.

From the Bayes theorem, the posterior probability p(C_k|z) is:

    p(C_k|z) = \frac{p(z|C_k)\, P(C_k)}{p(z)} = \frac{p(z|C_k)\, P(C_k)}{\sum_{\ell=1}^{K} p(z|C_\ell)\, P(C_\ell)}                    (11.46)

By substituting (11.45) in (11.46), the posterior probability becomes:                    ❖ a_k, w_k, w_{k0}

    p(C_k|z) = \frac{\exp(a_k)}{\sum_{\ell=1}^{K} \exp(a_\ell)}   where   a_k = w_k^T z + w_{k0}   and   \begin{cases} w_k = \theta_k \\ w_{k0} = A(\theta_k) + \ln P(C_k) \end{cases}

The derivative of the error function with respect to the weighted sum of inputs a_k of the output neurons is:

    \frac{\partial E_p}{\partial a_k} = \sum_{\ell=1}^{K} \frac{\partial E_p}{\partial y_\ell} \frac{\partial y_\ell}{\partial a_k}

(because E_p depends on a_k through all the y_\ell, see (11.44)). From (11.43), respectively (11.44):

    \frac{\partial E_p}{\partial y_\ell} = -\frac{t_{\ell p}}{y_\ell(x_p)}   respectively   \frac{\partial y_\ell}{\partial a_k} = y_\ell \delta_{\ell k} - y_\ell y_k
and finally:

    \frac{\partial E_p}{\partial a_k} = y_k(x_p) - t_{kp}

the same result as for the sum-of-squares error with linear activation and the two-class case with logistic activation.

Figure 11.2: A histogram. It may be seen as a set of "bins", each containing some "objects".
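Since the softmax/cross-entropy pair again yields the delta y - t, a compact sketch (a numerically stabilized softmax; the small example is only a sanity illustration):

    // Softmax outputs, eq. (11.44), with the usual max-subtraction for stability.
    function y = softmax(a)
        e = exp(a - max(a));
        y = e / sum(e);
    endfunction

    // For E_p = -sum(t .* log(y)) with y = softmax(a), dE_p/da = y - t.
    a = [0.5; -1.2; 2.0];  t = [0; 0; 1];
    delta = softmax(a) - t;    // gradient of E_p with respect to a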
➧ 11.8  Entropy

Let us consider a random variable x and its probability density p(x). Also let us consider a set of P values \{x_p\}_{p=1,P}. By dividing the X axis (x \in X) into "bins" of width \Delta_k (for each "bin" k) and putting each x_p into the corresponding "bin", a histogram is obtained. See figure 11.2.                    ❖ \Delta_k

Let us consider the number of ways the objects in the bins may be arranged. Let P_k be the number of "objects" in bin no. k. There are P ways to pick up the first object, P - 1 ways to pick up the second one and so on; there are P! ways to pick up the whole set of P objects. Also there are P_k! ways to arrange the objects within each bin. Because the number of arrangements inside the bins doesn't count, the total number of arrangements which will end up in the same histogram is:                    ❖ P_k

    W = \frac{P!}{\prod_k P_k!}

The particular way of arranging the objects in the bins is called a microstate, while the arrangement which leads to the same p(x) is called a macrostate. The parameter representing the number of microstates for one given macrostate is named multiplicity.
✍ Remarks:
➥ The notion of entropy came from physics, where it's a measure of the system's "disorder" (higher "disorder" meaning higher entropy). In information theory it is a measure of information content.
➥ There is an analogy which may be made between the above example and a physics-related example. Consider a gas formed of molecules with different speeds (just one component of the speed will be considered). From the macroscopic (macrostate) point of view it doesn't matter which molecule has a (particular) given speed, while from the microscopic point of view, if two molecules swap speeds, there is a difference.
➥ If the number of microstates decreases for the same macrostate then the order in the system increases (till there is only one microstate corresponding to the given macrostate, in which case the system is totally ordered: there is only one way to arrange it). As entropy measures the degree of "disorder", it shall grow with the multiplicity. The results from statistical physics, corroborated with thermodynamics, lead to the definition of entropy as being equal (up to a multiplicative constant) to the logarithm of the multiplicity; in physics the constant is the Boltzmann constant k.

11.8 See [Bis95] pp. 240–245.
Based on the above remarks, the entropy in the above situation is defined as:

    S = \frac{1}{P} \ln W = \frac{1}{P} \left[ \ln P! - \sum_k \ln P_k! \right]

and, by using the Stirling formula \ln P! \simeq P \ln P - P for P \gg 1 and the relation \sum_k P_k = P, in the limit P \to \infty it becomes:

    S = -\sum_k \frac{P_k}{P} \ln \frac{P_k}{P} = -\sum_k p_k \ln p_k                    (11.47)

where p_k = P_k/P represents the probability density observed in the histogram.                    ❖ p_k

The lowest entropy (S = 0) occurs when all "objects" are in one single "bin", i.e. all probabilities p_k = 0 with the exception of one p_\ell = 1. The highest entropy occurs when all the p_k are equal.
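A two-line numeric illustration of (11.47) (the 0 ln 0 = 0 convention is handled by dropping empty bins):

    // Entropy of an observed histogram, eq. (11.47); Pk: counts per bin.
    function S = hist_entropy(Pk)
        p = Pk / sum(Pk);
        p = p(p > 0);              // empty bins contribute 0
        S = -sum(p .* log(p));
    endfunction

    disp(hist_entropy([10 0 0 0]));   // everything in one bin: S = 0
    disp(hist_entropy([5 5 5 5]));    // uniform: maximum, S = log(4)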
Proof. The proof of the above statement is obtained by using the Lagrange multiplier technique to find the extremum (see the mathematical appendix); the minimum of S will be missed due to discontinuities at 0. The constraint is the condition of normation \sum_k p_k = 1; then the Lagrange function (built on -S) will be:

    L = \sum_k p_k \ln p_k + \lambda \left[ \sum_k p_k - 1 \right]

and the extremum is found by zeroing the derivatives:

    \frac{\partial L}{\partial p_k} = \ln p_k + 1 + \lambda = 0   and   \frac{\partial L}{\partial \lambda} = \sum_k p_k - 1 = 0

then from the first equation: p_k = e^{-(1+\lambda)}, \forall k, and, considering K as the total number of "bins", by introducing this value into the second one: \lambda = -1 + \ln K. Then for p_k = 1/K the entropy is maximum.
Considering the limit K \to \infty (number of bins), the probability density is constant inside a "bin" and p_k = p(x_p \in \text{bin}_k)\, \Delta_k \equiv p(x_p)\, \Delta_k, where x_p is from inside "bin" k. Then the entropy is:

    S = -\sum_k p(x_p)\, \Delta_k \ln[p(x_p)\, \Delta_k] \simeq -\int_X p(x) \ln p(x)\, dx - \lim_{K \to \infty} \ln \Delta_k

(because \Delta_k = \text{const.} and \int_X p(x)\, dx = 1). For K \to \infty the width of the "bins" will reduce to \Delta_k \to 0 and thus the term \lim_{K \to \infty} \ln \Delta_k diverges and, to keep a meaning, it is dropped from the expression of the entropy (anyway what matters most is the change of entropy \Delta S, and the divergent term disappears when calculating it; this is also the reason for the name "differential entropy").                    ❖ differential entropy

Definition 11.8.1. The general expression of the differential entropy is:

    S = -\int_X p(x) \ln p(x)\, dx
The distribution with maximum entropy, subject to the following constraints:
• \int_{-\infty}^{\infty} p(x)\, dx = 1, i.e. normalization of the distribution;
• \int_{-\infty}^{\infty} x\, p(x)\, dx = \mu, i.e. existence of a mean;
• \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx = \sigma^2, i.e. existence of a variance;

is the Gaussian:

    p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
Proof. The distribution is found through the Lagrange multipliers method (see the mathematical appendix). The Lagrange function is:

    L = \int_{-\infty}^{\infty} p(x) \left[ \ln p(x) + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \right] dx - \lambda_1 - \lambda_2 \mu - \lambda_3 \sigma^2

Then the extremum of S is found by zeroing the derivatives of L:

    \frac{\partial L}{\partial p} = \int_{-\infty}^{\infty} \left[ \ln p(x) + 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \right] dx = 0                    (11.48)

while the other derivative equations, \partial L/\partial \lambda_1, \partial L/\partial \lambda_2, \partial L/\partial \lambda_3, just lead back to the imposed conditions. The derivative (11.48) may be zero for arbitrary variations of p only if the integrand is zero. Then the searched probability density is of the form:

    p(x) = \exp\left[ -1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2 \right]                    (11.49)

and the \lambda_1, \lambda_2, \lambda_3 constants are found from the conditions imposed.

From the first condition:

    1 = \int_{-\infty}^{\infty} \exp\left[ -1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2 \right] dx
      = \exp\left[ -1 - \lambda_1 + \frac{\lambda_2^2}{4\lambda_3} - \lambda_2 \mu \right] \int_{-\infty}^{\infty} \exp\left[ -\lambda_3 \left( x - \mu + \frac{\lambda_2}{2\lambda_3} \right)^2 \right] dx
      = \exp\left[ -1 - \lambda_1 + \frac{\lambda_2^2}{4\lambda_3} - \lambda_2 \mu \right] \sqrt{\frac{\pi}{\lambda_3}}                    (11.50)

(see also the Gaussian integrals, mathematical appendix).

From the second condition, similarly as above, and using the change of variable \tilde{x} = x - \mu + \lambda_2/(2\lambda_3) (one of the resulting integrals cancels, its integrand being an odd function² integrated over an interval symmetric relatively to the origin):

    \mu = \int_{-\infty}^{\infty} x \exp\left[ -1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2 \right] dx
        = \exp\left[ -1 - \lambda_1 + \frac{\lambda_2^2}{4\lambda_3} - \lambda_2 \mu \right] \left( \mu - \frac{\lambda_2}{2\lambda_3} \right) \sqrt{\frac{\pi}{\lambda_3}}

and by using (11.50) it follows that \mu = \mu - \lambda_2/(2\lambda_3), thus \lambda_2 = 0. Replacing back into (11.50) gives \exp(1 + \lambda_1) = \sqrt{\pi/\lambda_3}.

From the third condition, as \lambda_2 = 0 (integration by parts):

    \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 \exp\left[ -1 - \lambda_1 - \lambda_3 (x - \mu)^2 \right] dx
             = -\frac{e^{-1-\lambda_1}}{2\lambda_3} \int_{-\infty}^{\infty} (x - \mu)\, d \exp\left[ -\lambda_3 (x - \mu)^2 \right]
             = \frac{e^{-1-\lambda_1}}{2\lambda_3} \sqrt{\frac{\pi}{\lambda_3}}

and then finally \lambda_3 = \frac{1}{2\sigma^2} and e^{-1-\lambda_1} = \frac{1}{\sqrt{2\pi}\,\sigma}. The wanted distribution is found by replacing the \lambda_{1,2,3} values back into (11.49).

² A function f is odd if f(-x) = -f(x).
Another way of interpreting the entropy is to consider it as the amount of information. It is reasonable to consider that the amount of information and the probability are somehow interdependent, i.e. S = S(p).

The entropy is a continuous, monotonic function of p. A certain event has probability p = 1 and bears no information: S(1) = 0.

Let us now consider two events A and B, statistically independent and having probabilities p_A and p_B. The probability of both events occurring is p_A p_B and the entropy associated is S(p_A p_B). If event A has occurred then the information given by event B is S(p_A p_B) - S(p_A) and, on the other hand, S(p_B). Then:

    S(p_A p_B) = S(p_A) + S(p_B)

From the above result it follows that S(p^2) = 2S(p) and, by induction, S(p^n) = nS(p). Then, by the same means, S(p) = S\{(p^{1/n})^n\} = nS(p^{1/n}), which may be immediately extended to S(p^{n/m}) = \frac{n}{m} S(p). Finally, from the continuity of S:

    S(p^x) = xS(p)

and thus it also may be written as:

    S(p) = S\{(1/2)^{-\log_2 p}\} = -S(1/2) \log_2 p

where S(1/2) is a multiplicative constant; by choosing it to be equal to 1, the entropy/information is said to be expressed in bits. By choosing the natural logarithm, S(p) = -S(1/e) \ln p, and with S(1/e) = 1 the information will be expressed in nats.                    ❖ bits, nats
✍ Remarks:
➥ The above result may be explained by the fact that, for independent events, the information is additive (there may be information about event A and event B) while the probability is multiplicative.

Let us consider a discrete random variable x which may take one of the values \{x_p\} and which has to be transmitted, or rather information about it. For each x_p the information is -\ln p(x_p). Then the expected/average information/entropy to be transmitted (so as to establish which of the possible x_p values x has now) is:                    ❖ noiseless coding theorem

    S = -\sum_p p(x_p) \ln p(x_p)

the result being known also as the noiseless coding theorem.

For a continuous vectorial variable x, consider that usually the true distribution p(x) is not known but rather the estimated one \tilde{p}(x) (the associated information being -\ln \tilde{p}(x)). Then the average/expected information relative to x is:                    ❖ \tilde{p}(x)

    S = -\int_X p(x) \ln \tilde{p}(x)\, dx                    (11.51)

✍ Remarks:
➥ The entropy may be written as:

    S = -\int_X p(x) \ln \frac{\tilde{p}(x)}{p(x)}\, dx - \int_X p(x) \ln p(x)\, dx

where the first term represents the asymmetric divergence between p(x) and \tilde{p}(x)³. It is shown in the "Pattern Recognition" chapter that S is minimum for \tilde{p}(x) = p(x).

For a neural network which has the outputs y_k, the entropy (11.51), for one x_p, is:

    S_p = -\sum_{k=1}^{K} t_k \ln y_k(x_p)

because t_k represents the true probability (desired output) while y_k(x_p) represents the estimated (calculated) probability (actual output). For the whole training set the cross-entropy is:

    S = -\sum_{p=1}^{P} \sum_{k=1}^{K} t_{kp} \ln y_k(x_p)

the result being valid for all networks for which t_{kp} and y_k represent probabilities.

³ Known also as the Kullback-Leibler distance.
➧ 11.9  Outputs as Probabilities

Due to the fact that many error functions lead to the interpretation of the network output as probabilities, the problem is to find the general conditions which lead to this result.

It will be assumed that the error function is additive with respect to the number of training vectors, E = \sum_{p=1}^{P} E_p, and that each E_p is a sum over the components of the target and actual output vectors:

    E_p = \sum_{k=1}^{K} f(t_{kp}, y_k(x_p))                    ❖ f

where f is a function to be found, assumed to be of the form:

    f(t_{kp}, y_k(x_p)) = f(|y_k(x_p) - t_{kp}|)

i.e. it depends only on the distance between the actual and desired output.

Considering an infinite training set, the expected (average) per-pattern error is:                    ❖ \langle E_p \rangle

    \langle E_p \rangle = \sum_{k=1}^{K} \iint_{X,Y} f(|y_k(x) - t_k|)\, p(t|x)\, p(x)\, dt\, dx

For the one-of-K encoding scheme, and considering statistically independent (network) outputs, the conditional probability of the target may be written as:

    p(t|x) = \prod_{\ell=1}^{K} \left[ \sum_{q=1}^{K} \delta_D(t_\ell - \delta_{\ell q})\, P(C_q|x) \right] = \prod_{\ell=1}^{K} p(t_\ell|x)                    (11.52)

11.9 See [Bis95] pp. 245–247.
and then the expectation of the error function becomes:

    \langle E_p \rangle = \sum_{k=1}^{K} \int_X \left[ \prod_{\ell=1, \ell \neq k}^{K} \int_{Y_\ell} p(t_\ell|x)\, dt_\ell \right] \left[ \int_{Y_k} f(|y_k(x) - t_k|)\, p(t_k|x)\, dt_k \right] p(x)\, dx

where the integrals over Y_{\ell \neq k} are equal to 1 due to the normalization of the probability (or, otherwise, by substituting the value of p(t_\ell|x) and doing the integral), and the p(t_k|x) probability density may be written as (see (11.52); the cases k = q and k \neq q were considered):

    p(t_k|x) = \delta_D(t_k - 1)\, P(C_k|x) + \sum_{q=1, q \neq k}^{K} \delta_D(t_k)\, P(C_q|x) = \delta_D(t_k - 1)\, P(C_k|x) + \delta_D(t_k)\, [1 - P(C_k|x)]

(because \sum_{q=1}^{K} P(C_q|x) = 1, normalization).
Finally the average error becomes:

    \langle E_p \rangle = \sum_{k=1}^{K} \int_X \left\{ \int_{Y_k} f(|y_k(x) - t_k|)\, \delta_D(t_k - 1)\, P(C_k|x)\, dt_k + \int_{Y_k} f(|y_k(x) - t_k|)\, \delta_D(t_k)\, [1 - P(C_k|x)]\, dt_k \right\} p(x)\, dx
                        = \sum_{k=1}^{K} \int_X \left\{ f(1 - y_k)\, P(C_k|x) + f(y_k)\, [1 - P(C_k|x)] \right\} p(x)\, dx

because the first integral over Y_k is for the case t_k = 1 while the second is for the case t_k = 0; obviously y_k \in [0,1].
The conditions of minima for \langle E_p \rangle are found by setting its functional derivative with respect to y_k(x) to zero:

    \frac{\delta \langle E_p \rangle}{\delta y_k(x)} = -f'(1 - y_k)\, P(C_k|x) + f'(y_k)\, [1 - P(C_k|x)] = 0

where f' is the derivative of f. This leads to:                    ❖ f'

    \frac{f'(1 - y_k)}{f'(y_k)} = \frac{1 - P(C_k|x)}{P(C_k|x)}

and, considering that the network outputs are probabilities (this was the hypothesis):

    \frac{f'(1 - y_k)}{f'(y_k)} = \frac{1 - y_k}{y_k}                    (11.53)

A general class of functions which satisfies (11.53) is:

    f(y) = \int y^r (1 - y)^{r-1}\, dy ,\quad r = \text{const.}

and from this class (f'(1 - y) = df(1 - y)/d(1 - y)):
• r = 1 \Rightarrow f(y) = y^2/2, i.e. the sum-of-squares error function;
• r = 0 \Rightarrow f(y) = -\ln(1 - y) = -\ln(1 - |y|), i.e. the cross-entropy error function.
✍ Remarks:
➥ The Minkowski error function f(y) = y^R does not satisfy (11.53) unless R = 2, which leads to the already known sum-of-squares error. In this case the outputs of the network are:

    y_k = \frac{P(C_k|x)^{1/(R-1)}}{P(C_k|x)^{1/(R-1)} + [1 - P(C_k|x)]^{1/(R-1)}}

On the other hand the decision boundaries still give the minimum of misclassification, because the y_k are monotonic functions of P(C_k|x).
CHAPTER 12
Parameter Optimization

➧ 12.1  Error Surfaces

Generally the error may be represented as a surface E = E(W)¹ in the N_W + 1 dimensional space, where N_W is the total number of weights (see also figure 12.2 on page 219).                    ❖ N_W

The goal is to find the minima of the error function, where \nabla E = 0; however, note that this condition is not enough to find the absolute minima because it is also true for local minima, maxima and saddle points.

In general it is not possible to find the solution W in a closed form. Then a numerical approach is taken, to find it by searching the weights space in incremental steps (t = 1, \ldots) of the form W^{(t+1)} = W^{(t)} + \Delta W^{(t)}. However, usually, the algorithms do not guarantee the finding of the absolute minima, and even a saddle point may get them stuck.

On the other hand, the weight space has a high degree of symmetry² and thus many local and global minima which give the same value for the error function; then a relatively fast convergence may be achieved starting from a random point (i.e. a local/global minimum will be relatively close wherever the starting point is).

It was shown³ that the optimum value for the network is obtained when \langle y_k|x \rangle = \langle t_k|x \rangle, i.e. the actual average output y_k equals the desired output t_k, both conditioned on the input x. However this equality was obtained by considering the training set as infinite, while in practice the training set is always finite and covers just a tiny amount of all possible cases. Then even a local minimum may give a good generalization.

12.1 See [Bis95] pp. 253–256.
¹ Such a surface was drawn in the "Backpropagation" chapter for 2 weights.
² See the "Multi Layer Neural Networks" chapter.
³ See the "Error Functions" chapter.
➧ 12.2  Local Quadratic Approximation

Let us consider the Taylor series development of the error E around a point W_0; note that here W is being seen as a vector:

    E(W) = E(W_0) + (W - W_0)^T \nabla E|_{W_0} + \frac{1}{2} (W - W_0)^T H|_{W_0} (W - W_0)                    (12.1)

where \{H\}_{ij} = \frac{\partial E}{\partial w_i \partial w_j} is the Hessian; note that here the Hessian will be considered as a matrix. Then the gradient of E with respect to W may be approximated as:                    ❖ H

    \nabla E = \nabla E|_{W_0} + H|_{W_0} (W - W_0)

Considering a minimum point W^*, then \nabla E|_{W^*} = 0 and:                    ❖ W^*

    E(W) = E(W^*) + \frac{1}{2} (W - W^*)^T H|_{W^*} (W - W^*)                    (12.2)

Because the Hessian is a symmetric matrix, it is possible to find an orthonormal set of eigenvectors⁴ \{u_i\} with the corresponding eigenvalues \{\lambda_i\} such that:                    ❖ u_i, \lambda_i

    H u_i = \lambda_i u_i ,\quad u_i^T u_j = \delta_{ij}                    (12.3)

(\lambda_i and u_i being calculated at the minimum of E, given by the point W^*).

By writing (W - W^*) in terms of \{u_i\}:

    W - W^* = \sum_i \alpha_i u_i                    (12.4)

(where \alpha_i are the coefficients of the \{u_i\} development) and replacing in (12.2):                    ❖ \alpha_i

    E(W) = E(W^*) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2

i.e. the constant-error hyper-surfaces in the weights space are hyper-ellipsoids oriented with the axes parallel to the orthonormated \{u_i\} system and having the lengths inversely proportional to the square roots of the Hessian eigenvalues. See figure 12.1 on the facing page.
✍ Remarks:
➥ From (12.2):

    \Delta E = E(W) - E(W^*) = \frac{1}{2} (W - W^*)^T H|_{W^*} (W - W^*)

then W^* is a minimum point if and only if H is positive definite⁵ (\Delta E > 0 \Leftrightarrow E(W) > E(W^*)).

12.2 See [Bis95] pp. 257–259.
⁴ See the mathematical appendix.
⁵ See previous footnote.
Figure 12.1: A constant error surface projection into a bidimensional weights space is an ellipse having the axes parallel with the Hessian eigenvectors and lengths inversely proportional with the square roots of the Hessian eigenvalues.

➥ The quadratic approximation of the error function (12.1) involves the knowledge of N_W(N_W + 3)/2 terms (N_W for \nabla E and N_W(N_W + 1)/2 for the symmetrical Hessian matrix). So, to find a minimum will require at least O(N_W^2) equations, each needing at least O(N_W) steps, i.e. a total of O(N_W^3) steps.
➥ By using the gradient information and the backpropagation algorithm, the required steps are reduced to O(N_W^2).
➥ It was shown⁶ that for linear output neurons it is possible to find the related weights in one step.
➧ 12.3  Initialization and Termination of Learning

Usually the weights are initialized with random values, to avoid problems due to the weight-space symmetry. However there are two restrictions:
• If the initial weights are too big then the activation functions f will have values in the saturation region (e.g. see the sigmoidal activation function) and their derivatives f' will be small, leading to a small error gradient as well, i.e. an approximately flat error surface and, consequently, slow learning.
• If the initial weights are too small then the activation functions f will be (nearly) linear and their derivatives will be quasi-constant; the second derivatives will be small and then the Hessian will be small, meaning that around minima the error surface will be approximately flat and, consequently, learning will be slow (see section 12.2).

This suggests that the weighted sum of inputs, for a sigmoidal activation function, should be of order unity.

⁶ See the chapter "Single Layer Neural Networks".
12.3 See [Bis95] pp. 260–263.
The weights are usually drawn from a symmetrical Gaussian distribution with 0 mean (there is no reason to choose any other mean, due to symmetry). To see the choice for the variance of the above distribution, let us consider a set of inputs \{x_i\}_{i=1,N} with zero mean, \langle x_i \rangle = 0, and unit variance, \langle x_i^2 \rangle = 1 (calculated over the training set).

The output of a neuron is:

    z = f(a)   where   a = \sum_{i=0}^{N} w_i x_i

(for i = 0 the weight w_0 represents the bias and x_0 = 1).

The weights are chosen randomly, independent of \{x_i\}, and then the mean of a is:

    \langle a \rangle = \sum_{i=0}^{N} \langle w_i x_i \rangle = \sum_{i=0}^{N} \langle w_i \rangle \langle x_i \rangle = 0

(as \langle w_i \rangle = 0) and, considering \{w_i\} statistically independent, \langle w_i w_j \rangle = \sigma^2 \delta_{ij}, the variance is:

    \langle a^2 \rangle = \left\langle \left( \sum_{i=0}^{N} w_i x_i \right) \left( \sum_{j=0}^{N} w_j x_j \right) \right\rangle = \sum_{i=0}^{N} \langle w_i^2 \rangle \langle x_i^2 \rangle = (N + 1)\sigma^2 \simeq N\sigma^2

(for non-bias weights the sum is from i = 1 to N and then \langle a^2 \rangle = N\sigma^2). Because \langle a^2 \rangle \simeq 1 is wanted (as discussed above), the parameter should be chosen \sigma \simeq N^{-1/2}.
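A one-line Scilab illustration of this initialization rule (rand with the 'normal' option draws from a zero-mean, unit-variance Gaussian):

    // Initialize an H x (N+1) weight matrix (bias included) with sigma = 1/sqrt(N).
    N = 10; H = 5;
    W = rand(H, N + 1, 'normal') / sqrt(N);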
Another way to improve network performance is to train multiple instances of the same network, but with different sets of initial weights, and to choose among those which give the best results. This method is called a committee of networks.                    ❖ committee of networks

The criteria for stopping the learning process may be one of the following:
• Stop after a fixed number of steps.
• Stop when the error function has become smaller than a specified amount.
• Stop when the change in the error function (\Delta E) has become smaller than a specified amount.
• Stop when the error on an (independent) validation set begins to increase.
➧ 12.4  Gradient Descent

A simple, geometrical explanation of the gradient descent method is by means of the representation of the error surface in the weights space. See figure 12.2 on the next page.

The gradient of the error function relative to the weights, \nabla E, points towards the set of weights which will give maximum error. Then the weights have to be moved against the direction of \nabla E; note that this does not mean a movement towards the minimum point. See figure 12.2 on the facing page.

Figure 12.2: The error surface for a bidimensional weights space. The steps t, t+1 and t+2 are also shown. The weights are moved against the error maximum (and not towards the minimum).

At step t = 0 (in discrete time approximation) the weights are initialized with the value W_0 (usually randomly selected). Then at some step t + 1 they are adjusted following the formula:

    \Delta W^{(t+1)} = W^{(t+1)} - W^{(t)} = -\eta \nabla E|_{W^{(t)}}                    (12.5)

where \eta is a parameter governing the speed of learning, named the learning rate/constant, controlling the distance between W^{(t+1)} and W^{(t)} (see also figure 12.2). This type of learning is named a batch process (it uses all training vectors at once every time the gradient is calculated). (12.5) is also known as the delta rule.                    ❖ learning rate, batch learning

Alternatively the same method (as above) may be used but with one pattern at a time:

    \Delta W^{(t+1)} = W^{(t+1)} - W^{(t)} = -\eta \nabla E_p|_{W^{(t)}}

where p denotes a pattern from the training set, e.g. the training patterns may be numbered in some way and then considered in the order p = t (first pattern taken at first step, \ldots, and so on). This type of learning is named a sequential process (it uses just one training vector at a time).                    ❖ sequential learning

12.4 See [Bis95] pp. 263–272.
✍ Remarks:
➥ The gradient descent technique is very similar to the Robbins-Monro algorithm⁷ and it becomes identical if the learning rate is chosen of the form \eta \propto 1/t, and thus the convergence is assured; however this alternative leads to a very long training time.
➥ The choice of \eta may be critical to the learning process. A large value may lead to an overshooting of the minimum, especially if it's narrow and steep (in terms of error surface), and/or to oscillations between 2 areas (points) in the weights space. A small value will lead to a long learning time.
➥ Compared to batch learning, the sequential process is less sensitive to training data redundancy (a duplicate of a training vector will be used twice, in two separate steps, thus usually improving learning, rather than being used twice at each learning step).
➥ It is also possible to use a mixture learning, i.e. to divide the training set into subsets and use each subset for a batch process. This technique is especially useful if the training algorithm is intrinsically of the batch type.

⁷ See the chapter "Pattern Recognition".
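A skeletal Scilab version of the batch delta rule (12.5); grad_E is a hypothetical function returning \nabla E over the whole training set:

    // Batch gradient descent: W(t+1) = W(t) - eta * grad E(W(t)).
    eta = 0.1;
    for t = 1:1000
        g = grad_E(W);            // gradient over all training vectors (batch)
        W = W - eta * g;
        if norm(g) < 1e-6 then break; end
    end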
12.4.1  Learning Parameter and Convergence

Considering the quadratic approximation of the error function in the neighborhood of the minimum, from (12.2), (12.3) and (12.4):

    \nabla E = \sum_i \alpha_i \lambda_i u_i                    (12.6)

From (12.4):

    \Delta W^{(t+1)} = \sum_i \Delta\alpha_i u_i   where   \Delta\alpha_i = \alpha_i^{(t+1)} - \alpha_i^{(t)}                    (12.7)

and by replacing (12.6) and (12.7) in (12.5), and because the eigenvectors u_i are orthonormated:

    \Delta\alpha_i = -\eta \lambda_i \alpha_i^{(t)} \quad\Rightarrow\quad \alpha_i^{(t+1)} = (1 - \eta \lambda_i)\, \alpha_i^{(t)}                    (12.8)

Again from (12.4), by multiplying with u_i^T to the left, and from the orthonormation of \{u_i\}:

    u_i^T (W - W^*) = \alpha_i

i.e. \alpha_i represents the distance (in the weights space) to the minimum, along the direction given by u_i.

After T steps the iterative usage of (12.8) gives:

    \alpha_i^{(T)} = (1 - \eta \lambda_i)^T \alpha_i^{(0)}                    (12.9)

A convergent process means that \lim_{t \to \infty} W^{(t)} = W^*, i.e. \lim_{t \to \infty} \alpha_i^{(t)} = 0 (by considering the significance of \alpha_i as discussed above). Then, to have a convergent learning, formula (12.9) shows that it is necessary to impose the condition:                    ❖ \lambda_{\max}

    |1 - \eta \lambda_i| < 1 ,\ \forall i \quad\Rightarrow\quad \eta < \frac{2}{\lambda_{\max}}

where \lambda_{\max} is the biggest eigenvalue of the Hessian.

Figure 12.3: Slow learning for a small \lambda_{\min}/\lambda_{\max} ratio. The longest axis of the E = const. ellipsoid surface is proportional to \lambda_{\min}^{-1/2} while the shortest one is proportional to \lambda_{\max}^{-1/2}. The gradient is perpendicular to the E = const. surface. The weight vector is moving slowly, with oscillations, towards the minimum of E, the center of the ellipsoids.

Considering the maximum possible value \eta = 2/\lambda_{\max}, and replacing into (12.8), the speed of learning will be decided by the time needed for the convergence of the \alpha_i corresponding to the smallest eigenvalue, i.e. by the size of the factor 1 - \frac{2\lambda_{\min}}{\lambda_{\max}}. If the ratio \lambda_{\min}/\lambda_{\max} is very small then the convergence will be slow. See figure 12.3.

So far, in equation (12.5), the time, given by t, was considered discrete. By considering continuous time, the equation governing the weights' change during learning becomes:

    \frac{dW}{dt} = -\eta \nabla E

which represents the equation of movement for a point in the weights space, whose position is given by W, subject to a potential field E(W) and viscosity 1/\eta.
12.4.2  Momentum

By adding a term to the equation (12.5) and changing it to:

    \Delta W^{(t+1)} = -\eta \nabla E|_{W^{(t)}} + \mu \Delta W^{(t)}                    (12.10)

it is possible to increase the speed of convergence. The parameter \mu is named momentum.                    ❖ \mu

(12.10) may be rewritten into a continuous form as follows:

    W^{(t+1)} - W^{(t)} = -\eta \nabla E|_{W^{(t)}} + \mu (W^{(t)} - W^{(t-1)})
    \quad\Rightarrow\quad
    (1 - \mu)\, \Delta W^{(t+1)} = -\eta \nabla E|_{W^{(t)}} - \mu\, \Delta^2 W^{(t+1)}                    (12.11)

where \Delta W^{(t+1)} = W^{(t+1)} - W^{(t)} and \Delta^2 W^{(t+1)} = \Delta W^{(t+1)} - \Delta W^{(t)}.                    ❖ \Delta W^{(t+1)}, \Delta^2 W^{(t+1)}

Figure 12.4: Learning with and without momentum (W', W'' and W''' are 3 points, consecutive in time). Figure a) shows learning without momentum: \Delta W decreases in proportion with \nabla E, i.e. decreases around minima. Figure b) shows learning with momentum (without oscillations): the learning rate effectively increases, compensating for the decrease of \nabla E. Figure c) shows oscillations; most of the additional quantities introduced by momentum cancel from one oscillation to the next.

By switching to the limit, the terms in (12.11) have to be of the same infinitesimal order; let \tau be equal to the unit of time (introduced for dimensionality correctness), then:

    (1 - \mu)\, \frac{dW}{dt}\, \tau = -\eta \nabla E - \mu\, \frac{d^2 W}{dt^2}\, \tau^2

which gives finally the differential equation:                    ❖ m, \nu

    m \frac{d^2 W}{dt^2} + \nu \frac{dW}{dt} = -\nabla E   where   m = \frac{\mu \tau^2}{\eta} ,\quad \nu = \frac{(1 - \mu)\tau}{\eta}                    (12.12)

pointing out that the momentum shall be chosen \mu \in (0,1).

✍ Remarks:
➥ (12.12) represents the equation of movement of a particle in the weights space W, having "mass" m, subject to friction (viscosity) proportional to the speed, defined by \nu, and to a conservative force field E. W represents the position, \frac{dW}{dt} the speed, \frac{d^2W}{dt^2} the acceleration; finally E and -\nabla E are the potential, respectively the force, of the conservative field.

To understand the effect given by the momentum (see also figure 12.4 on the preceding page) two cases may be analyzed:
• The gradient is constant, \nabla E = const. Then, by applying iteratively (12.10):

    \Delta W = -\eta \nabla E (1 + \mu + \mu^2 + \cdots) \simeq -\frac{\eta}{1 - \mu} \nabla E

(because \mu \in (0,1) and then \lim_{n \to \infty} \mu^n = 0), i.e. the learning rate effectively increases from \eta to \eta/(1 - \mu).
• In a region with high curvature, where the gradient changes direction almost completely (opposite direction), generating weight oscillations, the effects of momentum will tend to cancel from one oscillation to the next.

The above discussion is also true for the components of the vector W. The advancement towards the error minimum alongside the direction of the eigenvector corresponding to the smallest eigenvalue of the Hessian is accelerated. See figure 12.5.

Figure 12.5: The learning, alongside the Hessian eigenvector corresponding to the smallest eigenvalue, is accelerated compared with plain gradient descent. See also figure 12.3 on page 221 for comparison.
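The discrete rule (12.10) in Scilab form (again with a hypothetical grad_E):

    // Gradient descent with momentum, eq. (12.10).
    eta = 0.1; mu = 0.9;
    dW = zeros(W);                 // previous update, same shape as W
    for t = 1:1000
        dW = -eta * grad_E(W) + mu * dW;
        W = W + dW;
    end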
12.4.3  Other Gradient Descent Improvement Techniques

Bold descent

This algorithm improvement is based on the following idea:
• If, after a step, the error increases, i.e. \Delta E > 0, this means that the minimum point was overshot. Then the change is rejected (reversed), because otherwise the weight value would be further from the minimum point, and the learning rate is decreased.
• If, after a step, the error decreases, i.e. \Delta E < 0, then the weight adjustment is accepted, but the learning rate is considered to be too small and consequently it is increased.

The algorithm changes the learning rate, at each step, as follows:

    \eta^{(t+1)} = \begin{cases} \rho\, \eta^{(t)} & \text{if } \Delta E < 0 ,\ \rho > 1 \\ \sigma\, \eta^{(t)} & \text{if } \Delta E > 0 ,\ \sigma < 1 \end{cases}

and the weight change is rejected if \Delta E > 0. In practice \rho \simeq 1.1 and \sigma \simeq 0.5.
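As a sketch (the error and gradient routines err_E and grad_E are placeholders):

    // Bold descent: adapt eta and reject uphill steps.
    eta = 0.1; rho = 1.1; sig = 0.5;
    E_old = err_E(W);
    for t = 1:1000
        W_new = W - eta * grad_E(W);
        if err_E(W_new) < E_old then
            W = W_new; E_old = err_E(W); eta = rho * eta;   // accept, speed up
        else
            eta = sig * eta;                                // reject, slow down
        end
    end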
Quick backpropagation

The quick backpropagation algorithm makes the following assumptions:                    ❖ a, b, c
• The weights are independent. This is true only if the Hessian were diagonal, i.e. its eigenvectors were parallel to the weight axes, which in practice is seldom true (see also figure 12.3 on page 221: the system \{u_i\} is usually not parallel with the W system).
• The error function may be approximated with a quadratic polynomial, i.e. a parabola: E(w_i) = a + b w_i + c w_i^2, with a, b, c = const.

and then the weights are changed to the minimum of the error function by using 2 estimates of the error gradient.

Assuming the above hypotheses, then:

    E(w_i) = a + b w_i + c w_i^2 \quad\Rightarrow\quad
    \begin{cases}
      \left. \frac{\partial E}{\partial w_i} \right|_t = b + 2c\, w_i^{(t)} \\
      \left. \frac{\partial E}{\partial w_i} \right|_{t-1} = b + 2c\, w_i^{(t-1)}
    \end{cases}
    \quad\Rightarrow\quad
    2c = \frac{ \left. \frac{\partial E}{\partial w_i} \right|_t - \left. \frac{\partial E}{\partial w_i} \right|_{t-1} }{ w_i^{(t)} - w_i^{(t-1)} }

At t + 1 the minimum is attained (as assumed):

    \text{minimum} \quad\Rightarrow\quad \left. \frac{\partial E}{\partial w_i} \right|_{t+1} = 0 \quad\Rightarrow\quad w_i^{(t+1)} = -\frac{b}{2c}

From the error gradient at t:

    b = \left. \frac{\partial E}{\partial w_i} \right|_t - 2c\, w_i^{(t)}

and then the weights update formula is:

    \Delta w_i^{(t+1)} = \frac{ \left. \frac{\partial E}{\partial w_i} \right|_t }{ \left. \frac{\partial E}{\partial w_i} \right|_{t-1} - \left. \frac{\partial E}{\partial w_i} \right|_t }\, \Delta w_i^{(t)}

(\Delta w_i^{(t+1)} = w_i^{(t+1)} - w_i^{(t)}, \Delta w_i^{(t)} = w_i^{(t)} - w_i^{(t-1)}).
✍ Remarks:
➥ The algorithm assumes that the parabola has a minimum. If it has a maximum then the weights will be updated in the wrong direction.
➥ In a region of a nearly flat error surface it is necessary to limit the size of the weight updates, otherwise the algorithm may jump over the minimum, especially if it is narrow.
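The per-weight update as a Scilab sketch (g and g_old are the current and previous gradient vectors; the step cap implements the safeguard from the remark above, with an assumed limit dmax, and the small epsilon only guards against division by zero):

    // Quick backpropagation step (elementwise over all weights).
    function dW_new = quickprop_step(g, g_old, dW, dmax)
        dW_new = g ./ (g_old - g + 1e-12) .* dW;     // parabola jump to the minimum
        dW_new = max(min(dW_new, dmax), -dmax);      // limit the step size
    endfunction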
➧ 12.5  Line Search

The main idea of the algorithm is to search for the minimum of the error function along the direction given by its negative gradient, instead of just moving by a fixed amount given by the learning rate.

The weight vector at step t + 1, given the value at step t, is:

    W^{(t+1)} = W^{(t)} - \lambda^{(t)} \nabla E|_{W^{(t)}}

where \lambda^{(t)} is a parameter, chosen such that E(\lambda) = E\left( W^{(t)} - \lambda \nabla E|_{W^{(t)}} \right) is minimum.                    ❖ \lambda^{(t)}

By the above approach the problem is reduced to a unidimensional case. The minimum of E(\lambda) may be found by searching for 3 points W', W'' and W''' such that E(W') > E(W'') and E(W''') > E(W''). Then the minimum is to be found somewhere between W' and W''', and may be located by approximating E with a quadratic polynomial, i.e. a parabola.
✍ Remarks:
➥ Another, less efficient but simpler, way would be to advance along the direction given by -\nabla E in small, fixed steps, and stop when the error begins to increase.
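A compact sketch of the parabolic step: given three bracketing values of \lambda and the corresponding errors, the vertex of the interpolating parabola estimates the minimum (the standard three-point formula):

    // Parabolic interpolation: minimum of the parabola through
    // (l1,E1), (l2,E2), (l3,E3), with E2 < E1 and E2 < E3.
    function lmin = parabola_min(l1, E1, l2, E2, l3, E3)
        num = (l2-l1)^2*(E2-E3) - (l2-l3)^2*(E2-E1);
        den = (l2-l1)*(E2-E3) - (l2-l3)*(E2-E1);
        lmin = l2 - 0.5 * num / den;
    endfunction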
➧ 12.6  Conjugate Gradients

12.6.1  Conjugate Search Directions
The method described in section 12.5 does not give the best direction for the search of the minimum in the weight space. Considering that the line search minimum has been reached at step t + 1, then:

    \frac{\partial E}{\partial \lambda} = \frac{\partial}{\partial \lambda} E\left( W^{(t)} - \lambda \nabla E|_{W^{(t)}} \right) = 0 \quad\Rightarrow\quad \left( \nabla E|_{W^{(t+1)}} \right)^T \left( \nabla E|_{W^{(t)}} \right) = 0

i.e. the gradients from two consecutive steps are orthogonal. See figure 12.6 on the following page. This behavior may lead to more (searching) steps than necessary.

It is possible to continue the search from step t + 1 onwards, such that the component of the gradient parallel with the previous direction (made 0 by the previous minimization step) is preserved, to the lowest order. This means that a series of directions \{d^{(t)}\} have to be found such that:

    \left( \nabla E|_{W^{(t+1)}} \right)^T d^{(t)} = 0   and   \left( \nabla E|_{W^{(t+1)} + \lambda d^{(t+1)}} \right)^T d^{(t)} = 0

12.5 See [Bis95] pp. 263–272.
12.6 See [Bis95] pp. 274–285.
Figure 12.6: Line search learning. The error gradients from two consecutive steps are perpendicular.

and, by developing in series to the lowest order:

    \left[ \nabla E|_{W^{(t+1)}} + \lambda H d^{(t+1)} \right]^T d^{(t)} = 0 \quad\Rightarrow\quad d^{(t+1)T} H d^{(t)} = 0                    (12.13)

where H is the Hessian calculated at the point W^{(t+1)}. Directions which respect condition (12.13) are named conjugate (or non-interfering) directions.                    ❖ conjugate directions
12.6.2  Quadratic Error Function

A quadratic error function is of the form:

    E(W) = E_0 + b^T W + \frac{1}{2} W^T H W ;\quad b, H = \text{const.}                    (12.14)

and H (the Hessian) is symmetrical and positive definite. The error gradient is:

    \nabla E = b + H W

and the minimum of E is achieved at the point W^* where:                    ❖ W^*

    \nabla E|_{W^*} = 0 \quad\Rightarrow\quad b + H W^* = 0                    (12.15)

Let us consider a set \{d_i\}_{i=1,N_W} of conjugate (with respect to H) directions (N_W being the total number of weights):                    ❖ N_W

    d_i^T H d_j = 0   for i \neq j                    (12.16)

and, of course, d_i \neq 0, \forall i.

Proposition 12.6.1. The \{d_i\}_{i=1,N_W} set of conjugate directions is linearly independent.
Proof. Let us consider that there is a direction d_\ell which is a linear combination of the other ones:

    d_\ell = \sum_{i \neq \ell} c_i d_i ,\quad c_i = \text{const.}

Then, multiplying by d_q^T H (q \neq \ell) to the left and using the way the \{d_i\} set was chosen:

    0 = d_q^T H d_\ell = \sum_{i \neq \ell} c_i\, d_q^T H d_i = c_q\, d_q^T H d_q

and as d_q^T H d_q \neq 0, then c_q = 0, \forall q \neq \ell, i.e. d_\ell = 0, which runs counter to the way the \{d_i\} set was chosen.

✍ Remarks:
➥ The above proposition ensures the existence of a set of N_W linearly independent and H-conjugate vectors.
As \{d_i\} contains N_W vectors and is linearly independent, it may be used as a reference system, i.e. any W vector from the weights space may be written as a linear combination of \{d_i\}.

Let us assume that the starting point for the search is W_1 and the minimum point is W^*. Then it may be possible to write:

    W^* - W_1 = \sum_{i=1}^{N_W} \alpha_i d_i                    (12.17)

where \alpha_i are some parameters. Finding W^* may be envisaged as a succession of steps (of length \alpha_i along the directions d_i) of the form:

    W_{i+1} = W_i + \alpha_i d_i                    (12.18)

where i = 1, N_W and W_{N_W+1} = W^*.

By multiplying (12.17) with d_\ell^T H to the left and using (12.15) plus the property (12.16):

    d_\ell^T H (W^* - W_1) = -d_\ell^T b - d_\ell^T H W_1 = \sum_{i=1}^{N_W} \alpha_i\, d_\ell^T H d_i = \alpha_\ell\, d_\ell^T H d_\ell

and then the \alpha_\ell steps are:

    \alpha_\ell = -\frac{d_\ell^T (b + H W_1)}{d_\ell^T H d_\ell}                    (12.19)

The \alpha_\ell coefficients may be put into another form. From (12.18): W_{i+1} = W_1 + \sum_{j=1}^{i} \alpha_j d_j; multiplying with d_{i+1}^T H to the left and using again (12.16):

    d_{i+1}^T H W_{i+1} = d_{i+1}^T H W_1 \quad\Rightarrow\quad d_{i+1}^T \nabla E|_{W_{i+1}} = d_{i+1}^T (b + H W_{i+1}) = d_{i+1}^T (b + H W_1)

and by using this result in (12.19):

    \alpha_\ell = -\frac{d_\ell^T \nabla E|_{W_\ell}}{d_\ell^T H d_\ell}                    (12.20)
Proposition 12.6.2. If the weight vector is updated according to the procedure (12.18), then the gradient of the error function at step i is orthogonal to all the previous conjugate directions:

    d_j^T \nabla E|_{W_i} = 0 ,\quad \forall j, i \text{ such that } j < i \leq N_W                    (12.21)

Proof. From (12.14) and (12.18):

    \nabla E|_{W_{i+1}} - \nabla E|_{W_i} = H (W_{i+1} - W_i) = \alpha_i H d_i                    (12.22)

and, by multiplying to the left with d_i^T and replacing \alpha_i from (12.20), it follows that:

    d_i^T \nabla E|_{W_{i+1}} = 0                    (12.23)

On the other hand, by multiplying (12.22) with d_j^T to the left (\{d_i\} are conjugate directions):

    d_j^T \left( \nabla E|_{W_{i+1}} - \nabla E|_{W_i} \right) = \alpha_i\, d_j^T H d_i = 0 ,\quad \forall j < i \leq N_W                    (12.24)

The equation (12.24) is written for all the instances i, i-1, \ldots, j+1 and then a summation is done over all the equations obtained, which gives:

    d_j^T \left( \nabla E|_{W_{i+1}} - \nabla E|_{W_{j+1}} \right) = 0 ,\quad \forall j < i \leq N_W

and, by using (12.23), finally:

    d_j^T \nabla E|_{W_{i+1}} = 0 ,\quad \forall j < i \leq N_W

which combined with (12.23) (i.e. for j = i) proves the desired result.
The set of conjugate directions \{d_i\} may be built as follows:
1. The first direction is chosen as:

    d_1 = -\nabla E|_{W_1}

2. The following directions are built incrementally as:

    d_{i+1} = -\nabla E|_{W_{i+1}} + \beta_i d_i                    (12.25)

where \beta_i are coefficients to be found such that the newly built d_{i+1} is conjugate with the previous d_i, i.e. d_{i+1}^T H d_i = 0. By multiplying (12.25) with H d_i to the right:                    ❖ \beta_i

    \left( -\nabla E|_{W_{i+1}} + \beta_i d_i \right)^T H d_i = 0 \quad\Rightarrow\quad \beta_i = \frac{ \left( \nabla E|_{W_{i+1}} \right)^T H d_i }{ d_i^T H d_i }                    (12.26)
Proposition 12.6.3. By using the above method for building the set of directions, the error gradient at step i is orthogonal to all the previous ones:

    \left( \nabla E|_{W_j} \right)^T \nabla E|_{W_i} = 0 ,\quad \forall j, i \text{ such that } j < i \leq N_W                    (12.27)

Proof. Obviously, by the way of building, each direction vector represents a linear combination of all the previous gradients, of the form:                    ❖ \gamma_\ell

    d_j = -\nabla E|_{W_j} + \sum_{\ell=1}^{j-1} \gamma_\ell \nabla E|_{W_\ell}                    (12.28)

where \gamma_\ell are the coefficients of the linear combination (their exact values are not relevant to the proof). By multiplying (12.28) with \nabla E|_{W_i} to the right and using the result established in (12.21):

    \left( \nabla E|_{W_j} \right)^T \nabla E|_{W_i} = \sum_{\ell=1}^{j-1} \gamma_\ell \left( \nabla E|_{W_\ell} \right)^T \nabla E|_{W_i} ,\quad \forall j, i \text{ such that } j < i \leq N_W

For j = 1 the error gradient equals the direction d_1 and, by using (12.21), the result (12.27) holds. For j = 2:

    \left( \nabla E|_{W_2} \right)^T \nabla E|_{W_i} = \gamma_1 \left( \nabla E|_{W_1} \right)^T \nabla E|_{W_i} = 0

and so on, as long as j < i, and thus (12.27) is true.
Proposition 12.6.4. The (set of) directions built by the above method are mutually conjugate.

Proof. It will be shown by induction. By construction d_2^T H d_1 = 0, i.e. these directions are conjugate. It is assumed (induction) that:

d_i^T H d_j = 0 \quad \forall i, j \text{ such that } j < i \leq N_W

is true, and it has to be shown that it holds for i+1 as well (assuming i+1 \leq N_W, of course). For d_{i+1}, by using the above assumption and (12.25):

d_{i+1}^T H d_j = -(\nabla E|_{W_{i+1}})^T H d_j + \beta_i \, d_i^T H d_j = -(\nabla E|_{W_{i+1}})^T H d_j \qquad (12.29)

\forall j, i such that j < i \leq N_W (the second term disappears due to the induction assumption).

From (12.22), \nabla E|_{W_{j+1}} - \nabla E|_{W_j} = \alpha_j H d_j. By multiplying this equation with (\nabla E|_{W_{i+1}})^T to the left, and considering (12.27), i.e. (\nabla E|_{W_{i+1}})^T \nabla E|_{W_j} = 0 for \forall j < i+1 \leq N_W, then:

(\nabla E|_{W_{i+1}})^T \left( \nabla E|_{W_{j+1}} - \nabla E|_{W_j} \right) = \alpha_j \, (\nabla E|_{W_{i+1}})^T H d_j = 0

and by inserting this result into (12.29):

d_{i+1}^T H d_j = 0 \quad \forall j, i \text{ such that } j < i \leq N_W

This result is extensible from j < i \leq N_W to j < i+1 \leq N_W because of the way d_{i+1} is built, i.e. d_{i+1}^T H d_i = 0 by design.
✍ Remarks:
➥ The method described in this section gives a very fast converging procedure for finding the error minimum: the number of steps required equals the dimensionality of the weight space. See figure 12.7.
12.6.3 The Algorithm

The previous section gives the general method for quickly finding the minimum of E. However there are two remarks to be made:

• The error function was assumed to be quadratic.
Figure 12.7: Conjugate gradient learning. In a bidimensional weight space it takes just 2 steps to reach the minimum.
• For a non-quadratic error function the Hessian is variable and then it has to be calculated at each point W_i, which results in a very computationally intensive process.

For the general algorithm it is possible to express the \alpha_i and \beta_i coefficients without explicit calculation of the Hessian. Also, while in practice the error function is not quadratic, the conjugate gradient algorithm still gives a good way of finding the error minimum.

There are several ways to express the \beta_i coefficients:

• The Hestenes-Stiefel formula. By replacing (12.22) into (12.26):

\beta_i = \frac{(\nabla E|_{W_{i+1}})^T \left( \nabla E|_{W_{i+1}} - \nabla E|_{W_i} \right)}{d_i^T \left( \nabla E|_{W_{i+1}} - \nabla E|_{W_i} \right)} \qquad (12.30)
• The Polak-Ribiere formula. From (12.25), d_i = -\nabla E|_{W_i} + \beta_{i-1} d_{i-1}, and by making a multiplication with \nabla E|_{W_i} to the right and using (12.21) (i.e. d_{i-1}^T \nabla E|_{W_i} = 0):

d_i^T \nabla E|_{W_i} = -(\nabla E|_{W_i})^T \nabla E|_{W_i} + \beta_{i-1} \, d_{i-1}^T \nabla E|_{W_i} = -(\nabla E|_{W_i})^T \nabla E|_{W_i}

and by using this result, together with (12.21) again, into (12.30), finally:

\beta_i = \frac{(\nabla E|_{W_{i+1}})^T \left( \nabla E|_{W_{i+1}} - \nabla E|_{W_i} \right)}{(\nabla E|_{W_i})^T \nabla E|_{W_i}} \qquad (12.31)

• The Fletcher-Reeves formula. From (12.31) and using (12.27):

\beta_i = \frac{(\nabla E|_{W_{i+1}})^T \nabla E|_{W_{i+1}}}{(\nabla E|_{W_i})^T \nabla E|_{W_i}} \qquad (12.32)

✍ Remarks:
➥ In theory, i.e. for a quadratic error function, (12.30), (12.31) and (12.32) are equivalent. In practice, because the error function is seldom quadratic, they may give different results. Usually the Polak-Ribiere formula gives the best results.
Let us consider a quadratic error as a function of \alpha_i:

E(W_i + \alpha_i d_i) = E_0 + b^T (W_i + \alpha_i d_i) + \frac{1}{2} (W_i + \alpha_i d_i)^T H (W_i + \alpha_i d_i)

The minimum of the error along the direction given by d_i is found by imposing the cancellation of its derivative with respect to \alpha_i:

\frac{\partial E}{\partial \alpha_i} = 0 \;\Rightarrow\; b^T d_i + (W_i + \alpha_i d_i)^T H d_i = 0

and considering the property x^T y = y^T x and the fact that \nabla E = b + H W, then:

\alpha_i = -\frac{d_i^T \nabla E|_{W_i}}{d_i^T H d_i} \qquad (12.33)

The fact that formula (12.33) coincides with expression (12.20) indicates that the procedure of finding these coefficients may be replaced with any procedure for finding the error minimum along the d_i direction.
The general algorithm is:

1. Select a starting point W_1 in the weight space.
2. Calculate \nabla E|_{W_1} and make d_1 = -\nabla E|_{W_1}.
3. For i = 1, ..., (max. value):
(a) Find the minimum of E(W_i + \alpha_i d_i) with respect to \alpha_i and move to the next point W_{i+1} = W_i + \alpha_{i(\text{min.})} d_i.
(b) Evaluate the stop condition. It may be the error dropping under some specified value, a fixed number of steps, etc.
(c) Calculate \nabla E|_{W_{i+1}} and then \beta_i, using one of (12.30), (12.31) or (12.32). Finally calculate the new direction d_{i+1} from (12.25).

(The cycle is to be repeated till the error minimum has been found, or some maximal number of steps has been executed.) A sketch of this procedure is given below.
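The following is a minimal sketch of the above algorithm, not part of the original text. It assumes the user supplies the error function E and its gradient grad_E (e.g. computed by backpropagation); the one-dimensional minimization of step (a) is delegated to scipy.optimize.minimize_scalar, and the Polak-Ribiere formula (12.31) is used for \beta.

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(E, grad_E, w1, max_steps=100, tol=1e-8):
    # w1 : starting point W_1 in the weight space (1-d numpy array)
    w = np.asarray(w1, dtype=float)
    g = grad_E(w)
    d = -g                                    # d_1 = -grad E|_{W_1}
    for _ in range(max_steps):
        # (a) line search: minimum of E(W_i + alpha d_i) with respect to alpha
        alpha = minimize_scalar(lambda a: E(w + a * d)).x
        w = w + alpha * d                     # step (12.18)
        g_new = grad_E(w)
        # (b) stop condition: gradient (almost) zero
        if np.linalg.norm(g_new) < tol:
            break
        # (c) Polak-Ribiere coefficient (12.31) and new direction (12.25)
        beta = g_new @ (g_new - g) / (g @ g)
        d = -g_new + beta * d
        g = g_new
    return w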
12.6.4 Scaled Conjugate Gradients

The line search algorithm may have the following drawbacks:
• it may involve several evaluations of the error function;
• it may be sensitive to the accuracy of the \alpha_i determination.
For a general network, however, it is possible to calculate the product between the Hessian and a vector, e.g. H d_i in the conjugate gradient algorithm, in just O(W) steps⁸. However there is a possibility that the Hessian is not positive definite, meaning that the denominator d_\ell^T H d_\ell of (12.20) may be negative, thus driving an increase of the error.

The Hessian may be made positive definite by performing a translation of the form H \to H + \lambda I, where I is the unit matrix and \lambda is a sufficiently large coefficient. The formula (12.20) then becomes:

\alpha_i = -\frac{d_i^T \nabla E|_{W_i}}{d_i^T H|_{W_i} d_i + \lambda_i \|d_i\|^2} \qquad (12.34)

H|_{W_i} being the Hessian calculated at the point W_i and \lambda_i being the required value of \lambda to make the Hessian positive definite.

8 See chapter "Multi layer neural networks".
The problem is to choose the \lambda_i parameter. One way is to start from some value (which may be 0 as well) and to increase it till the denominator of (12.34) becomes positive. Let us consider the denominator:

\delta_i = d_i^T H d_i + \lambda_i \|d_i\|^2

If \delta_i > 0 then it can be used; otherwise (\delta_i < 0) the \lambda_i parameter is increased to a new value \lambda_i' such that the new value of the denominator \delta_i' > 0:

\delta_i' = \delta_i + (\lambda_i' - \lambda_i) \|d_i\|^2 > 0 \;\Rightarrow\; \lambda_i' > \lambda_i - \frac{\delta_i}{\|d_i\|^2}

An interesting value to choose for \lambda_i' is \lambda_i' = 2 \left( \lambda_i - \frac{\delta_i}{\|d_i\|^2} \right), which gives:

\delta_i' = -\delta_i + \lambda_i \|d_i\|^2 = -d_i^T H d_i
✍ Remarks:
➥ If the error is quadratic then \lambda_i = 0. In the regions of the weight space where the error is badly approximated by a quadratic, \lambda_i has to be increased; in the regions where the error is closer to the quadratic approximation, \lambda_i may be decreased.
➥ One way to measure the degree of quadratic approximation is:

\Delta_i = \frac{E(W_i) - E(W_i + \alpha_i d_i)}{E(W_i) - E_Q(W_i + \alpha_i d_i)} \qquad (12.35)

where E_Q is the local quadratic approximation:

E_Q(W_i + \alpha_i d_i) = E(W_i) + \alpha_i \, d_i^T \nabla E|_{W_i} + \frac{1}{2} \alpha_i^2 \, d_i^T H|_{W_i} d_i

and, by replacing into (12.35) and using (12.20):

\Delta_i = \frac{2 \left[ E(W_i) - E(W_i + \alpha_i d_i) \right]}{-\alpha_i \, d_i^T \nabla E|_{W_i}}

➥ In practice the usual values used are:

\lambda_{i+1} = \lambda_i / 2 \quad \text{if } \Delta_i > 0.75
\lambda_{i+1} = \lambda_i \quad \text{if } 0.25 \leq \Delta_i \leq 0.75 \qquad (12.36)
\lambda_{i+1} = 4 \lambda_i \quad \text{if } \Delta_i < 0.25

with the supplemental rule of not updating the weights if \Delta_i < 0, but just increasing the value of \lambda_i and reevaluating \Delta_i.
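As an illustration (not from the original text), the \lambda update of (12.35)-(12.36) may be sketched as follows; E_W, E_new and E_Q_new are assumed to be already computed values of the true error at W_i, the true error at the proposed point, and the quadratic-model error at the proposed point.

def update_lambda(lam, E_W, E_new, E_Q_new):
    # comparison parameter (12.35): how well the quadratic model predicts
    delta = (E_W - E_new) / (E_W - E_Q_new)
    if delta < 0:
        # bad step: do not update the weights, increase lambda and retry
        return 4.0 * lam, False
    if delta > 0.75:                 # very good quadratic approximation
        lam = lam / 2.0
    elif delta < 0.25:               # poor quadratic approximation
        lam = 4.0 * lam
    return lam, True                 # True: the step W + alpha*d is accepted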
Figure 12.8: The Newton direction -H^{-1} \nabla E points directly to the error minimum point W^*, while -\nabla E generally does not.

➧ 12.7 Newton's Method

For a local quadratic approximation around the minimum (12.2) of the error function, the gradient at some point W is \nabla E = H|_{W^*} (W - W^*) and then the minimum point is at:

W^* = W - H^{-1}|_{W^*} \nabla E \qquad (12.37)

The vector -H^{-1} \nabla E, named the Newton direction, points from the point W directly to the minimum W^*. See figure 12.8.
There are several points to be considered regarding Newton's method:

• Since it is just an approximation, this method may require several steps to be performed to reach the real minimum.
• The exact evaluation of the Hessian is computationally intensive, of order O(P W^2), P being the number of training vectors. The computation of the inverse Hessian H^{-1} is even more computationally intensive, i.e. O(W^3).
• The Newton direction may also point to a saddle point or a maximum, so checks should be made. This occurs if the Hessian is not positive definite.
• The point given by (12.37) may be outside the range of the quadratic approximation, leading to instabilities in the learning process.

If the Hessian is positive definite then the Newton direction points towards a decrease of the error; considering the derivative of the error along the Newton direction:

\frac{\partial}{\partial \alpha} E(W - \alpha H^{-1} \nabla E) = -\left( H^{-1} \nabla E \right)^T \nabla E = -(\nabla E)^T H^{-1} \nabla E < 0

(the matrix property (AB)^T = B^T A^T was used here, as well as the fact that the Hessian is symmetric).

If the Hessian is not positive definite then an approach similar to that used in section 12.6.4 may be used, i.e. H \to H + \lambda I. This represents a compromise between gradient descent and Newton's direction:

• For \lambda \searrow 0 \Rightarrow H + \lambda I \simeq H, i.e. the new direction is close to Newton's direction.
• For \lambda \gg 0 \Rightarrow H + \lambda I \simeq \lambda I, and then -(\lambda I)^{-1} \nabla E = -\frac{1}{\lambda} \nabla E, i.e. the new direction is close to -\nabla E.

12.7 See [Bis95] pp. 285-287.
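A minimal sketch of a safeguarded Newton step follows (not part of the original text; it assumes the full gradient and Hessian are available as numpy arrays, and simply shifts the Hessian spectrum when it is not positive definite):

import numpy as np

def newton_step(w, grad, hess, lam=0.0):
    H = hess(w)
    g = grad(w)
    n = len(w)
    # make H + lambda*I positive definite by increasing lambda if needed
    while True:
        try:
            np.linalg.cholesky(H + lam * np.eye(n))  # fails if not pos. def.
            break
        except np.linalg.LinAlgError:
            lam = max(2.0 * lam, 1e-4)
    # solve (H + lambda I) p = -grad E instead of forming the inverse Hessian
    p = np.linalg.solve(H + lam * np.eye(n), -g)
    return w + p, lam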
➧ 12.8 Levenberg-Marquardt Algorithm

This algorithm is specifically designed for a "sum-of-squares" error function. Let \varepsilon_p be the error given by the p-th training pattern vector and \varepsilon = (\varepsilon_1 \; \ldots \; \varepsilon_P)^T. The error function is then:

E = \frac{1}{2} \sum_{p=1}^{P} (\varepsilon_p)^2 = \frac{1}{2} \|\varepsilon\|^2 \qquad (12.38)

Let us consider the following matrix (denoted here by Z):

Z = \begin{pmatrix} \frac{\partial \varepsilon_1}{\partial W_1} & \cdots & \frac{\partial \varepsilon_1}{\partial W_{N_W}} \\ \vdots & \ddots & \vdots \\ \frac{\partial \varepsilon_P}{\partial W_1} & \cdots & \frac{\partial \varepsilon_P}{\partial W_{N_W}} \end{pmatrix}

Then, considering a small variation in the W weights from step t to t+1, the error vector \varepsilon may be developed in a Taylor series to the first order:

\varepsilon^{(t+1)} = \varepsilon^{(t)} + Z \left( W^{(t+1)} - W^{(t)} \right)

and the error function at step t+1 is:

E^{(t+1)} = \frac{1}{2} \left\| \varepsilon^{(t)} + Z \left( W^{(t+1)} - W^{(t)} \right) \right\|^2 \qquad (12.39)
Minimizing (12.39) with respect to W^{(t+1)} means:

\varepsilon^{(t)} + Z \left( W^{(t+1)} - W^{(t)} \right) = 0

Z is not square, so first a multiplication with Z^T to the left is performed and then a multiplication by (Z^T Z)^{-1}, again to the left, which finally gives:

W^{(t+1)} = W^{(t)} - (Z^T Z)^{-1} Z^T \varepsilon^{(t)} \qquad (12.40)

which represents the core of the Levenberg-Marquardt weights update formula.
From (12.38) the Hessian matrix is:

\{H\}_{ij} = \frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{P} \left( \frac{\partial \varepsilon_p}{\partial w_i} \frac{\partial \varepsilon_p}{\partial w_j} + \varepsilon_p \frac{\partial^2 \varepsilon_p}{\partial w_i \partial w_j} \right)

and by neglecting the second order derivatives the Hessian may be approximated as:

H \simeq Z^T Z

i.e. the equation (12.40) essentially involves the inverse Hessian. However this is done through the computation of the error gradient with respect to the weights, which may be efficiently done by the backpropagation algorithm.

12.8 See [Bis95] pp. 290-292.
One problem should be taken care of: the formula (12.40) may give large values for \Delta W, i.e. so large that the approximation (12.39) no longer applies. To avoid this situation the following changed error function may be used instead:

E^{(t+1)} = \frac{1}{2} \left\| \varepsilon^{(t)} + Z \left( W^{(t+1)} - W^{(t)} \right) \right\|^2 + \lambda \left\| W^{(t+1)} - W^{(t)} \right\|^2

where \lambda is a parameter governing the size of \Delta W. By the same means as for (12.40), the new update formula becomes:

W^{(t+1)} = W^{(t)} - (Z^T Z + \lambda I)^{-1} Z^T \varepsilon^{(t)} \qquad (12.41)
For \lambda \searrow 0, (12.41) approaches the Newton formula; for \lambda \gg 0, (12.41) approaches the gradient descent formula.

✍ Remarks:
➥ For sufficiently large values of \lambda the error function is "guaranteed" to decrease, since the direction of change is opposite to the gradient and the step is proportional to 1/\lambda.
➥ The Levenberg-Marquardt algorithm is of model trust region type. The model (the linearized approximation of the error function) is "trusted" just around the current point W, the size of the region being defined by \lambda.
➥ Practical values for \lambda are: \lambda \simeq 0.1 at start; then, if the error decreases, divide \lambda by 10; if the error increases, go back (restore the old value of W, i.e. undo the changes), multiply \lambda by 10 and try again.
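A minimal sketch of this update scheme follows (not from the original text; residuals(w) is assumed to return the error vector \varepsilon and jacobian(w) the matrix Z, e.g. both computed via backpropagation):

import numpy as np

def levenberg_marquardt(residuals, jacobian, w, lam=0.1, n_iter=50):
    E = 0.5 * np.sum(residuals(w) ** 2)
    for _ in range(n_iter):
        eps = residuals(w)
        Z = jacobian(w)
        # update (12.41): W <- W - (Z^T Z + lambda I)^{-1} Z^T eps
        A = Z.T @ Z + lam * np.eye(len(w))
        w_new = w - np.linalg.solve(A, Z.T @ eps)
        E_new = 0.5 * np.sum(residuals(w_new) ** 2)
        if E_new < E:                  # error decreased: trust the model more
            w, E, lam = w_new, E_new, lam / 10.0
        else:                          # undo the step, trust the model less
            lam = lam * 10.0
    return w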
CHAPTER 13

Feature Extraction

➧ 13.1 Pre/Post-processing
Usually the raw data is not fed directly into the ANN but rather processed beforehand. The preprocessing has the following advantages:
• it allows for dimensionality reduction and thus avoids the curse of dimensionality¹;
• it may include some prior knowledge, i.e. information additional to the training set.
The ANN outputs are also postprocessed to give the required output format. The pre- and post-processing may take any form, e.g. a statistical processing, some fixed transformation and/or even a processing involving another ANN.

The most important forms of preprocessing are:
• dimensionality reduction: it allows for building smaller ANNs, with fewer parameters, better suited to learn/generalize;
• feature extraction: it involves making some combinations of the original training vector components, called features; the process of calculating the features is named feature extraction.
Usually both processes go together: by dropping some vector components, those more "feature rich" will automatically be kept and, reciprocally, the number of features extracted is usually much smaller than the dimension of the original pattern vector.

The preprocessing techniques described above will always drive to some loss of information. However the gain in accuracy of the neural computation outweighs this loss (of course, assuming that some care has been taken in the preprocessing phase). The main difficulty here is to find the right balance between the informational loss and the neural processing gain.
13.1 See [Bis95] pp. 295-298.
1 See the "Pattern Recognition" chapter, regarding the curse of dimensionality.
➧ 13.2 Input Normalization

One useful transformation is to scale the inputs such that they are all of the same order of magnitude. For each component x_i of the input vector x, the mean \langle x_i \rangle and variance \sigma_i^2 are calculated:

\langle x_i \rangle = \frac{1}{P} \sum_{p=1}^{P} x_{ip} \quad , \quad \sigma_i^2 = \frac{1}{P-1} \sum_{p=1}^{P} \left( x_{ip} - \langle x_i \rangle \right)^2

and then the following transformation is applied:

\widetilde{x}_{ip} = \frac{x_{ip} - \langle x_i \rangle}{\sigma_i} \qquad (13.1)
where the newly introduced patterns \widetilde{x}_{ip} have zero mean and variance one:

\langle \widetilde{x}_i \rangle = \frac{1}{P} \sum_{p=1}^{P} \widetilde{x}_{ip} = 0 \quad , \quad \widetilde{\sigma}_i^2 = \frac{1}{P-1} \sum_{p=1}^{P} \left( \widetilde{x}_{ip} - \langle \widetilde{x}_i \rangle \right)^2 = \frac{1}{P-1} \sum_{p=1}^{P} \widetilde{x}_{ip}^2 = 1

A similar transformation may be applied to the target pattern vectors.
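A minimal sketch of (13.1) on a P x N matrix of training vectors (illustration only, not from the original text):

import numpy as np

def normalize_inputs(X):
    # X: P x N matrix, one training vector per row
    mean = X.mean(axis=0)                  # <x_i>
    std = X.std(axis=0, ddof=1)            # sigma_i, with the 1/(P-1) convention
    return (X - mean) / std, mean, std

The returned mean and std should be kept and reused to transform any validation/test data with the same (13.1) mapping.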
✍ Remarks:
➥ While the input normalization performed in (13.1) could be done in the first layer of the neural network, this preprocessing makes easier the initial choice of weights and thus learning: if the inputs and outputs are of order unity then the weights should also be of the same order of magnitude.

Another, more sophisticated, transformation is whitening. The input training pattern set has the mean \langle x \rangle and covariance matrix \Sigma:

\langle x \rangle = \frac{1}{P} \sum_{p=1}^{P} x_p \quad , \quad \Sigma = \frac{1}{P-1} \sum_{p=1}^{P} \left( x_p - \langle x \rangle \right) \left( x_p - \langle x \rangle \right)^T
Considering the eigenvectors {u_i} and eigenvalues \lambda_i of the covariance matrix, \Sigma u_i = \lambda_i u_i, the following transformation is performed:

\widetilde{x}_p = \Lambda^{-1/2} U^T \left( x_p - \langle x \rangle \right)

where

U = \begin{pmatrix} u_{11} & \cdots & u_{1N} \\ \vdots & \ddots & \vdots \\ u_{N1} & \cdots & u_{NN} \end{pmatrix} \quad , \quad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_N \end{pmatrix}

u_{ij} being the component i of u_j (i.e. the columns of U are the eigenvectors) and \{\Lambda^{-1/2}\}_{ij} = \frac{1}{\sqrt{\lambda_i}} \delta_{ij}. Because \Sigma is symmetric, it is possible to build an orthonormal {u_i} set²:

u_i^T u_j = \delta_{ij} \;\Rightarrow\; U^T U = I

13.2 See [Bis95] pp. 298-300.
2 See mathematical appendix.
Figure 13.1: The whitening process. The new distribution {\widetilde{x}_p} has spherical symmetry and is centered in the origin, in the eigenvector system of coordinates.
The mean of the transformed pattern vectors is zero and their covariance matrix is the unit matrix:

\langle \widetilde{x} \rangle = \frac{1}{P} \sum_{p=1}^{P} \widetilde{x}_p = 0 \qquad (13.2)

\widetilde{\Sigma} = \frac{1}{P-1} \sum_{p=1}^{P} \left( \widetilde{x}_p - \langle \widetilde{x} \rangle \right) \left( \widetilde{x}_p - \langle \widetilde{x} \rangle \right)^T = \frac{1}{P-1} \sum_{p=1}^{P} \widetilde{x}_p \widetilde{x}_p^T = \Lambda^{-1/2} U^T \Sigma U \left( \Lambda^{-1/2} \right)^T = \Lambda^{-1/2} \Lambda \left( \Lambda^{-1/2} \right)^T = I

(\Sigma U = U \Lambda may be checked by direct multiplication, so U^T \Sigma U = \Lambda; because of the diagonal nature of the \Lambda^{\pm 1/2} matrices, they equal their transposes.)

The result (13.2) shows that in the system of coordinates given by {u_i} the transformed distribution of the \widetilde{x}_p is centered in the origin and has spherical symmetry. See figure 13.1.
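A small illustration of the whitening transform follows (a sketch under assumptions, not from the original text; eps guards against near-zero eigenvalues):

import numpy as np

def whiten(X, eps=0.0):
    # X: P x N matrix, one pattern vector per row
    mean = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)            # uses the 1/(P-1) convention
    lam, U = np.linalg.eigh(Sigma)             # Sigma is symmetric
    W = U / np.sqrt(lam + eps)                 # columns u_i / sqrt(lambda_i)
    return (X - mean) @ W                      # rows: Lambda^{-1/2} U^T (x - <x>)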
✍ Remarks:
➥ The discussion in this section was around input vectors with a continuous spectrum. For discrete input values the above methods may be inappropriate; in these cases one possibility would be to use a one-of-k encoding scheme similar to that used in classification.

➧ 13.3 Missing Data

The "missing data" problem appears when some components of some/all training vectors are unknown.

13.3 See [Bis95] pp. 301-302.
Several approaches may be taken to deal with the problem:

• Discarding data. If the process responsible for the missing data is independent of the data set and there is a (relatively) large quantity of data, then the incomplete pattern vectors may be discarded from the training set. This is the simplest approach.
• Filling in. The missing data may be filled in with values calculated using various approaches:
→ Use the means over the corresponding data for which the values are known. However this approach is too simplistic and usually gives poor results.
→ Calculate the missing values by a regression method over the known data. The drawback is that the regression function generated is noise-free and thus it underestimates the covariance in the data.
→ Use the EM (expectation-maximisation) algorithm³, where the missing data may be treated as mixture components.
→ The general approach: integrate over the corresponding variables by weighting with the corresponding distribution. This involves the fact that the distribution itself is modeled.
Let us consider that, from the pattern vector x, one part x_{(k)} is known and the rest x_{(m)} is missing. Using the Bayes theorem, the posterior probabilities of class C_k given x_{(k)}, respectively given the complete vector (x_{(k)}, x_{(m)}), are:

P(C_k | x_{(k)}) = \frac{p(x_{(k)} | C_k) \, P(C_k)}{p(x_{(k)})} \quad , \quad P(C_k | x_{(k)}, x_{(m)}) = \frac{p(x_{(k)}, x_{(m)} | C_k) \, P(C_k)}{p(x_{(k)}, x_{(m)})}

Using the above equations, the posterior probability of class C_k, while knowing only x_{(k)}, may be expressed as:

P(C_k | x_{(k)}) = \frac{p(x_{(k)} | C_k) \, P(C_k)}{p(x_{(k)})} = \frac{P(C_k)}{p(x_{(k)})} \int_{X_{(m)}} p(x_{(k)}, x_{(m)} | C_k) \, dx_{(m)} = \frac{1}{p(x_{(k)})} \int_{X_{(m)}} P(C_k | x_{(k)}, x_{(m)}) \, p(x_{(k)}, x_{(m)}) \, dx_{(m)}

➧ 13.4 Time Series Prediction
The time series prediction problem involves the prediction of a pattern vector x(\tau) based on the knowledge of the previous behavior of the variable.

The following approaches are common:

• The static method. It is assumed that the statistical properties of the data generator do not evolve in time. The pattern vector is sampled in time, at regular intervals, resulting in a series of values, i.e. time is converted to a discrete form: ..., x_{\tau-2}, x_{\tau-1}, x_\tau, ... The training set is built by using one value as output and some n previous values as inputs, e.g. x_\tau is used as output and x_{\tau-n}, ..., x_{\tau-1} as inputs; then x_{\tau+1} is used as output and x_{\tau+1-n}, ..., x_\tau as inputs, etc. (a sketch of this windowing is given below). The first predicted value may then be used as an input to make the second prediction, and so on.
• Detrending. The time series may have a simple trend, e.g. increasing or decreasing in time. This may lead to a poor prediction over time. However, by fitting a simple curve to the data and then removing the value predicted by this simple model, the data are detrended. Only the more complex model (assumed constant in time) remains.
• The dynamical approach. This involves a method which allows for retraining in time and adaptation to the data generator as it changes in time.

3 Described in the "Pattern Recognition" chapter.
13.4 See [Bis95] pp. 302-304.
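The sliding-window construction used by the static method may be sketched as follows (illustration only, not from the original text):

import numpy as np

def make_windows(series, n):
    # series: 1-d array ..., x_{t-2}, x_{t-1}, x_t, ...
    # returns inputs (x_{t-n}, ..., x_{t-1}) and outputs x_t
    X = np.array([series[t - n:t] for t in range(n, len(series))])
    y = np.array([series[t] for t in range(n, len(series))])
    return X, y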
➧ 13.5 Feature Selection

Usually, only a subset of the full feature set is used in the training process of an ANN. As there may be many combinations of features, a selection criterion has to be applied in order to find the most fitted subset of features (the individual input vector components may be seen as features as well).

One procedure of selection is to train the network on different sets of features and then to test for the generalization achieved on a set of pattern vectors not used at the learning stage. As the training and testing may be time consuming on a complex model, an alternative is to use a simplified one for this purpose.

It is to be expected, due to the curse of dimensionality, that there is some optimal minimal set of features which gives the best generalization. For fewer features there is too little information, and for more features the curse of dimensionality intervenes.

On the other hand, usually a criterion J used in class separability increases with the number of features:

J(X^+) \geq J(X) \quad \text{if} \quad X \subset X^+ \qquad (13.3)

e.g. the Mahalanobis distance⁴. This means that this kind of criterion can't be used directly to compare the results given by two feature sets of different sizes.

Assuming that there are N features, there are 2^N possible subsets (2 because a feature may be either present or absent from the subset; the whole set may also be considered a subset). Considering that the subset is required to have exactly \widetilde{N} features, the number of possible combinations is still \frac{N!}{(N - \widetilde{N})! \, \widetilde{N}!}. In principle, an exhaustive search through all possible combinations should be made in order to find the best subset of features.

13.5 See [Bis95] pp. 304-310.
4 Defined in chapter "Pattern Recognition".
Figure 13.2: The branch and bound decision tree, built for a particular case of 5 features. If at one level a feature, e.g. 1, cannot be eliminated then the whole subtree, marked with black dots, is removed from the search.
In practice even a small number of features will generate a huge number of subset combinations, too big to be fully checked. There are several methods which avoid evaluating the whole set of feature combinations.

The branch and bound method

This method gives the optimal solution, based on a criterion for which (13.3) is true.

Let us assume that there are N features and \widetilde{N} features to be selected, i.e. there are N - \widetilde{N} features to be dropped. A top-down decision tree is built. It starts with one node (the root) and has N - \widetilde{N} levels (not counting the root itself). At each level one feature is dropped, such that at the bottom there are only \widetilde{N} left. See figure 13.2.

Note that, as the number of features to remain is \widetilde{N}, the first level has only N - \widetilde{N} nodes. It doesn't matter which features are present on this level, as the order of elimination is not important. For the same reason, a feature eliminated at one level does not appear in the sub-trees of other eliminated features.

The elimination works as follows:

• A random combination of \widetilde{N} features is selected, i.e. one point from the bottom of the decision tree. The corresponding class-separability criterion J(X_0) is calculated. See (13.3).
• Then one feature is eliminated at a time, going from the top to the bottom of the tree. The criterion J(X) is calculated at each level.
• If at some level J(X) > J(X_0) then the search continues; when the bottom of the tree is reached this way, the new feature combination is better than the old one and becomes the new level of comparison, i.e. J(X) \to J(X_0).
• Otherwise (J(X) < J(X_0)), if the node is not at the bottom of the tree, the whole subtree is eliminated from the search as being suboptimal: by (13.3) the criterion can only decrease below this node, so a feature which shouldn't have been eliminated just was, and all combinations which contain the elimination of that feature are not considered further.

The sequential search

This method may give suboptimal solutions but is faster than the previous one. It is based on considering one feature at a time. There are two ways of selection (the forward variant is sketched after this list):

• At each step one feature, the one which gives the biggest J(X) criterion, is added to the (initially empty) set. The method is named sequential forward selection.
• At each step the least important feature, the one which gives the smallest decrease of the J(X) criterion, is eliminated from the (initially full) set. The method is named sequential backward elimination.
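A minimal sketch of sequential forward selection (not from the original text; J is assumed to be a user-supplied class-separability criterion taking a list of feature indices):

def forward_selection(J, n_features, n_select):
    selected = []
    remaining = list(range(n_features))
    while len(selected) < n_select:
        # add the feature giving the biggest criterion value
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected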
➧ 13.6 Dimensionality Reduction

This procedure tries to reduce the dimensionality of the input data by combining the input components, using an unsupervised learning process.
13.6.1 Principal Component Analysis

The problem is to map a set of input vectors {x_p}, which are N-dimensional, into a set of corresponding vectors {z_p} which have a lower dimensionality K < N.

The input vector x may be represented by means of an orthonormal set of vectors {u_i}:

x = \sum_{i=1}^{N} z_i u_i \;\Rightarrow\; z_i = u_i^T x \quad \text{where} \quad u_i^T u_j = \delta_{ij}

A transformation x \to \widetilde{x} is performed as follows: from the set {z_i} only K are retained (e.g. the first ones) and the others are replaced with constants b_i:

\widetilde{x} = \sum_{i=1}^{K} z_i u_i + \sum_{i=K+1}^{N} b_i u_i \quad , \quad b_i = \text{const.}

which practically represents a dimensionality reduction from N to K.

The problem is now to select the best K components from x. This can be done by trying to minimize the error when switching from x to \widetilde{x}, i.e. the difference between the two vectors:

x - \widetilde{x} = \sum_{i=K+1}^{N} (z_i - b_i) u_i

13.6 See [Bis95] pp. 310-316.
and for a set of P input vectors:

E_K = \frac{1}{2} \sum_{p=1}^{P} \| x_p - \widetilde{x}_p \|^2 = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=K+1}^{N} (z_{ip} - b_i)^2 \qquad (13.4)

From the condition of minimum of E_K with respect to b_i:

\frac{\partial E_K}{\partial b_i} = 0 \;\Rightarrow\; b_i = \frac{1}{P} \sum_{p=1}^{P} z_{ip} = u_i^T \langle x \rangle

where \langle x \rangle = \frac{1}{P} \sum_{p=1}^{P} x_p is the mean input vector. Then the error (13.4) may be written as (use the (AB)^T = B^T A^T matrix property):

E_K = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=K+1}^{N} \left[ u_i^T (x_p - \langle x \rangle) \right]^2 = \frac{1}{2} \sum_{i=K+1}^{N} u_i^T \Sigma u_i \qquad (13.5)

where \Sigma is the covariance matrix of the input vectors {x_p}:

\Sigma = \sum_{p=1}^{P} (x_p - \langle x \rangle)(x_p - \langle x \rangle)^T \qquad (13.6)
The minimum of E_K with respect to {u_i} occurs when this set is made from the eigenvectors of the covariance matrix⁵:

\Sigma u_i = \lambda_i u_i \qquad (13.7)

\lambda_i being the eigenvalues. By replacing (13.7) into (13.5) and using the orthonormality of the {u_i}, the error becomes:

E_K = \frac{1}{2} \sum_{i=K+1}^{N} \lambda_i \qquad (13.8)

and is minimized by choosing the smallest N - K eigenvalues.

Let us consider the change of coordinates, from whatever system {x_p} was represented in to the one defined by the eigenvectors {u_i}, together with a translation of the origin of the new system to \langle x \rangle, i.e. in the new system \langle x \rangle = 0; this means that each x_p may be represented as a linear combination of the {u_i}:

x_p = \sum_{i=1}^{N} \alpha_{ip} u_i

and by replacing in (13.5), considering also the orthonormality of the {u_i}: E_K = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=K+1}^{N} \alpha_{ip}^2, i.e. E_K is a quadratic form. From (13.8) it follows that the surfaces E_K = const. are hyper-ellipses with axes proportional to 1/\sqrt{\lambda_i}, and the dimensionality reduction is done by

5 See the mathematical appendix.
Figure 13.3: The constant-error surfaces are hyper-ellipses with axes proportional to 1/\sqrt{\lambda_i}, centered on \langle x \rangle. The dimensionality reduction is done by projecting the data points (black dots) on the axes corresponding to the largest eigenvalues \lambda_i, in this bidimensional case on u_1.

dropping those dimensions corresponding to the smallest spread of the data, i.e. by making a projection on the axes corresponding to the largest \lambda_i, representing the largest spread of the data points. See figure 13.3.
✍ Remarks:
➥ The method described above is also known as the Karhunen-Loeve transformation. The u_i are named principal components.
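A minimal sketch of this procedure (not part of the original text): project the centered data on the K eigenvectors of the covariance matrix having the largest eigenvalues.

import numpy as np

def pca_reduce(X, K):
    # X: P x N matrix of input vectors; returns the P x K reduced vectors
    mean = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    lam, U = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
    U_K = U[:, -K:]                     # K principal components (largest lambda)
    return (X - mean) @ U_K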
13.6.2 Non-linear Dimensionality Reduction Through ANN

The principal component analysis performed in the previous section may reduce the dimensionality only through a linear process. Sometimes the data have an intrinsic dimensionality which cannot be retrieved by linear methods. See figure 13.4.

Auto-associative ANNs may be used to perform dimensionality reduction. The input patterns are used also as targets (hence the "auto-associative" name), but the network has a "bottleneck" built into the hidden layers. See figure 13.5. The hidden layers have fewer neurons than the input and output layers, thus forcing the network to "squeeze" the data through and achieve a dimensionality reduction; the output of the bottleneck hidden layer represents the dimensionally reduced input data.

✍ Remarks:
➥ The error function used is usually the sum-of-squares.
➥ If the network contains only one hidden layer then the transformation performed is linear, i.e. equivalent to principal component analysis. Unless it is "hardware" implemented, there is no reason to use single-hidden-layer ANNs to perform dimensionality reduction.
Figure 13.4: Sometimes the data have an intrinsic dimensionality which cannot be revealed by a linear transformation. In this case the data points are distributed on a circular shape and could be described by one dimension, e.g. the angle measured from a reference point.

Figure 13.5: The use of auto-associative ANNs for dimensionality reduction: input layer, non-linear transformation, layer Z (where the dimensionality reduction is achieved), non-linear transformation, output layer. The network has a bottleneck, i.e. the hidden layers have fewer neurons than the input and output layers.
➥ The same design may be used for data compression.

➧ 13.7 Invariance

In some applications of ANNs the output should be invariant to some transformations of the input, e.g. the classification of the shape of an object should be invariant to its position/rotation.

The most straightforward approach, and also the most inefficient, is to include into the training set as many examples of the transformed input as possible. This requires a large amount of training data and gives poor results for transformations not well covered by the learning set. Various more effective alternatives have been developed.
13.7.1 The Tangent Prop Method

This method is applicable for continuous transformations, e.g. translations and rotations, but not mirroring.

Let us assume that the transformation of a vector x leads to a vector s and is governed by one scalar parameter \alpha, e.g. a rotation is defined by the angle parameter. Then s = s(x, \alpha); let M be the reunion of all s vectors for a given x and all possible values of \alpha. Also the "natural" condition s(x, 0) = x is imposed.

Let \tau be the vector defined as:

\tau = \left. \frac{\partial s(x, \alpha)}{\partial \alpha} \right|_{\alpha = 0}
The change in the network output, due to the transformation defined by a small \alpha, is:

\frac{\partial y_j}{\partial \alpha} = \sum_{i=1}^{N} \frac{\partial y_j}{\partial x_i} \frac{\partial x_i}{\partial \alpha} = \sum_{i=1}^{N} J_{ji} \tau_i \qquad (13.9)

where J is the Jacobian⁶:

J = \left( \frac{\partial y_j}{\partial x_i} \right)_{\substack{j = 1..K \\ i = 1..N}} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_K}{\partial x_1} & \cdots & \frac{\partial y_K}{\partial x_N} \end{pmatrix}
If the network mapping function is invariant to the transformation in the vicinity of each pattern vector x_p then the term (13.9) tends towards zero, and it may be used to build a regularization function⁷ of the form:

\Omega = \frac{1}{2} \sum_{p=1}^{P} \sum_{j=1}^{K} \left( \sum_{i=1}^{N} J_{jip} \tau_{ip} \right)^2

where J_{jip} and \tau_{ip} refer to the input vector x_p from the training set \{x_p\}_{p=1..P}.

13.7 See [Bis95] pp. 319-329.
6 The Jacobian is defined in the "Multi layer neural networks" chapter.
7 See chapter "Pattern recognition".
The new error function becomes \widetilde{E} = E + \nu \Omega, where \nu is a regularization parameter determining the influence of \Omega.

✍ Remarks:
➥ In practice the derivative defining \tau is found by means of a finite difference between x and s, obtained for a (relatively) small \alpha.

13.7.2 Preprocessing

The invariant features may be extracted at the preprocessing stage. These features are named moments.
Let us consider one particular component x of the input vector, described in some system of coordinates, and let {u_i} be the coordinates of x; e.g. x may be a point of a bidimensional image and {u_1, u_2} may be the Cartesian coordinates, the point being either "lit" (x = 1) or not (x = 0) depending upon the coordinates (u_1, u_2).

The moment M is defined as:

M = \int_{U_1} \cdots \int_{U_N} x(u_1, \ldots, u_N) \, K(u_1, \ldots, u_N) \, du_1 \ldots du_N

where K(u_1, ..., u_N) is named the kernel function and determines the moments to be considered. For discrete spaces the integrals change to sums:

M = \sum_{i_1} \cdots \sum_{i_N} x(u_{1(i_1)}, \ldots, u_{N(i_N)}) \, K(u_{1(i_1)}, \ldots, u_{N(i_N)}) \, \Delta u_{1(i_1)} \ldots \Delta u_{N(i_N)}

(the values being taken at the points i_1, ..., i_N).

One of the possible kernel functions is the power function, which gives rise to the regular moments M^r_{i_1, \ldots, i_N}:

M^r_{i_1, \ldots, i_N} = \int_{U_1} \cdots \int_{U_N} x(u_1, \ldots, u_N) \, u_1^{i_1} \ldots u_N^{i_N} \, du_1 \ldots du_N

and by defining \bar{u}_i = M^r_{0, \ldots, 1, \ldots, 0} / M^r_{0, \ldots, 0} (the 1 in the i-th position) then:

M^c_{i_1, \ldots, i_N} = \int_{U_1} \cdots \int_{U_N} x(u_1, \ldots, u_N) \, (u_1 - \bar{u}_1)^{i_1} \ldots (u_N - \bar{u}_N)^{i_N} \, du_1 \ldots du_N \qquad (13.10)

which is named the central moment.
The central moment, defined in (13.10), is invariant to translations, i.e. to transformations of the type x(u_1, \ldots, u_N) \to x(u_1 + \Delta u_1, \ldots, u_N + \Delta u_N), provided that the edge effects are negligible or the U_i domains are infinite.
Proof. Indeed, the moment calculated for the new pattern is:

\widetilde{M}^c_{i_1, \ldots, i_N} = \int_{U_1} \cdots \int_{U_N} x(u_1 + \Delta u_1, \ldots, u_N + \Delta u_N) \, (u_1 - \bar{u}_1)^{i_1} \ldots (u_N - \bar{u}_N)^{i_N} \, du_1 \ldots du_N \qquad (13.11)

and by making the change of variable u_i + \Delta u_i = \widetilde{u}_i (so that du_i = d\widetilde{u}_i, and also \bar{\widetilde{u}}_i = \bar{u}_i + \Delta u_i), and by replacing in (13.11):

\widetilde{M}^c_{i_1, \ldots, i_N} = \int_{U_1} \cdots \int_{U_N} x(\widetilde{u}_1, \ldots, \widetilde{u}_N) \, (\widetilde{u}_1 - \bar{\widetilde{u}}_1)^{i_1} \ldots (\widetilde{u}_N - \bar{\widetilde{u}}_N)^{i_N} \, d\widetilde{u}_1 \ldots d\widetilde{u}_N = M^c_{i_1, \ldots, i_N}
A moment \mu_{i_1, \ldots, i_N} invariant to uniform scaling, i.e. x(u_1, \ldots, u_N) \to x(\alpha u_1, \ldots, \alpha u_N), where \alpha is the scaling parameter, may be built as:

\mu_{i_1, \ldots, i_N} = \frac{M^c_{i_1, \ldots, i_N}}{\left( M^c_{0, \ldots, 0} \right)^{1 + (i_1 + \cdots + i_N)/N}} \qquad (13.12)

Because the \mu_{i_1, \ldots, i_N} moment is built using only the central moments M^c_{i_1, \ldots, i_N}, it is automatically invariant to translations as well.
Proof. Let us consider a scaling defined by \alpha. Similarly as for translation:

\widetilde{M}^c_{i_1, \ldots, i_N} = \int_{U_1} \cdots \int_{U_N} x(\alpha u_1, \ldots, \alpha u_N) \, (u_1 - \bar{u}_1)^{i_1} \ldots (u_N - \bar{u}_N)^{i_N} \, du_1 \ldots du_N
= \frac{1}{\alpha^{N + i_1 + \cdots + i_N}} \int_{U_1} \cdots \int_{U_N} x(\alpha u_1, \ldots, \alpha u_N) \, (\alpha u_1 - \alpha \bar{u}_1)^{i_1} \ldots (\alpha u_N - \alpha \bar{u}_N)^{i_N} \, d(\alpha u_1) \ldots d(\alpha u_N)
= \frac{1}{\alpha^{N + i_1 + \cdots + i_N}} \int_{U_1} \cdots \int_{U_N} x(\widetilde{u}_1, \ldots, \widetilde{u}_N) \, (\widetilde{u}_1 - \bar{\widetilde{u}}_1)^{i_1} \ldots (\widetilde{u}_N - \bar{\widetilde{u}}_N)^{i_N} \, d\widetilde{u}_1 \ldots d\widetilde{u}_N
= \frac{M^c_{i_1, \ldots, i_N}}{\alpha^{N + i_1 + \cdots + i_N}} \qquad (13.13)

By the same means, for \widetilde{M}^c_{0, \ldots, 0}, where M^c_{0, \ldots, 0} = \int_{U_1} \cdots \int_{U_N} x(u_1, \ldots, u_N) \, du_1 \ldots du_N, it follows that:

\widetilde{M}^c_{0, \ldots, 0} = \frac{M^c_{0, \ldots, 0}}{\alpha^N} \qquad (13.14)

Finally, by using (13.13) and (13.14) into (13.12), it follows that \widetilde{\mu}_{i_1, \ldots, i_N} = \mu_{i_1, \ldots, i_N}, i.e. the moment is invariant to scaling as well.
It is possible also to build a moment which is independent of rotation. First the M moment is rewritten in generalized spherical coordinates⁸:

M_{i_1, \ldots, i_N} = \int_0^{\infty} \int_0^{2\pi} \cdots \int_0^{2\pi} x(r, \theta_1, \ldots, \theta_N) \, (r \cos\theta_1)^{i_1} \ldots (r \cos\theta_N)^{i_N} \, r \, dr \, d\theta_1 \ldots d\theta_N

As \sum_{i=1}^{N} \cos^2 \theta_i = 1, the moment:

M_R = M_{2,0,\ldots,0} + M_{0,2,0,\ldots,0} + \cdots + M_{0,\ldots,0,2}

is rotation independent, and thus the moment:

\mu_R = \mu_{2,0,\ldots,0} + \mu_{0,2,0,\ldots,0} + \cdots + \mu_{0,\ldots,0,2}

is independent of translation, scaling and rotation.

8 See mathematical appendix.

Figure 13.6: The shared weights method for bidimensional pattern vectors. Region A from the first layer activates neuron 1 from the second layer; region B activates neuron 2. The weights from A to 1, respectively from B to 2, are the same, i.e. the same input pattern in A, respectively B, will give the same activation in 1, respectively 2. Layers 1, 2, 3, ... are in decreasing size order.
✍ Remarks:
➥ There are two potential problems when using moments: one is that their computation may be intensive, and the second is that some information may be lost during preprocessing.
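A minimal numerical sketch follows (not from the original text): the central moments (13.10) and the translation- and scale-invariant moments (13.12) for a bidimensional image stored as a numpy array.

import numpy as np

def central_moment(img, i1, i2):
    # M^c_{i1 i2} of (13.10) for a bidimensional "image" x(u1, u2)
    u1, u2 = np.indices(img.shape)
    m00 = img.sum()
    u1bar = (img * u1).sum() / m00          # centroid coordinates bar{u}_i
    u2bar = (img * u2).sum() / m00
    return (img * (u1 - u1bar) ** i1 * (u2 - u2bar) ** i2).sum()

def mu(img, i1, i2):
    # invariant moment (13.12) with N = 2
    return central_moment(img, i1, i2) / central_moment(img, 0, 0) ** (1 + (i1 + i2) / 2)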
13.7.3 Shared Weights

This technique uses specially built ANNs to allow for a relative translation invariance.

✍ Remarks:
➥ This technique is useful in image processing and bears some resemblance to mammalian vision.

Considering a bidimensional input pattern, the network is built such that a layer receives excitatory input only from a small area of the previous layer. By having (sharing) the same weights, such areas send the same excitatory input to the next layer when presented with the same input pattern. See figure 13.6.
13.7.4 Higher-order ANNs

A higher-order network has a neuronal activation of the form⁹:

z_j = f \left( w_{j0} + \sum_{i=1}^{N} w_{ji} x_i + \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} w_{j i_1 i_2} x_{i_1} x_{i_2} + \ldots \right)

9 See chapter "Multi Layer Neural Networks".
Figure 13.7: Translation in bidimensional space. t represents the translation vector. The pattern value x_{i_1} is replaced by x_{i_1'}, respectively x_{i_2} by x_{i_2'}.
where neuron z_j receives input from the x_i, w_{j0} is the bias and f is the activation function.

By using second-order neural networks, i.e. of the form:

z_j = f \left( \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} w_{j i_1 i_2} x_{i_1} x_{i_2} \right) \qquad (13.15)

it is possible to build a network whose output is translation invariant for a bidimensional input pattern.

Considering a translation of a pattern in a bidimensional space, in the place i_1 occupied previously by x_{i_1} there will now be an input x_{i_1'} which has come from i_1'; the same happens for the second point i_2. See figure 13.7.

To keep the network output (13.15) the same, it is necessary to impose the condition:

w_{j i_1 i_2} = w_{j i_1' i_2'} \qquad (13.16)

for each pair of points {(i_1, i_2); (i_1', i_2')} which may be correlated by a translation.
✍ Remarks:
➥ The condition (13.16) greatly reduces the number of independent weights (such that a second order neural network becomes more manageable).
➥ One layer of a second order neural network is sufficient to extract the translation-invariant pattern information.
➥ Higher order networks may be used for more complex invariant information extraction, e.g. a third order network may be used to extract information invariant to translation, uniform scaling and rotation, by correlating 3 points.
CHAPTER 14

Learning Optimization

➧ 14.1 The Bias-Variance Tradeoff
Let p(t|x) be the probability density of the target t, conditioned on the input x. The conditional average of the target is \langle t|x \rangle = \int_Y t \, p(t|x) \, dt and the conditional average of the square of the targets is \langle t^2|x \rangle = \int_Y t^2 \, p(t|x) \, dt.

For an infinite training set the sum-of-squares error function may be written¹, considering only one output, as:

E = \frac{1}{2} \int_X \left[ y(x) - \langle t|x \rangle \right]^2 p(x) \, dx + \frac{1}{2} \int_X \left[ \langle t^2|x \rangle - \langle t|x \rangle^2 \right] p(x) \, dx \qquad (14.1)

The second term in (14.1) is independent of y(x), and thus independent of the weights; it represents an intrinsic noise which sets the minimum of E. The first term may be minimized to 0, in which case y(x) = \langle t|x \rangle.

Finite training data sets S, considered as containing P training pattern vectors (each), cannot cover all the possibilities for input/output, and then there will always be some difference between y(x) (after training) and \langle t|x \rangle (obtained considering an infinite set).

The integrand of the first term in (14.1), i.e. [y(x) - \langle t|x \rangle]^2, gives a measure of how close the actual mapping y(x) is to the desired target. To avoid the dependency on a particular training set, the expectation (average) E_S is taken as a measure of the mapping:

E_S\{ [y(x) - \langle t|x \rangle]^2 \}

the average being done over the whole ensemble of training sets.

14.1 See [Bis95] pp. 332-338.
1 See chapter "Error Functions", section "Significance of network output".
Definition 14.1.1. The bias represents the difference between the average of the network mapping y(x) and the data generator, i.e. \langle t|x \rangle:

\text{bias} \equiv E_S\{y(x)\} - \langle t|x \rangle

The average bias over the input x is defined as:

(\text{average bias})^2 \equiv \frac{1}{2} \int_X \left[ E_S\{y(x)\} - \langle t|x \rangle \right]^2 p(x) \, dx
For a perfect model the bias would be 0 (as E_S\{y(x)\} = \langle t|x \rangle). Non-zero bias arises from two causes:

• the difference between the function created by the model (e.g. the ANN) and the true function which generated the data: this is the "true" bias;
• the variance due to the data sets: a particular sensitivity to some training sets.

Definition 14.1.2. The variance represents the average sensitivity of the network mapping y(x) to a particular training set:

\text{variance} \equiv E_S\{ [y(x) - E_S\{y(x)\}]^2 \}

The average variance over the input x is defined as:

\text{average variance} \equiv \frac{1}{2} \int_X E_S\{ [y(x) - E_S\{y(x)\}]^2 \} \, p(x) \, dx
Let us consider the measure of mapping [y(x) - \langle t|x \rangle]^2:

[y(x) - \langle t|x \rangle]^2 = [y(x) - E_S\{y(x)\} + E_S\{y(x)\} - \langle t|x \rangle]^2
= [y(x) - E_S\{y(x)\}]^2 + [E_S\{y(x)\} - \langle t|x \rangle]^2 + 2 [y(x) - E_S\{y(x)\}][E_S\{y(x)\} - \langle t|x \rangle]

When taking the average over the above equation, the third term cancels (under the average, y(x) \to E_S\{y(x)\}, so the first factor vanishes) and then:

E_S\{ [y(x) - \langle t|x \rangle]^2 \} = E_S\{ [y(x) - E_S\{y(x)\}]^2 \} + [E_S\{y(x)\} - \langle t|x \rangle]^2 = \text{variance} + (\text{bias})^2
To see the meaning of bias and variance, let us consider the targets as being generated from a function h(x) to which some random noise \varepsilon, with zero mean, has been added:

t_p = h(x_p) + \varepsilon_p \qquad (14.2)

and the optimal mapping is then \langle t|x \rangle = h(x). There are two extreme possibilities for the choice of the mapping y(x):

• The mapping is built to be some function g(x), completely independent of the data set. Then the variance is zero, since E_S\{y(x)\} = g(x) = y(x). However the bias may be very high (unless some other prior knowledge has been used to build g(x)).
• The mapping is built to fit the data perfectly. Then the bias is zero, since E_S\{y(x)\} = E\{h(x) + \varepsilon\} = h(x) = \langle t|x \rangle. However the variance is:

E_S\{ [y(x) - E_S\{y(x)\}]^2 \} = E_S\{ [y(x) - h(x)]^2 \} = E_S\{\varepsilon^2\}

and the variance of \varepsilon may be high.

The above discussion reveals that there is a tradeoff between the bias and the variance and, in practice, a balance between the two should be found.

One way to reduce the bias and the variance at the same time is to increase the complexity of the model together with using larger training sets, i.e. the size of the training set determines the size of the model such that the optimal balance between bias and variance is found (note however that this approach runs into the curse of dimensionality and thus the model cannot be increased indefinitely).

Another way to reduce bias and variance at the same time is to use the prior knowledge (if any) about the data generator (14.2) when building the model.
➧ 14.2 Regularization

The error function E may be changed by adding a regularization term²:

\widetilde{E} = E + \nu \Omega \qquad (14.3)

where \Omega is a function which takes a small value for smooth mapping functions y(x) and a large value otherwise, i.e. \Omega is small for simple models and large for complex models, thus encouraging less complex models. The parameter \nu controls the influence of \Omega.

Thus the regularization \Omega and the error E counterbalance each other (as the error generally increases in simpler models) in the process of minimizing the changed error function \widetilde{E} during the learning process.
14.2.1 Weight Decay

In the case of an over-fitted model the mapping will have areas of large curvature, which require large values of the weights³.

14.2 See [Bis95] pp. 338-346.
2 See chapter "Radial Basis Function Networks".
3 See chapter "Parameter optimization".
By encouraging the weights to be small, the error hyper-surface is kept relatively smooth. This may be achieved by using a regularization of the form:

\Omega = \frac{1}{2} \sum_i w_i^2 \qquad (14.4)

Many ANN training algorithms make use of the error gradient. From (14.3) and (14.4):

\nabla \widetilde{E} = \nabla E + \nu W \qquad (14.5)

(W being seen as a vector).

Considering just the part generated by the regularization term, the weight update formula in the gradient descent learning method⁴, in the continuous time limit, is:

\frac{dW}{d\tau} = -\eta \nabla \widetilde{E} = -\eta \nu W \quad (\nabla E \text{ neglected})

(where \eta is the learning parameter), which has a solution of the form:

W(\tau) = W(0) \, e^{-\eta \nu \tau}

i.e. the regularization term (14.4) makes the weights decay toward 0, following an exponential rule.
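As an illustration (not from the original text), a gradient descent step including the weight decay term of (14.5) may be sketched as:

import numpy as np

def sgd_step_with_weight_decay(w, grad_E, eta=0.01, nu=1e-4):
    # gradient of the regularized error (14.5): grad E + nu * W
    return w - eta * (grad_E(w) + nu * w)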
Considering a quadratic error function⁵ of the form:

E(W) = E_0 + b^T W + \frac{1}{2} W^T H W \qquad (14.6)

where H is a symmetric matrix (it is in fact the Hessian), the minimum of E (from \nabla E = 0) is at \widehat{W} given by:

\nabla E = b + H \widehat{W} = 0 \qquad (14.7)

Similarly, the minimum of \widetilde{E} (from \nabla \widetilde{E} = 0) occurs at \widetilde{W} given by:

\nabla \widetilde{E} = b + H \widetilde{W} + \nu \widetilde{W} = 0 \qquad (14.8)

(from (14.5) and (14.7)).

Let {u_i} be an orthonormal system of eigenvectors of H such that:

H u_i = \lambda_i u_i \quad , \quad u_i^T u_j = \delta_{ij} \qquad (14.9)

where \lambda_i are the eigenvalues of H (such a construction is possible due to the fact that H is symmetric, see the mathematical appendix).

Let us consider that the system of coordinates in the weight space is rotated such that it becomes parallel with the {u_i}. Then \widehat{W} and \widetilde{W} may be written as:

\widehat{W} = \sum_i \widehat{w}_i u_i \quad , \quad \widetilde{W} = \sum_i \widetilde{w}_i u_i

4 See chapter "Single Layer Neural Networks".
5 See chapter "Parameter Optimization".
Figure 14.1: The hyper-surfaces E = const. are hyper-ellipses having axes proportional to 1/\sqrt{\lambda_i}. The distance between \widehat{W} and \widetilde{W} is smaller along the axes corresponding to larger eigenvalues \lambda_i of H.

From (14.7) and (14.8), and considering the orthonormality of the {u_i}, it follows that:

\widetilde{w}_i = \frac{\lambda_i}{\lambda_i + \nu} \widehat{w}_i

which means that \lambda_i \gg \nu \Rightarrow \widetilde{w}_i \simeq \widehat{w}_i, while \lambda_i \ll \nu \Rightarrow |\widetilde{w}_i| \ll |\widehat{w}_i|. As the surfaces E = const. are hyper-ellipses which have axes proportional to⁶ 1/\sqrt{\lambda_i}, this means that \widetilde{W} is closer to \widehat{W} along the u_i directions which correspond to larger \lambda_i. See figure 14.1.

6 See chapter "Parameter Optimization".
14.2.2 Linear Transformation And Weight Decay

Let us consider a multi-layer perceptron network having one hidden layer and one linear output layer. Then for the hidden layer z_j = f\left( w_{j0} + \sum_i w_{ji} x_i \right) and for the output layer y_k = w_{k0} + \sum_j w_{kj} z_j.

Considering a linear transformation of the inputs, of the form x_i \to \widetilde{x}_i = a x_i + b, where a, b = const., it is possible to get the same outputs by changing the weights of the hidden layer as:

w_{ji} \to \widetilde{w}_{ji} = \frac{1}{a} w_{ji} \quad , \quad w_{j0} \to \widetilde{w}_{j0} = w_{j0} - \frac{b}{a} \sum_i w_{ji}

(may be checked directly: z_j doesn't change).
Figure 14.2: Training and validation error versus time. The error given by the validation set increases from some point onwards. At that point the network is optimally fitted; beyond it is over-fitted, before it is under-fitted.
Similarly, for a transformation of the output layer y_k \to \widetilde{y}_k = c y_k + d, where c, d = const., the weight changes of the output layer are:

w_{kj} \to \widetilde{w}_{kj} = c w_{kj} \quad , \quad w_{k0} \to \widetilde{w}_{k0} = c w_{k0} + d

By training two identical networks, one with the original data and one with transformed inputs and/or outputs, the trained networks should be equivalent, with weights changed by the linear transformations as shown above. The regularization term (14.4) does not satisfy this requirement. However a weight decay regularization term of the form:

\Omega = \frac{\nu_1}{2} \sum_{\text{hidden layer}} w^2 + \frac{\nu_2}{2} \sum_{\text{output layer}} w^2

satisfies the invariance to linear transformations, provided that suitable rescaling is performed on \nu_1 and \nu_2.
14.2.3 Early Stopping

From the whole set of data available for training, usually some part is put aside and not used in training, for the purpose of validation, i.e. checking the generalization capability of the network with some data unused during the learning process.

While the error function always decreases for the training set during learning, the error for the validation set decreases at the beginning and then, later, begins to increase as the network becomes over-fitted. See figure 14.2.

The network is optimally fitted at the point where the validation set gives the lowest error; at this point training should be stopped, as the generalization capabilities are best, even if further training would lower the error with respect to the training set. A sketch of this procedure follows.
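A minimal sketch of early stopping (not from the original text; train_epoch and validation_error are assumed to be supplied by the surrounding training code):

import copy

def train_with_early_stopping(net, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_err = float("inf")
    best_net = copy.deepcopy(net)
    worse = 0
    for epoch in range(max_epochs):
        train_epoch(net)                   # one pass of weight updates
        err = validation_error(net)
        if err < best_err:                 # validation error still decreasing
            best_err, best_net, worse = err, copy.deepcopy(net), 0
        else:                              # possible start of over-fitting
            worse += 1
            if worse >= patience:
                break
    return best_net                        # weights at the early-stop point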
A justification of the early stop method may be given by means of the weight decay procedure. As the weights adapt during learning, the weight vector W follows a path which leads close to \widetilde{W} before reaching \widehat{W}. See figure 14.3 and section 14.2.1.

Figure 14.3: The "early stop" method gives a similar result as the "weight decay" procedure, as it involves stopping on the learning path (marked with an arrow) before reaching \widehat{W}. The particular form of the learning path may be explained by the faster advancement along directions with greater \nabla E, i.e. smaller \lambda_i. See also figure 14.1.
14.2.4 Curvature Smoothing

As over-complex neural models lead to network mappings with large curvature, a direct approach is to build a regularization term which increases with the curvature. As the second derivatives are a measure of the curvature, the regularization should be of the form:

\Omega = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N} \sum_{k=1}^{K} \left( \frac{\partial^2 y_k(x_p)}{\partial x_{ip}^2} \right)^2

N respectively K being the sizes of the input respectively output vectors.
14.2.5 Choosing The Weight Decay Hyperparameter

Considering the weight decay regularization (14.4), the prior probability density of the weights is usually chosen as a Gaussian of the form p(w) \propto \exp(-\nu \Omega), which has the variance \sigma = 1/\sqrt{\nu}.

Considering the logistic activation function⁷ f(x) = \frac{1}{1 + e^{-x}}, this one is saturated around |x| = 3, i.e. f(-3) \simeq 0.04 and f(3) \simeq 0.95 (very close to the lower 0 and upper 1 asymptotes). As the reason (among others) for weight decay regularization is to prevent the saturation of the neuronal outputs, the variance of the total input should be around 2. For a small number of inputs in the range [0, 1] a reasonable \sigma should be between 1 and 10; e.g. if the middle value 5 is considered, this corresponds to \nu \simeq 0.04.

In practice it is a good thing to have some neurons saturated (this corresponding to a sort of pruning⁸), so the base range for \nu is between 0.001 and 0.1. Experience indicates that a multiplication of \nu by up to 5 times is not critical to the learning process.

14.2.5 See [Rip96] pg. 163.
7 See also chapter "Pattern Recognition".
multiplication of up to 5 times is not critical to the learning process.
➧ 14.3
❖ ", pe(")
Adding Noise
Another way to achieve better generalization is to add random noise to the training set
before inputting it to the network.
Let consider a random vector " which is being added to the input x and have probability
density pe(").
In the absence of noise, the error function is9:
K Z Z
1X
E=
[yk (x) ; tk ]2 p(tk jx) p(x) dx dtk
2 k=1
Y X
Considering an in nite number of training vectors, each containing an added noise term,
then
K Z Z Z
X
1
e
E=
[yk (x + ") ; tk ]2 p(tk jx) p(x) pe(") dx dtk d"
(14.10)
2 k=1
Y X "
A reasonable assumption is to consider the noise suciently small as to allow for the expansion of yi (x + ") in a Taylor series, neglecting the third order and above terms:
N @y
N X
N
X
@ 2 yk
1X
k
"i
"i "j
+
+ O("3 ) (14.11)
yk (x + ") = yk (x) +
@xi "=0
2
@xi @xj "=0
i=1
i=1 j=1
It is also reasonable to assume that the noise have zero mean and uncorrelated components:
Z
()
"i pe " d"
= 0 and
"
❖
ZZ
()
"i "j p
e " d"
= ij
(14.12)
"
where is the variance of noise.
By using (14.11) in (14.10) and integrating over " with the aid of (14.12), the error function
may be written as:
e
E
See also section 14.5.
See [Bis95] pp. 346{349.
9
See chapter \Error Functions".
8
14.3
=E+
14.4. SOFT WEIGHT SHARING
where
261
becomes a regularization parameter of the form:
K X
N Z
X
= 12
k=1 i=1
Z (
Y X
@yk
@xi
2
2
+ 12 [yk (x) ; tk ] @@xy2k
i
)
p(tk jx) p(x) dx dtk
(14.13)
where the noise do not appear anymore (normalization of p(") was also used).
The network mapping which minimize the sum-of-squares error function E is10 yk = htk jxi.
Thus the network mapping which minimize the regularized error function Ee, see (14.3),
should be of the form yk = htk jxi + O( ). Then the second term in (14.13):
K X
N Z
1X
4 k=1 i=1 [yk (x) ; htk jxi]
X
@yk
p(x) dx
@xi
(the integration over Y have been performed) cancels to the lowest order of , at the minima
of error Ee . This property makes the computation of second order derivatives of yk (a task
computationally intensive) avoidable, thus leaving the regularization function of the form:
N X
K Z
X
1
=2
i=1 k=1
X
@yk
@xi
2
p(x) dx
(14.14)
or, for a discrete series of training vectors:
P X
N X
K
X
= 21P
p=1 i=1 k=1
➧ 14.4
@yk
@xi xp
!2
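As an illustration (not from the original text), training with added input noise, the practical counterpart of (14.10), may be sketched as:

import numpy as np

def noisy_batches(X, T, sigma, n_epochs, rng=np.random.default_rng(0)):
    # yields, at each epoch, a copy of the inputs with fresh zero-mean,
    # uncorrelated Gaussian noise of variance sigma^2 added (cf. (14.12))
    for _ in range(n_epochs):
        yield X + sigma * rng.standard_normal(X.shape), T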
➧ 14.4 Soft Weight Sharing

This method encourages groups of weights to have similar values¹¹ by using some regularization links. Soft weight sharing relates to the weight decay method. Considering a Gaussian distribution for the weights, of the form:

p(w) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{w^2}{2} \right)

then the likelihood¹² of the whole set of weights is:

L = \prod_{i=1}^{N_W} p(w_i) = \frac{1}{(2\pi)^{N_W/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{N_W} w_i^2 \right) \qquad (14.15)

N_W being the total number of weights. In the same way as the negative logarithm of a likelihood gives the error function¹³, the negative logarithm of the likelihood (14.15) gives the regularization function (14.4) (up to an additive constant, which plays no role in a learning process).

10 See chapter "Error functions".
14.4 See [Bis95] pp. 349-353.
11 A hard sharing method is discussed in the chapter "Feature Extraction".
12 See chapter "Pattern Recognition".
It is possible to encourage the weights to group together by using a mixture¹⁴ of Gaussian distributions:

p(w) = \sum_{m=1}^{M} \alpha_m p_m(w)

where M is the number of mixture components, \alpha_m are the mixing coefficients and p_m(w) are the mixture components of the (assumed) Gaussian form:

p_m(w) = \frac{1}{\sqrt{2\pi \sigma_m^2}} \exp\left( -\frac{(w - \mu_m)^2}{2 \sigma_m^2} \right) \qquad (14.16)

\mu_m being the mean and \sigma_m being the standard deviation.

✍ Remarks:
➥ In the mixture model \alpha_m = P(m) is the prior probability of the mixture component m.

Then the regularization function is:

\Omega = -\ln L = -\sum_{i=1}^{N_W} \ln p(w_i) = -\sum_{i=1}^{N_W} \ln \left( \sum_{m=1}^{M} \alpha_m p_m(w_i) \right) \qquad (14.17)
m=1
The regularized error function:
Ee = E +
(14.18)
have to be optimized with respect to weights wi and parameters m , m and m .
✍
Remarks:
➥
❖ m
An optimization with respect to is not possible as it will lead to = 0 and
Ee ! E , i.e. the network will end up by being over- tted. See section 14.2.3.
The posterior probability of mixture component m is:
m pm (w )
m (w) P
M
n pn (w)
(14.19)
n=1
From (14.18), (14.17), (14.19) and (14.16), the error derivatives with respect to wi are:
M
wi ; m
@ Ee = @E + X
@wi @wi m=1 m (wi ) m2
13
14
See chapter \Error Functions".
See also the chapter \Pattern recognition" regarding the mixture models.
14.4. SOFT WEIGHT SHARING
263
which shows that the weights wi are pulled towards the distribution centers m with with
\forces" proportional to the posterior probabilities m .
Similarly the error derivatives with respect to m are:
@ Ee
@m
=
NW
X
;w
m (wi ) m 2 i
m
i=1
which shows that m are pulled towards the sum of all weights, weighted by m .
Finally, the error derivatives with respect to m are:
@ Ee
@m
✍
=
NW
X
i=1
m (wi )
1
m
;
wi ; m )2
3
m
(
Remarks:
➥
In practice the m parameters are taken of the form m = exp(m ) and optimization is performed with respect to m . This ensures that m remains strictly
positive.
As m plays the role of a prior probability it have to follow the probability constrains
M
P
m 2 [0; 1] and
m = 1. This is best done by using the softmax function method:
m=1
exp( m )
m= P
M
exp( n )
n=1
) @@
m
n
=
mn n ; m n
Then the derivative of Ee with respect to m is (by similar ways as for previous derivatives
and using the normalization
@ Ee
@ m
✍
M
X
@ Ee @ n
=
n=1 @ n @ m
Remarks:
➥
=
M
P
n=1
M
X
n=1
"
n = 1):
;
NW (w )
X
n i
i=1
n
!
(
nm m ; n m )
#
=
NW
X
i=1
[
m ; m (wi )]
In practice care should be taken when initializing weights and related parameters.
One way of doing it is to initialize weights with values over a nite interval, then
divide the interval into M subintervals and assigning one to each pm (wi ), i.e.
equal m , m centered into the respective subinterval and m equal to the width
of the subinterval. This method of initialization ensures a partial soft sharing and
allows better learning from the training set.
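A small numerical sketch of (14.17) and (14.19) follows (illustration only, not from the original text; alpha, mu, sigma are length-M arrays and w is the weight vector):

import numpy as np

def responsibilities(w, alpha, mu, sigma):
    # posterior gamma_m(w_i) of (14.19) for every weight/component pair
    w = np.asarray(w)[:, None]          # N_W x 1
    p = np.exp(-(w - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    g = alpha * p                       # N_W x M, numerators alpha_m p_m(w_i)
    return g / g.sum(axis=1, keepdims=True)

def soft_sharing_penalty(w, alpha, mu, sigma):
    # regularization function (14.17): -sum_i ln sum_m alpha_m p_m(w_i)
    w = np.asarray(w)[:, None]
    p = np.exp(-(w - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return -np.log((alpha * p).sum(axis=1)).sum()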
➧ 14.5 Growing And Pruning Methods

The network architecture may play a significant role in the final generalization performance. To allow for the finding of the best architecture, e.g. the number of hidden layers and the number of neurons per hidden layer, two general ways (which avoid an extensive search) may be used:

• The growing method: the starting network is small and then neurons are added one at a time till some optimization criteria have been satisfied.
• The pruning method: the starting network is big and then neurons are removed one at a time till some optimization criteria have been satisfied.
14.5.1 Cascade Correlation

Cascade correlation is of the growing method type. The network starts with a single fully connected layer, where all inputs are linked to the outputs. At each stage, after the network has been trained for an empirical amount of time, a new hidden neuron is added, such that it receives input from all inputs and all previously added (hidden) neurons and sends its output to all output neurons. See figure 14.4.

The weights from the inputs and all previously added neurons to the hidden neuron currently being added are calculated in one step, and after insertion they remain fixed; e.g., in figure 14.4, the weights on link ➀ are calculated prior to the insertion of z_1, then the weights on links ➂ and ➃ are calculated prior to the insertion of z_2, and so on.

The weights from inputs to outputs and from all hidden neurons to outputs remain variable, to be further adapted; e.g., in figure 14.4, only the weights on the main (input → output) link and those on links ➁ and ➄ will continue to adapt during further learning.

✍ Remarks:
➥ By the above architecture the network is similar to one having just one "active" layer (the output), which receives input from the input neurons and all hidden (added) neurons.

The "fixed" weights of the new (in the process of being inserted) neuron z are computed as follows:
calculated rst:
➥
"k = y k
❖ "kp
❖ wi
❖ zp , hz i
; tk
;
1
h"k i = P
X
P
p=1
"kp
("kp being the error "k for input vector xp from the training set).
The weights wi of all inputs to the new neuron z are initialized | weights for links
from all inputs and all previous inserted hidden neurons.
The output of the new neuron z and the mean output over all training set hz i will be:
14.5 See [Bis95] pp. 353-364.
Figure 14.4: The cascade correlation method. The network architecture starts with the input and output layers fully interconnected (the thick arrow shows that all output neurons receive all components of the input vector). Then the first hidden neuron $z_1$ is added and the connections ➀ ($z_1$ receives all inputs) and ➁ ($z_1$ sends its output to all output neurons) are established. Later, when $z_2$ is added, connections ➂ ($z_2$ receives input from $z_1$), ➃ and ➄ are also added. And so on.
\[ z_p = f\left( \sum_{\text{all inputs}} w_i x_{ip} \right) \ , \qquad \langle z \rangle = \frac{1}{P} \sum_{p=1}^{P} z_p \]
where the sum in $z_p$ runs also over the previously inserted hidden neurons ($f$ being the activation function and $z_p$ being the new neuron's output for input vector $x_p$).
The weights are optimized by maximizing the correlation, i.e. the covariance, $S$ between the output of the new neuron to be inserted and the residual actual error of the network output (before the insertion takes place), similar to the Fisher discriminant15:
\[ S = \sum_{k=1}^{K} \left| \sum_{p=1}^{P} (z_p - \langle z \rangle)(\varepsilon_{kp} - \langle \varepsilon_k \rangle) \right| \]
The maximization of $S$ is performed by using the derivative with respect to $w_i$:
\[ \frac{\partial S}{\partial w_i} = \sum_{k=1}^{K} \pm \sum_{p=1}^{P} f'\, x_{ip}\, (\varepsilon_{kp} - \langle \varepsilon_k \rangle) \]
where the sign is given by the sign of the expression inside the modulus in $S$.

15 See chapter "Single Layer Neural Networks".
The maximisation may be done using the above derivative in a way similar to the
methods used in backpropagation.
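As a small NumPy sketch of this candidate-unit criterion (the names `candidate_correlation` and `grad_S` are illustrative only, and the shapes are assumptions):

import numpy as np

def candidate_correlation(z, eps):
    """Covariance criterion S and the signs of its modulus terms.
    z: (P,) candidate outputs; eps: (P, K) residual output errors."""
    zc = z - z.mean()                     # z_p - <z>
    ec = eps - eps.mean(axis=0)           # eps_kp - <eps_k>
    cov = zc @ ec                         # inner sums over p, shape (K,)
    return np.abs(cov).sum(), np.sign(cov)

def grad_S(x, f_prime, eps, signs):
    """dS/dw_i = sum_k sign_k sum_p f'(a_p) x_ip (eps_kp - <eps_k>).
    x: (P, n_in) inputs (incl. previous hidden units); f_prime: (P,)."""
    ec = eps - eps.mean(axis=0)                     # (P, K)
    weighted = (ec * signs).sum(axis=1) * f_prime   # (P,)
    return x.T @ weighted                           # (n_in,)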
14.5.2 Pruning Techniques

Saliency of weights

The pruning technique involves starting with a (relatively) large network, training it, then removing some neuronal connections, then training again, and repeating the training/pruning process.

To decide which network links are less important, and thus susceptible to removal, it is necessary to assess the relative importance of the weights; this measure is named saliency.
Optimal Brain Damage

Let us consider a small variation $\delta E$ of the error function $E$ due to a small variation $\delta w_i$ of the weights. By developing in series and keeping only the first two terms ($N_W$ is the total number of weights):
\[ \delta E = \sum_{i=1}^{N_W} \frac{\partial E}{\partial w_i}\, \delta w_i + \frac{1}{2} \sum_{i=1}^{N_W} \sum_{j=1}^{N_W} H_{ij}\, \delta w_i\, \delta w_j + \mathcal{O}(\delta w^3) \]
where $H$ is the Hessian, whose elements are $H_{ij} = \frac{\partial^2 E}{\partial w_i \partial w_j}$.

At a minimum of $E$ its first derivatives become zero. Assuming that the Hessian may be approximated by setting all non-diagonal elements to zero (this representing the main assumption of this technique), then:
\[ \delta E = \frac{1}{2} \sum_{i=1}^{N_W} H_{ii}\, (\delta w_i)^2 \]
To remove a neuronal connection is equivalent to making its corresponding weight equal to zero:
\[ w_i = 0 \quad\Rightarrow\quad \delta w_i = -w_i \tag{14.20} \]
and then a measure of the saliency of weight $w_i$ would be:
\[ \text{saliency}_i = \frac{H_{ii}\, w_i^2}{2} \tag{14.21} \]
The optimal brain damage technique involves the removal of the connections defined by the weights of lowest saliency.

✍ Remarks:
➥ In practice the number of weights deleted at each step, the amount of training between weight removals and the overall stopping conditions are established empirically.
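As a sketch (assuming the diagonal of the Hessian is available, e.g. from backpropagation or finite differences), optimal brain damage reduces to ranking weights by (14.21):

import numpy as np

def obd_prune(w, h_diag, n_remove):
    """Optimal brain damage: zero the n_remove weights of lowest saliency.
    w: (N_W,) weights; h_diag: (N_W,) diagonal of the Hessian."""
    saliency = 0.5 * h_diag * w**2          # eq. (14.21)
    idx = np.argsort(saliency)[:n_remove]   # lowest-saliency connections
    w_pruned = w.copy()
    w_pruned[idx] = 0.0                     # remove those links
    return w_pruned, idx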
Optimal Brain Surgeon

The optimal brain surgeon technique works in the same manner as optimal brain damage but does not assume that the Hessian is diagonal, as this assumption may lead to poor results in some cases.
The variation around a minimum of $E$ is then ($\delta W$ being the vector of $\delta w_i$):
\[ \delta E = \frac{1}{2}\, \delta W^T H\, \delta W \tag{14.22} \]
Assuming that weight $w_i$ is pruned then, equivalent to (14.20):
\[ w_i = 0 \ ,\quad \delta w_i = -w_i \quad\Leftrightarrow\quad e_i^T\, \delta W + w_i = 0 \tag{14.23} \]
where $e_i$ is the unit vector parallel to the $w_i$ axis, i.e. $e_i^T \delta W$ is the projection of $\delta W$ on the $w_i$ axis; $e_i$ is of the form $e_i^T = \begin{pmatrix} 0 & \dots & 0 & 1 & 0 & \dots & 0 \end{pmatrix}$, with 1 in the $i$-th position, and is $N_W$-dimensional.
To find the new (pruned) weights, $\delta E$ has to be minimized subject to condition (14.23). The Lagrange multiplier method is used16; the Lagrange function is:
\[ L = \delta E + \lambda (e_i^T\, \delta W + w_i) = \frac{1}{2}\, \delta W^T H\, \delta W + \lambda (e_i^T\, \delta W + w_i) \]
and then, by zeroing the derivative with respect to $\delta W$:
\[ \nabla L = H\, \delta W + \lambda e_i = \widehat{0} \quad\Rightarrow\quad \delta W = -\lambda H^{-1} e_i \]
and, by replacing in (14.23) and considering the form of $e_i^T$, and thus $e_i^T H^{-1} e_i = \{H^{-1}\}_{ii}$:
\[ -\lambda\, e_i^T H^{-1} e_i + w_i = 0 \quad\Rightarrow\quad \lambda = \frac{w_i}{\{H^{-1}\}_{ii}} \]
such that finally:
\[ \delta W = -\frac{w_i}{\{H^{-1}\}_{ii}}\, H^{-1} e_i \tag{14.24} \]
Replacing (14.24) into (14.22), the minimal error variation due to the removal of the neural link corresponding to $w_i$ is ($H$ is symmetric and then $H^{-1}$ is as well; also use the matrix property $(AB)^T = B^T A^T$):
\[ \delta E = \frac{1}{2}\, \frac{w_i^2}{\{H^{-1}\}_{ii}^2}\, e_i^T (H^{-1})^T H H^{-1} e_i = \frac{1}{2}\, \frac{w_i^2}{\{H^{-1}\}_{ii}^2}\, e_i^T (H^{-1})^T e_i = \frac{1}{2}\, \frac{w_i^2}{\{H^{-1}\}_{ii}} \tag{14.25} \]
Optimal brain surgeon works in a similar way to optimal brain damage: after some training the inverse Hessian is calculated and the weights involving the smallest error variation, as given by (14.25), are zeroed, i.e. the corresponding inter-neuronal links removed, the remaining weights being updated according to (14.24). Then the procedure is repeated.

16 See mathematical appendix.
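A corresponding NumPy sketch (assuming the full inverse Hessian has been computed; the names are illustrative):

import numpy as np

def obs_step(w, H_inv):
    """Optimal brain surgeon: prune the weight with the smallest error
    increase (14.25) and update the remaining weights by (14.24)."""
    dE = 0.5 * w**2 / np.diag(H_inv)         # eq. (14.25) for every i
    i = int(np.argmin(dE))                   # link with least saliency
    dW = -(w[i] / H_inv[i, i]) * H_inv[:, i] # eq. (14.24); H^{-1} e_i is column i
    w_new = w + dW                           # drives w_new[i] to zero
    w_new[i] = 0.0                           # enforce exact removal
    return w_new, i, dE[i]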
14.5.3 Neuron Pruning

Instead of pruning inter-neuronal links, it is possible to prune whole neurons.

To be able to choose which neurons may be removed it is necessary to have a measure of neuron saliency. Such a measure may be the difference in network output error caused by the removal of neuron $j$:
\[ s_j = E_{\text{without neuron } j} - E_{\text{with neuron } j} \]
As the above measure is computationally intensive, an alternative approach is to modify the neuronal input by an overall multiplicative factor $\alpha_j$. The neuronal output is then written as:
\[ z_j = f\left( \alpha_j \sum_i w_{ji}\, z_i \right) \]
where the activation function $f$ is defined such that $f(0) = 0$, e.g. $f = \tanh$. Then for $\alpha_j = 1$ the neuron $j$ is present, while for $\alpha_j = 0$ the neuron is removed.
The saliency of neuron $j$ becomes:
\[ s_j = E\big|_{\alpha_j = 0} - E\big|_{\alpha_j = 1} \]
which shows that the derivative of $E$ with respect to $\alpha_j$ is also a good measure of neuronal saliency:
\[ \widetilde{s}_j = -\frac{\partial E}{\partial \alpha_j} \bigg|_{\alpha_j = 1} \]
which may be evaluated in a backpropagation fashion. Note the "$-$" sign: usually the error increases after the removal of a neuron while $\alpha_j$ decreases, so $s_j$ is taken as a positive quantity (and the bigger it is, the more important the corresponding neuron is).
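A toy NumPy illustration of the gate-factor idea, computing $s_j$ by direct evaluation on a one-hidden-layer network (everything here, network shape included, is an illustrative assumption):

import numpy as np

def net_error(x, t, W1, w2, alpha):
    """Sum-of-squares error of a tiny 1-hidden-layer net with gate
    factors alpha on the hidden neurons (f = tanh, so f(0) = 0)."""
    z = np.tanh(alpha * (x @ W1.T))   # gated hidden outputs
    y = z @ w2
    return 0.5 * np.sum((y - t)**2)

def neuron_saliency(x, t, W1, w2, j):
    """s_j = E(alpha_j = 0) - E(alpha_j = 1), by direct evaluation."""
    a1 = np.ones(W1.shape[0])
    a0 = a1.copy(); a0[j] = 0.0       # neuron j removed
    return net_error(x, t, W1, w2, a0) - net_error(x, t, W1, w2, a1)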
➧ 14.6 Committees Of Networks

As in practice it is common to train several different networks (with different architectures) in order to choose the best one, it would be even better to combine several networks to form a committee (the members are not even required to be networks, they may be any kind of model).
Let us assume that each network has only one output and there are $M$ such networks. The mapping $y_m(x)$ achieved by network $m$ may be seen as the desired mapping $h(x)$ plus some added error $\varepsilon_m(x)$:
\[ y_m(x) = h(x) + \varepsilon_m(x) \]
The averaged sum-of-squares error for network $m$ is:
\[ \langle E_m \rangle = E\{[y_m(x) - h(x)]^2\} = E\{\varepsilon_m^2\} = \int_X \varepsilon_m^2(x)\, p(x)\, dx \]

14.6 See [Bis95] pp. 364-369.
The average error over the whole set of networks, when acting individually (not as a committee), is:
\[ E_{\text{av.}} = \frac{1}{M} \sum_{m=1}^{M} \langle E_m \rangle = \frac{1}{M} \sum_{m=1}^{M} E\{\varepsilon_m^2\} \]
Simple committee

The simplest way of building a committee of networks is to consider the output of the whole system $y_{\text{com.}}$ as being the average of the individual network outputs:
\[ y_{\text{com.}} = \frac{1}{M} \sum_{m=1}^{M} y_m(x) \tag{14.26} \]
The averaged sum-of-squares error for the committee is:
\[ \langle E_{\text{com.}} \rangle = E\left\{ \left[ \frac{1}{M} \sum_{m=1}^{M} y_m(x) - h(x) \right]^2 \right\} = E\left\{ \left[ \frac{1}{M} \sum_{m=1}^{M} \varepsilon_m(x) \right]^2 \right\} \]
By using the Cauchy inequality in the form:
\[ \left( \sum_{m=1}^{M} \varepsilon_m(x) \right)^2 \leqslant M \sum_{m=1}^{M} \varepsilon_m^2(x) \]
it follows that $\langle E_{\text{com.}} \rangle \leqslant E_{\text{av.}}$, i.e. the error of the committee is less than or equal to the average error over the independent networks.
✍ Remarks:
➥ Considering that the errors $\varepsilon_m(x)$ have zero mean and are uncorrelated:
\[ E\{\varepsilon_m\} = 0 \quad\text{and}\quad E\{\varepsilon_m \varepsilon_n\} = 0 \ \text{ for } m \neq n \]
then the error $\langle E_{\text{com.}} \rangle$ becomes:
\[ \langle E_{\text{com.}} \rangle = \frac{1}{M^2} \sum_{m=1}^{M} E\{\varepsilon_m^2\} = \frac{1}{M}\, E_{\text{av.}} \]
which shows a big improvement. However, in practice the errors $\varepsilon_m(x)$ are strongly correlated, the same input vector $x$ giving a similar error on different networks.
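The effect is easy to check numerically; a short sketch with synthetic, partially correlated errors (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
M, P = 10, 1000
# synthetic per-network errors eps_m(x_p); a shared component mimics
# the correlation seen in practice between networks
shared = rng.normal(size=P)
eps = 0.5 * shared + rng.normal(size=(M, P))

E_av = np.mean(eps**2)                    # average individual error
E_com = np.mean(eps.mean(axis=0)**2)      # simple-committee error (14.26)
print(E_com <= E_av)                      # True, per the Cauchy inequality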
Weighted committee

Another way of building a committee is to take a weighted mean over the individual network outputs:
\[ y_{\text{com.}} = \sum_{m=1}^{M} \beta_m\, y_m(x) \tag{14.27} \]
As $y_{\text{com.}}$ has the same meaning as $y_m(x)$, the $\beta_m$ coefficients should clearly have the meaning of a probability, i.e. $\beta_m$ is the probability of $y_m$ being the desired output from the whole $\{y_m\}$ set. Then the following conditions should be imposed:
\[ \beta_m \in [0, 1] \ ,\ \forall m \quad\text{and}\quad \sum_{m=1}^{M} \beta_m = 1 \tag{14.28} \]
and they will be determined so as to minimize the committee error.
✍ Remarks:
➥ By comparing (14.27) with (14.26) it is clear that in the simple model all networks have equal weight $1/M$.

The error correlation matrix $C$ is defined as the square and symmetric matrix whose elements are $C_{mn} = E\{\varepsilon_m(x)\, \varepsilon_n(x)\}$ (as it is defined as an expectation, it does not depend on $x$).

The averaged sum-of-squares error of the committee becomes:
\[ \langle E_{\text{com.}} \rangle = E\{[y_{\text{com.}}(x) - h(x)]^2\} = E\left\{ \left( \sum_{m=1}^{M} \beta_m \varepsilon_m \right)^2 \right\} = E\left\{ \left( \sum_{m=1}^{M} \beta_m \varepsilon_m \right) \left( \sum_{n=1}^{M} \beta_n \varepsilon_n \right) \right\} = \sum_{m=1}^{M} \sum_{n=1}^{M} \beta_m \beta_n C_{mn} \tag{14.29} \]
To find the minimum of $\langle E_{\text{com.}} \rangle$ subject to the constraint (14.28), the Lagrange multiplier method17 is used. The Lagrange function is:
\[ L = \langle E_{\text{com.}} \rangle + \lambda \left( \sum_{m=1}^{M} \beta_m - 1 \right) = \sum_{m=1}^{M} \sum_{n=1}^{M} \beta_m \beta_n C_{mn} + \lambda \left( \sum_{m=1}^{M} \beta_m - 1 \right) \]
and, by zeroing its derivatives with respect to $\beta_m$:
\[ \frac{\partial L}{\partial \beta_m} = 0 \quad\Rightarrow\quad 2 \sum_{n=1}^{M} \beta_n C_{mn} + \lambda = 0 \ ,\quad \text{for } m = \overline{1, M} \]
(as $C$ is symmetric, $C_{ij} = C_{ji}$). Considering the vectors $\beta^T = \begin{pmatrix} \beta_1 & \dots & \beta_M \end{pmatrix}$ and $\widehat{1}^T = \begin{pmatrix} 1 & \dots & 1 \end{pmatrix}$, the above set of equations may be written in matrix form as:
\[ 2 C \beta + \lambda \widehat{1} = \widehat{0} \]
which may be solved easily as:
\[ \beta = -\frac{\lambda}{2}\, C^{-1} \widehat{1} \quad\Leftrightarrow\quad \beta_m = -\frac{\lambda}{2} \sum_{n=1}^{M} \{C^{-1}\}_{mn} \tag{14.30} \]
By replacing the value of $\beta_m$ from (14.30) into (14.28), $\lambda$ is found to be:
\[ \lambda = -\frac{2}{\sum\limits_{m=1}^{M} \sum\limits_{n=1}^{M} \{C^{-1}\}_{mn}} \]

17 See mathematical appendix.
and then replacing $\lambda$ back into (14.30) gives the final values for the $\beta_m$ coefficients:
\[ \beta = \frac{C^{-1} \widehat{1}}{\sum\limits_{m=1}^{M} \sum\limits_{n=1}^{M} \{C^{-1}\}_{mn}} \quad\Rightarrow\quad \beta_m = \frac{\sum\limits_{n=1}^{M} \{C^{-1}\}_{mn}}{\sum\limits_{n=1}^{M} \sum\limits_{o=1}^{M} \{C^{-1}\}_{no}} \tag{14.31} \]
The error (14.29) may be written in matrix form as $\langle E_{\text{com.}} \rangle = \beta^T C \beta$ and then, by replacing the value of $\beta$, and using the relations:
\[ (C^{-1} \widehat{1})^T = \widehat{1}^T (C^{-1})^T = \widehat{1}^T C^{-1} \quad\text{and}\quad \widehat{1}^T C^{-1} \widehat{1} = \sum_{m=1}^{M} \sum_{n=1}^{M} \{C^{-1}\}_{mn} \]
($C$ is symmetric), the minimum error is:
\[ \langle E_{\text{com.}} \rangle = \frac{1}{\sum\limits_{m=1}^{M} \sum\limits_{n=1}^{M} \{C^{-1}\}_{mn}} \]
As the weighted committee generalizes the simple committee, the same argument applies to prove that $\langle E_{\text{com.}} \rangle \leqslant E_{\text{av.}}$.
✍ Remarks:
➥ The coefficients found in (14.31) are not guaranteed to be positive, so this has to be enforced by other means. However, if all $\beta_m$ are positive then from $\forall m:\ \beta_m \geqslant 0$ and $\sum_{m=1}^{M} \beta_m = 1$ it follows that $\beta_m \in [0, 1]$ (the worst case being when all coefficients are zero except one which is 1).
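A direct NumPy sketch of (14.31); in practice $C$ would be estimated from the network errors on a validation set (names are illustrative):

import numpy as np

def committee_weights(eps):
    """Optimal committee coefficients, eq. (14.31).
    eps: (M, P) matrix of network errors on a validation set."""
    C = (eps @ eps.T) / eps.shape[1]      # C_mn = E{eps_m eps_n}
    Cinv1 = np.linalg.solve(C, np.ones(C.shape[0]))
    beta = Cinv1 / Cinv1.sum()            # normalized, sum(beta) = 1
    E_com = 1.0 / Cinv1.sum()             # minimum committee error
    return beta, E_com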
➧ 14.7 Mixture Of Experts

The mixture of experts model divides the input space into sub-areas, using a separate, specially trained network for each sub-area (the expert) and a supplementary network acting as a gateway, deciding which expert will be allowed to generate the final output. See figure 14.5 on the following page.
The error function is built from the negative logarithm of the likelihood function, considering a mixture model18 of $M$ Gaussian distributions. The number of training vectors being $P$:
\[ E = -\sum_{p=1}^{P} \ln \left[ \sum_{m=1}^{M} \alpha_m(x_p)\, p_m(t_p | x_p) \right] \tag{14.32} \]
where the $p_m(t|x)$ components are Gaussians of the form:
\[ p_m(t|x) = \frac{1}{(2\pi)^{N/2}} \exp\left( -\frac{\|t - \mu_m(x)\|^2}{2} \right) \]
($N$ being the dimensionality of the target $t$). The means $\mu_m(x)$ are functions of the input and the

14.7 See [Bis95] pp. 369-371.
18 See chapter "Pattern Recognition" and also chapter "Error Functions" regarding the modeling of conditional distributions.
Figure 14.5: The mixture of experts model. Each "expert" network is specialized in one input sub-area. The gateway network decides which expert will be allowed to give the final output (by blocking all the others). There are $M$ "experts" and the gateway has one output for each "expert".
covariance is set to unity.

Each "expert" represents one Gaussian and outputs its mean $\mu_m(x)$. The $\alpha_m$ coefficients are generated through a softmax function from the outputs $\gamma_m$ of the gateway:
\[ \alpha_m = \frac{\exp(\gamma_m)}{\sum\limits_{n=1}^{M} \exp(\gamma_n)} \]
The mixture of experts is trained simultaneously, including the gateway, by adjusting the weights such that the error (14.32) is minimized.
✍ Remarks:
➥ The model presented here may be extended to a level where each expert becomes an embedded mixture of experts itself.
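A sketch of the error (14.32) for given expert and gateway outputs (NumPy; the shapes and the name `moe_error` are assumptions for illustration):

import numpy as np

def moe_error(mu, gamma, t):
    """Negative log-likelihood (14.32) of a mixture of experts.
    mu: (P, M, N) expert means; gamma: (P, M) gateway outputs;
    t: (P, N) targets. Unit-covariance Gaussians assumed."""
    N = t.shape[1]
    alpha = np.exp(gamma - gamma.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax gating
    sq = ((t[:, None, :] - mu)**2).sum(axis=2)       # ||t - mu_m||^2
    p = np.exp(-0.5 * sq) / (2*np.pi)**(N/2)         # Gaussian components
    return -np.log((alpha * p).sum(axis=1)).sum()    # eq. (14.32)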
➧ 14.8 Other Training Techniques

14.8.1 Cross-validation

Often, in practice, several models are built and then the efficiency of each is estimated, e.g. the generalization capability measured on a validation set, in order to select the best one.

14.8 See [Bis95] pp. 371-377 and [Rip96] pg. 41.
Figure 14.6: The stacked generalization method. The set of networks $\mathcal{N}_{(0)1}, \dots, \mathcal{N}_{(0)M}$ is trained using $P-1$ vectors from the training set; then the network $\mathcal{N}_{(1)}$ is used to assess the generalization capability of the level-0 networks.
However, sometimes the available data is too scarce to afford to put aside a set for validation. In this case the cross-validation technique may be used.

The training set is divided into $S$ subsets. The model under consideration is trained using $S-1$ subsets, the one left out being used as a validation set. There are $S$ such combinations. Then the efficiency of the model is calculated by averaging over all $S$ training/validation results.
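A generic sketch of the $S$-fold procedure; the `train` and `evaluate` callables are placeholders for whatever model is being assessed:

import numpy as np

def cross_validate(x, t, S, train, evaluate):
    """S-fold cross-validation: average validation score over S splits."""
    folds = np.array_split(np.random.permutation(len(x)), S)
    scores = []
    for s in range(S):
        val = folds[s]                                   # held-out subset
        trn = np.concatenate([folds[r] for r in range(S) if r != s])
        model = train(x[trn], t[trn])                    # fit on S-1 subsets
        scores.append(evaluate(model, x[val], t[val]))   # validate on the rest
    return np.mean(scores)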
14.8.2 Stacked Generalization

This method is also applicable when the quantity of available data is small and it is desirable to keep all the "good parts" of various models.

Considering that the number of available training patterns is $P$, a set of $M$ level-0 networks $\mathcal{N}_{(0)1}, \dots, \mathcal{N}_{(0)M}$ is trained using only $P-1$ training patterns. See figure 14.6. The left-out pattern vector is run through the set of networks $\mathcal{N}_{(0)1}, \dots, \mathcal{N}_{(0)M}$; this gives rise to a new pattern (for the next network level) of the form $\{y_1, \dots, y_M\}$.

The whole procedure is repeated in turn for each of the $P$ patterns, giving rise to a new set of $P$ vectors. This new set is used to train a second-level network $\mathcal{N}_{(1)}$, using as target the desired output $y$. Obviously $\mathcal{N}_{(1)}$ assesses the generalization capability of the $\mathcal{N}_{(0)1}, \dots, \mathcal{N}_{(0)M}$ networks.

Finally, the $\mathcal{N}_{(0)1}, \dots, \mathcal{N}_{(0)M}$ networks are trained using all $P$ training patterns $x$.
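In pseudo-NumPy form (leave-one-out level-0 training; all names here are placeholders):

import numpy as np

def stacked_generalization(x, t, level0_trainers, train_level1):
    """Build level-1 training data from leave-one-out level-0 outputs."""
    P, M = len(x), len(level0_trainers)
    z = np.zeros((P, M))                       # new level-1 patterns
    for p in range(P):
        keep = np.arange(P) != p               # leave pattern p out
        for m, trainer in enumerate(level0_trainers):
            net = trainer(x[keep], t[keep])    # train on P-1 patterns
            z[p, m] = net(x[p])                # response to left-out x_p
    net1 = train_level1(z, t)                  # level-1 network N_(1)
    nets0 = [tr(x, t) for tr in level0_trainers]  # final fit on all P
    return nets0, net1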
14.8.3 Complexity Criteria

Complexity criteria are measures of the generalization capability and of the complexity of models.

The prediction error is defined as the sum between the training error and a measure of the complexity of the model:
\[ \text{prediction error} = \text{training error} + \text{complexity measure} \]
where the complexity measure may be, e.g., the number of weights.

For small networks the training error will be large and the complexity measure low. For large networks the training error will be low and the complexity measure high. Thus finding the minimum of the prediction error helps finding the optimal tradeoff between model complexity and generalization capability.
The prediction error, for a sum-of-squares error function, is defined as:
\[ \text{prediction error} = \frac{2E}{P} + \frac{2 N_W}{P}\, \sigma^2 \]
where $E$ is the error given by the training set, after learning was completed, $P$ is the number of training patterns, $N_W$ is the total number of weights and $\sigma^2$ is the variance of the noise in the data, which has to be estimated.

Another way of defining the prediction error, applicable to non-linear models and regularized error functions, is:
\[ \text{prediction error} = \frac{2E}{P} + \frac{2\gamma}{P}\, \sigma^2 \]
where $\gamma$ is named the effective number of parameters and is defined as:
\[ \gamma = \sum_{i=1}^{N_W} \frac{\lambda_i}{\lambda_i + \nu} \]
$\lambda_i$ being the eigenvalues of the Hessian and $\nu$ being the multiplicative parameter of the regularization term.
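For instance (a sketch; `hessian` is assumed to be the Hessian of the unregularized training error):

import numpy as np

def prediction_error(E, P, hessian, nu, sigma2):
    """Regularized prediction error with effective number of parameters."""
    lam = np.linalg.eigvalsh(hessian)        # eigenvalues of the Hessian
    gamma = np.sum(lam / (lam + nu))         # effective number of parameters
    return 2*E/P + 2*gamma*sigma2/P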
14.8.4 Model For Mixed Discrete And Continuous Data

It may happen that the input pattern vector $x$ has a discrete component alongside continuous ones, i.e. it is of the form $x^T = \begin{pmatrix} a & x_1 & \dots & x_N \end{pmatrix}$, where $a$ takes discrete values and the $\{x_i\}$ continuous ones. In this case one way of modeling the distribution $p(x)$ is to try to find a conditional Gaussian distribution of the form:
\[ p(x) = \frac{p_a(a)}{(2\pi)^{N/2}\, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\widehat{x} - \mu_a)^T \Sigma^{-1} (\widehat{x} - \mu_a) \right) \]
where $p_a(a)$ is the probability of $a$ taking a given value, $\widehat{x}^T = \begin{pmatrix} x_1 & \dots & x_N \end{pmatrix}$ is the continuous part of the input vector, and $\mu_a$ and $\Sigma$ are the mean and respectively the covariance matrix (both of which may be $a$-dependent).
CHAPTER 15

Bayesian Techniques

➧ 15.1 Bayesian Learning
15.1.1 Weight Distribution

Let us consider a given network, i.e. the number of layers, the number of neurons and the activation functions of the neurons are all known and fixed.

Let $p(W)$ be the prior probability density of the weights, $N_W$ the total number of weights, $W^T = \begin{pmatrix} w_1 & \dots & w_{N_W} \end{pmatrix}$ the weight vector, $P$ the number of training patterns and $T = \{t_1, \dots, t_P\}$ the training set of targets.

The posterior probability density of the weights $p(W|T)$ is (using the Bayes theorem):
\[ p(W|T) = \frac{p(T|W)\, p(W)}{p(T)} \tag{15.1} \]
where $p(T)$ represents a normalization factor which ensures that $p(W|T)$ integrated over the whole weight space equals unity, thus $p(T) = \int_W p(T|W)\, p(W)\, dW$.
W
✍
Remarks:
➥
As the training set consists of inputs as well, the the probability densities in (15.1)
should be conditioned also on input p(W jT ; X ) = ( ( ) () ) , however the
networks do not model the probability density p( ) of inputs and then X appears
always as a conditioning factor and it will be omitted for brevity.
x
15.1
See [Bis95] pp. 385{398.
275
p T jW;X
p W jX
p T jX
❖
,N ,T
p(W )
W
Bayesian learning involves the following steps:

Some prior probability density $p(W)$ is established for the weights. This will have a rather large breadth, as little is known at this stage.

The probability of the targets given the weights, $p(T|W)$ (the likelihood), is established.

Using the Bayes theorem (15.1), the posterior probability density $p(W|T)$ is found.
15.1.2 Gaussian Prior Weight Distribution

As explained in the previous section, the prior probability density $p(W)$ should be defined in a form which defines some characteristics of the model but, on the other side, leaves enough freedom for the weights.

Let us consider an exponential form:
\[ p(W) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W) \tag{15.2} \]
where $E_W$ is a function of the weights and $Z_W$ is the normalization factor:
\[ Z_W(\alpha) = \int_W \exp(-\alpha E_W)\, dW \tag{15.3} \]
such that $\int_W p(W)\, dW = 1$.

One possible choice for $E_W$ is:
\[ E_W = \frac{1}{2}\, \|W\|^2 = \frac{1}{2} \sum_{i=1}^{N_W} w_i^2 \tag{15.4} \]
which encourages small weights, since for large $\|W\|$, $E_W$ is large and consequently $p(W)$ is small, thus such $W$ have unlikely values.
From (15.3) and (15.4) (see also the mathematical appendix, Gaussian integrals):
\[ Z_W(\alpha) = \int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} \exp\left( -\frac{\alpha}{2} \sum_{i=1}^{N_W} w_i^2 \right) dw_1 \dots dw_{N_W} = \left( \frac{2\pi}{\alpha} \right)^{N_W/2} \tag{15.5} \]
\[ p(W) = \left( \frac{\alpha}{2\pi} \right)^{N_W/2} \exp\left( -\frac{\alpha}{2}\, \|W\|^2 \right) \tag{15.6} \]
15.1.3 Application: A Simple Classifier

Let us consider a neuron with two inputs $x_1$ and $x_2$ and one output $y$. The neuron classifies the input vector $x^T = \begin{pmatrix} x_1 & x_2 \end{pmatrix}$ into two classes $\mathcal{C}_1$ and $\mathcal{C}_2$. The weights are $w_1$ and $w_2$, i.e. the vector $W^T = \begin{pmatrix} w_1 & w_2 \end{pmatrix}$, for inputs $x_1$ respectively $x_2$. See figure 15.1-a. The output $y$ represents the probability1 of $x \in \mathcal{C}_1$ and $1 - y$ the probability of $x \in \mathcal{C}_2$.

1 See chapter "Single Layer Neural Networks".
Figure 15.1: a) The simple classifier: one neuron with two inputs and one output. b) The training data for the classifier: 4 pattern vectors. The dashed line represents the class decision boundary; $x_3$ and $x_4$ are exceptions.

Let us consider that there are 4 training patterns:
\[ x_1 = \begin{pmatrix} 5 \\ 5 \end{pmatrix} \in \mathcal{C}_2 \ ,\quad x_2 = \begin{pmatrix} -5 \\ -5 \end{pmatrix} \in \mathcal{C}_1 \ ,\quad x_3 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \in \mathcal{C}_1 \ ,\quad x_4 = \begin{pmatrix} 0 \\ -1 \end{pmatrix} \in \mathcal{C}_2 \]
The reason for this choice is the following: $x_1$ and $x_2$ are good examples of their respective classes while $x_3$ and $x_4$ are exceptions; it is not reasonable to expect correct classification of them from one single neuron, as the decision boundary would not be convex2 (the problem is not linearly separable). See figure 15.1-b. However $x_3$ and $x_4$ do carry some information: together with $x_1$ and $x_2$ they suggest that the decision boundary is rather as depicted in figure 15.1; had they been absent, the decision of where to draw the "line" would have been more difficult (a lower probability for the chosen one to be the right one).

The neuronal activation function is chosen as the sigmoid:
\[ y = \frac{1}{1 + \exp(-W^T x)} = \frac{1}{1 + \exp(-w_1 x_1 - w_2 x_2)} \]
As the probability of $x \in \mathcal{C}_1$ is $y$ and the probability of $x \in \mathcal{C}_2$ is $1 - y$, the probability of the targets, conditioned on the weights, is:
\[ p(T|W) = \prod_{x_p \in \mathcal{C}_1} y(x_p) \prod_{x_p \in \mathcal{C}_2} [1 - y(x_p)] = [1 - y(x_1)]\, y(x_2)\, y(x_3)\, [1 - y(x_4)] \]
The prior probability density of the weights is chosen as a Gaussian of type (15.6) with $\alpha = 1$:
\[ p(W) = p(w_1, w_2) = \frac{1}{2\pi} \exp\left( -\frac{w_1^2 + w_2^2}{2} \right) \]

2 See chapter "Single Layer Neural Networks".
Figure 15.2: The probability density of the weights: a) weight distribution before learning; b) level curves of the weight distribution before learning; c) weight distribution after learning; d) level curves of the weight distribution after learning. Figures a) and b) show the prior probability $p(W)$; figures c) and d) show the posterior probability $p(W|T)$.
and the graph of this function is drawn in figure 15.2, a) and b).

The normalization factor $p(T) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(T|W)\, p(W)\, dw_1\, dw_2$ may be calculated numerically.

Finally, the posterior probability of the weights is $p(W|T) = \frac{p(T|W)\, p(W)}{p(T)}$ and its graph is depicted in figures 15.2-c and 15.2-d.

The best weights correspond to the maximum of $p(W|T)$, which occurs at $w_1 \simeq -0.3$ and $w_2 \simeq -0.3$.
✍ Remarks:
➥ Before any data is available, the best weights correspond to the maximum of the prior probability, which occurs at $w_1 = 0$ and $w_2 = 0$. This gives $y(x) = 0.5$, $\forall x$, i.e. there is an equal probability of $x$ belonging to either class, the result reflecting the absence of any data on which to base the decision.
➥ After the data becomes available, the weights are shifted to $w_1 = -0.3$ and $w_2 = -0.3$, which gives $y(x_1) \simeq 0.0475$, $y(x_2) \simeq 0.9525$, $y(x_3) \simeq 0.4255$ and $y(x_4) \simeq 0.5744$. The patterns $x_3$ and $x_4$ are misclassified (it should have been $y(x_3 \in \mathcal{C}_1) > 0.5$ and $y(x_4 \in \mathcal{C}_2) < 0.5$) but this was to be expected, and these patterns still carry some useful information (they are used to reinforce the established decision boundary).
➥ In general the prior probability $p(W)$ is wide and has a low maximum while the posterior probability is narrow and has (a) high peak(s); this may also be seen in figure 15.2.
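The whole example fits in a few lines of NumPy; this sketch reproduces the numerical computation of the posterior on a grid (the grid bounds and resolution are arbitrary choices), so the values quoted above, e.g. the maximum near $w_1 = w_2 = -0.3$, can be checked directly:

import numpy as np

# the 4 training patterns; targets: 1 for class C1, 0 for C2
X = np.array([[5, 5], [-5, -5], [0, 1], [0, -1]], dtype=float)
t = np.array([0, 1, 1, 0], dtype=float)

w1, w2 = np.meshgrid(np.linspace(-3, 3, 601), np.linspace(-3, 3, 601))
prior = np.exp(-(w1**2 + w2**2) / 2) / (2 * np.pi)

likelihood = np.ones_like(w1)
for (x1, x2), tp in zip(X, t):
    y = 1 / (1 + np.exp(-(w1 * x1 + w2 * x2)))      # sigmoidal output
    likelihood *= y**tp * (1 - y)**(1 - tp)          # p(T|W), pattern by pattern

post = prior * likelihood
post /= post.sum() * 0.01**2                         # numerical p(T), cell = 0.01^2
i = np.unravel_index(post.argmax(), post.shape)
print(w1[i], w2[i])                                  # approx. -0.3, -0.3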
15.1.4 Gaussian Noise Model
In general, the likelihood function $p(T|W)$ may be written in exponential form as:
\[ p(T|W) = \frac{1}{Z_T(\beta)} \exp(-\beta E_T) \tag{15.7} \]
where $E_T$ is the error function, $\beta$ is a parameter and $Z_T(\beta)$ is a normalization factor which ensures that $p(T|W)$ is normalized:
\[ Z_T(\beta) = \int_Y \exp(-\beta E_T)\, dt_1 \dots dt_P \tag{15.8} \]
Assuming a sum-of-squares error function3, and that the targets are generated from a smooth function to which zero-mean noise has been added, then:
\[ p(t|x, W) \propto \exp\left( -\frac{\beta}{2} [y(x, W) - t]^2 \right) \tag{15.9} \]
and thus:
\[ p(T|W) = \prod_{p=1}^{P} p(t_p|x_p, W) = \frac{1}{Z_T(\beta)} \exp\left( -\frac{\beta}{2} \sum_{p=1}^{P} [y(x_p, W) - t_p]^2 \right) \]

3 See chapter "Error Functions".
i.e.:
\[ E_T = \frac{1}{2} \sum_{p=1}^{P} [y(x_p, W) - t_p]^2 \tag{15.10} \]
and it becomes evident that $\beta$ controls the variance of the noise: $\sigma^2 = 1/\beta$. By replacing $E_T$ into (15.8) and integrating4:
\[ Z_T(\beta) = \left( \frac{2\pi}{\beta} \right)^{P/2} \tag{15.11} \]
15.1.5 Gaussian Posterior Weight Distribution

From (15.1), (15.2) and (15.7), and defining $Z_S \equiv p(T)$ up to normalization, then:
\[ p(W|T) = \frac{1}{Z_S} \exp(-\alpha E_W - \beta E_T) = \frac{1}{Z_S} \exp[-S(W)] \tag{15.12} \]
where $S(W) = \alpha E_W + \beta E_T$ and $Z_S(\alpha, \beta) = \int_W \exp[-S(W)]\, dW$.

Considering the expression (15.10) of $E_T$ and (15.4) of $E_W$, then:
\[ S(W) = \frac{\beta}{2} \sum_{p=1}^{P} [y(x_p, W) - t_p]^2 + \frac{\alpha}{2} \sum_{i=1}^{N_W} w_i^2 \]
which represents a sum-of-squares error function with a weight decay regularization term5. Since an overall multiplicative factor does not matter, the multiplicative constant of the regularization term is $\alpha/\beta$. For the most probable weight vector $W^*$, for which $p(W|T)$ is maximum, $S$ is minimum and thus the regularized error is minimum.
15.1.6 Consistent Prior Weight Distribution

The plain weight decay, as the one from (15.4), is not consistent with linear transformations6. However, for a two-layer network, the simple weight decay may be changed to the form:
\[ E_W = \frac{\alpha_1}{2} \sum_{\substack{\text{hidden} \\ \text{layer}}} w^2 + \frac{\alpha_2}{2} \sum_{\substack{\text{output} \\ \text{layer}}} w^2 \]
15.1.7 Approximation Of The Weight Distribution

In order to simplify the computational process of finding the (maximum of the) posterior probability, the function $S(W)$ may be developed in series around the minimum $W^*$ and only the most significant terms retained:
\[ S(W) = S(W^*) + \frac{1}{2} (W - W^*)^T H_S (W - W^*) + \mathcal{O}[(W - W^*)^3] \simeq S(W^*) + \frac{1}{2} (W - W^*)^T H_S (W - W^*) \]
where $H_S$ is the Hessian of the regularized error function (the term proportional to $W - W^*$ is zero due to the fact that the series development is done around the minimum).

4 See mathematical appendix, regarding the Gaussian integrals.
5 See chapter "Learning Optimization".
6 See chapter "Learning Optimization", also for the form of the changed weight decay expression.
Considering the gradient as a vectorial operator $\nabla^T = \begin{pmatrix} \frac{\partial}{\partial w_1} & \dots & \frac{\partial}{\partial w_{N_W}} \end{pmatrix}$ then, considering a weight decay $E_W$, the Hessian is7:
\[ H_S = (\nabla \nabla^T)\, S(W^*) = \beta\, (\nabla \nabla^T)\, E_T(W^*) + \alpha I = \beta H + \alpha I \]
The posterior distribution becomes:
\[ p(W|T) = \frac{1}{Z_S^*} \exp\left[ -S(W^*) - \frac{1}{2} (W - W^*)^T H_S\, (W - W^*) \right] \tag{15.13} \]
where $Z_S^*$ is the normalization coefficient of the distribution (15.13), and then8:
\[ Z_S^* = \exp[-S(W^*)] \int_W \exp\left[ -\frac{1}{2} (W - W^*)^T H_S\, (W - W^*) \right] dW = \exp[-S(W^*)]\, (2\pi)^{N_W/2}\, |H_S|^{-1/2} \tag{15.14} \]
➧ 15.2 Network Outputs Distribution

The probability density of the targets is dependent on the input vector and, through the training set, on the weights:
\[ p(t|x, T) = \int_W p(t|x, W)\, p(W|T)\, dW \]
Considering a sum-of-squares error function, i.e. $p(t|x, W)$ given by (15.9), and the quadratic approximation of $p(W|T)$ given by (15.13), then:
\[ p(t|x, T) \propto \int_W \exp\left( -\frac{\beta}{2} [t - y(x, W)]^2 \right) \exp\left( -\frac{1}{2} (W - W^*)^T H_S\, (W - W^*) \right) dW \tag{15.15} \]
Let us define $g \equiv \nabla y|_{W^*}$ and $\Delta W = W - W^*$. As the error function was approximated by a series development around $W^*$, the same can be done with $y(x, W)$:
\[ y(x, W) \simeq y^* + g^T \Delta W \]
where $y^* \equiv y(x, W^*)$.

7 The method of calculation of the Hessian $H$ of the error function is discussed in chapter "Multi Layer Neural Networks".
8 See also mathematical appendix.
15.2 See [Bis95] pp. 398-402.
The targets' probability density becomes:
\[ p(t|x, T) = C \int_W \exp\left( -\frac{\beta}{2} [t - y^* - g^T \Delta W]^2 - \frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW \]
where $C$ is a multiplicative constant such that $\int_Y p(t|x, T)\, dt = 1$.

As $(g^T \Delta W)^2 = \Delta W^T g g^T \Delta W$ (by using the matrix property $(AB)^T = B^T A^T$), then:
\[ p(t|x, T) = C \exp\left( -\frac{\beta}{2} (t - y^*)^2 \right) \int_W \exp\left( -\frac{1}{2} \Delta W^T (H_S + \beta g g^T)\, \Delta W + \beta (t - y^*)\, g^T \Delta W \right) dW \]
which represents a Gaussian integral with a linear term9, and then:
\[ p(t|x, T) = C \exp\left( -\frac{\beta}{2} (t - y^*)^2 \right) (2\pi)^{N_W/2}\, |H_S + \beta g g^T|^{-1/2} \exp\left( \frac{\beta^2}{2} (t - y^*)^2\, g^T (H_S + \beta g g^T)^{-1} g \right) \]
\[ = C' \exp\left( -\frac{(t - y^*)^2}{2} \left[ \beta - \beta^2 g^T (H_S + \beta g g^T)^{-1} g \right] \right) \]
where $C' = \text{const.}$

The normalization condition $\int_Y p(t|x, T)\, dt = 1$ (where $Y \equiv (-\infty, \infty)$) gives:
\[ C' = \left[ \frac{\beta - \beta^2 g^T (H_S + \beta g g^T)^{-1} g}{2\pi} \right]^{1/2} \]
)2
(
t
;
y
1
where
p(tjx; T ) = p 2 exp ; 22
2t
t
1
t2 =
(15.16)
2
T
; g (HS + ggT );1 g
representing a Gaussian distribution10 with t the variance and hti = y the mean of t.
To simplify the expression of t2 rst the following transformation is done HS ! HS0 = HS =
) HS;1 ! HS0 ;1 = HS , and then the numerator and denominator are both multiplied by
the number gT (I + HS0 ;1 ggT ) g which gives:
T
0 ;1 T
S gg )g
t2 = gT (I + H 0 ;1 ggT )g ;g g(TI (+HH
0 + ggT );1 ggT (I + H 0 ;1 ggT )g
S
S
S
9
10
See the mathematical appendix.
See also the chapter \Pattern Recognition".
The first term of the denominator reduces to $g^T g + g^T H_S'^{-1} g\; g^T g$ while, for the second, the matrix property $(AB)^{-1} = B^{-1} A^{-1}$ is applied, in the form:
\[ (H_S' + g g^T)^{-1} = \left[ H_S' (I + H_S'^{-1} g g^T) \right]^{-1} = (I + H_S'^{-1} g g^T)^{-1} H_S'^{-1} \]
and then the second term of the denominator becomes:
\[ g^T (I + H_S'^{-1} g g^T)^{-1} H_S'^{-1} g\, g^T (I + H_S'^{-1} g g^T)\, g \]
To simplify this expression, a $g^T I g$ is added and subtracted and, in the added term, $I$ is replaced by $I = (I + H_S'^{-1} g g^T)^{-1} I\, (I + H_S'^{-1} g g^T)$. By regrouping, the whole expression reduces to $g^T H_S'^{-1} g\; g^T g$.

Finally, the variance (15.16) becomes:
\[ \sigma_t^2 = \frac{1}{\beta} + g^T H_S^{-1} g \]
T
As it measures the \width" of distribution (see the remarks below) the variance t may be
considered as an error bar for y. The 1= part is due to noise in data while the T HS
part is due to the \width" of the posterior probability of weights. The error bar is between
y( ) ; C and y( ) + C, the value of constant C to be established application dependent
(see also the remarks below).
x
✍
g g
x
✍ Remarks:
➥ Considering a unidimensional distribution:
\[ p(x) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left( -\frac{(x - \langle x \rangle)^2}{2 \sigma^2} \right) \]
the "width" of it (see figure 15.3 on the following page) is proportional to the standard deviation $\sigma$. The width of the distribution, at half of the maximum:
\[ p(x) = \frac{1}{2}\, p_{\max} = \frac{1}{2}\, \frac{1}{\sqrt{2\pi}\, \sigma} \quad\Leftrightarrow\quad x = \langle x \rangle \pm \sigma \sqrt{2 \ln 2} \]
depends only on $\sigma$, being $\text{width} = 2\sigma\sqrt{2 \ln 2} \simeq 2.35\, \sigma$.
➥ The width of the probability distribution equals $2\sigma$ for $x = \langle x \rangle \pm \sigma$, at which point the probability drops to $p(\langle x \rangle \pm \sigma) = p_{\max}/\sqrt{e} \simeq 0.606\, p_{\max}$.
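A sketch of such Bayesian error bars (assuming the network gradient `g` and the regularized Hessian `H_S` at $W^*$ are available; the names are illustrative):

import numpy as np

def error_bar(g, H_S, beta):
    """Output std. dev. from sigma_t^2 = 1/beta + g^T H_S^{-1} g (eq. 15.16)."""
    u = np.linalg.solve(H_S, g)       # H_S^{-1} g, without an explicit inverse
    return np.sqrt(1.0/beta + g @ u)

# usage: y_star +/- C * error_bar(g, H_S, beta), with C chosen per application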
15.2.1 Generalized Linear Networks

The generalized linear network has one layer and the output is of the form11:
\[ y(x, W) = W^T \varphi(x) \quad\text{where}\quad \varphi^T(x) = \begin{pmatrix} \varphi_1(x) & \dots & \varphi_H(x) \end{pmatrix} \]
$\varphi_j(x)$ being the activation function of neuron $j$.

11 See chapter "Radial Basis Function Networks".
Figure 15.3: The Gaussian width. At $p(x) = p_{\max}/2$ the width is $2\sigma\sqrt{2 \ln 2}$.
The error function of this network is:
\[ S(W) = \frac{\beta}{2} \sum_{p=1}^{P} [t_p - W^T \varphi(x_p)]^2 + \frac{\alpha}{2}\, \|W\|^2 \]
and, as it is a quadratic function of $W$, the weight probability density is strictly Gaussian with one single maximum.

Considering $W^*$ as the weight vector corresponding to the maximum of the posterior weight probability, and $y^*(x) \equiv y(x, W^*)$, then:
\[ y(x, W) = y^*(x) + \Delta W^T \varphi(x) \]
where $\Delta W = W - W^*$.

The Hessian is $H_S = (\nabla \nabla^T)\, S(W) = \beta \sum_{p=1}^{P} \varphi(x_p)\, \varphi(x_p)^T + \alpha I$.

The posterior probability of the targets is calculated in a similar way as for (15.15):
\[ p(t|x, T) \propto \int_W \exp\left( -\frac{\beta}{2} [t - W^T \varphi(x)]^2 - \frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW \]
with a variance $\sigma_t^2 = \frac{1}{\beta} + \varphi^T H_S^{-1} \varphi$.
➧ 15.3 Classification

Let us consider a classification problem with two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ and a network having only one output $y$, representing the probability of the input vector $x$ being of class $\mathcal{C}_1$; then $1 - y$ represents the probability of $x$ being of class $\mathcal{C}_2$. Obviously the targets $t \in \{0, 1\}$ and the network output $y \in [0, 1]$.

15.3 See [Bis95] pp. 403-406.
The likelihood function for observing the whole training set is12:
\[ p(T|W) = \prod_{p=1}^{P} [y(x_p)]^{t_p}\, [1 - y(x_p)]^{1 - t_p} = L = \exp[-G(T|W)] \tag{15.17} \]
where $G(T|W)$ is the cross-entropy:
\[ G(T|W) = -\sum_{p=1}^{P} \{ t_p \ln y(x_p) + (1 - t_p) \ln[1 - y(x_p)] \} \]
✍ Remarks:
➥ The normalization condition for the distribution (15.17) is $\sum_{t_p \in \{0,1\}} p(T|W) = 1$. After the replacement of $p(T|W)$ from (15.17), the result is a product of terms of the form $y(x_p) + [1 - y(x_p)] = 1$, i.e. the distribution $p(T|W)$ is normalized.

The neuron activation function is taken as the sigmoid $y(x) = f(a) = \frac{1}{1 + \exp(-a)}$, where $a = \sum_j w_j z_j$, $w_j$ being the weights of the connections from the hidden neurons $z_j$ coming into the output neuron $y$, and $a$ being the total input.
Considering a prior distribution (15.2), the posterior distribution, from (15.1), similar to (15.12), will be:
\[ p(W|T) = \frac{1}{Z_S} \exp(-G - \alpha E_W) = \frac{1}{Z_S} \exp[-S(W)] \]
where $Z_S$ is the normalization coefficient.

Considering a quadratic approximation as in (15.13), the distribution may be approximated as:
\[ p(W|T) = \frac{1}{Z_S^*} \exp\left[ -S(W^*) - \frac{1}{2} \Delta W^T H_S\, \Delta W \right] \tag{15.18} \]
$Z_S^*$ being the normalization coefficient; the condition $\int_W p(W|T)\, dW = 1$ gives:
\[ Z_S^* = \int_W \exp\left[ -S(W^*) - \frac{1}{2} \Delta W^T H_S\, \Delta W \right] dW = e^{-S(W^*)} \prod_{i=1}^{N_W} \sqrt{\frac{2\pi}{\lambda_i}} \tag{15.19} \]
(see also the mathematical appendix), $\lambda_i$ being the eigenvalues of $H_S$.
As the network output is interpreted as a probability, i.e. $y(x, W) = P(\mathcal{C}_1|x, W)$, then, for a new vector $x$, Bayesian learning involves an integration over all possible weights, in the form:
\[ P(\mathcal{C}_1|x, T) = \int_W P(\mathcal{C}_1|x, W)\, p(W|T)\, dW = \int_W y(x, W)\, p(W|T)\, dW \]

12 See chapter "Error functions", section "Cross entropy".
Assuming a linear approximation of the total input $a$ with respect to the weights:
\[ a(x, W) = a^*(x) + g^T \Delta W \tag{15.20} \]
where $a^*(x) \equiv a(x, W^*)$ and $g(x) \equiv \nabla a(x, W)|_{W^*}$.

The posterior probability density of $a$ may be written as:
\[ p(a|x, T) = \int_W p(a|x, W)\, p(W|T)\, dW = \int_W \delta(a - a^* - g^T \Delta W)\, p(W|T)\, dW \tag{15.21} \]
Since, from (15.20), $a$ and $\Delta W$ are linearly related, and $p(W|T)$ is a Gaussian, the distribution of $a$ is also a Gaussian, of the form:
\[ p(a|x, T) = \frac{1}{\sqrt{2\pi}\, s} \exp\left( -\frac{(a - \langle a \rangle)^2}{2 s^2} \right) \tag{15.22} \]
with the mean $\langle a \rangle$ and variance $s^2$ being:
\[ \langle a \rangle = a^* \quad\text{and}\quad s^2 = g^T H_S^{-1} g \]
Proof. 1. Using (15.21) and (15.18):
\[ \langle a \rangle = \frac{e^{-S(W^*)}}{Z_S^*} \int_A \int_W a\, \delta(a - a^* - g^T \Delta W) \exp\left( -\frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW\, da = \frac{e^{-S(W^*)}}{Z_S^*} \int_W (a^* + g^T \Delta W) \exp\left( -\frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW \]
Replacing (15.19), considering that $a^* = \text{const.}$ and that $\int_W g^T \Delta W \exp\left( -\frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW = 0$ (the integrand is an odd function13 of $\Delta W$ and the integral is done over an origin-centred interval), then $\langle a \rangle = a^*$.

2. The variance is:
\[ s^2 = \frac{e^{-S(W^*)}}{Z_S^*} \int_A \int_W (a - a^*)^2\, \delta(a - a^* - g^T \Delta W) \exp\left( -\frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW\, da = \frac{e^{-S(W^*)}}{Z_S^*} \int_W (g^T \Delta W)^2 \exp\left( -\frac{1}{2} \Delta W^T H_S\, \Delta W \right) dW \]
Let $u_i$ be the eigenvectors and $\lambda_i$ the eigenvalues of $H_S$, i.e. $H_S u_i = \lambda_i u_i$. As $H_S$ is symmetric, it is possible14 to build an orthonormal system of eigenvectors. Let us consider:
\[ \Delta W = \sum_i \Delta w_i\, u_i \quad\text{and}\quad g = \sum_i g_i\, u_i \]
and, by replacing into the above expression of the variance:
\[ s^2 = \frac{e^{-S(W^*)}}{Z_S^*} \int_W \left( \sum_i g_i\, \Delta w_i \right)^2 \exp\left( -\frac{1}{2} \sum_i \lambda_i\, \Delta w_i^2 \right) d(\Delta W) \]

13 A function $f$ is odd when $f(-x) = -f(x)$.
14 See mathematical appendix.
Developing the square of the sum, the variance becomes a sum of two types of integrals. The first one, for $i \neq j$, is of the form:
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_i\, \Delta w_i\, g_j\, \Delta w_j \exp\left( -\frac{1}{2} (\lambda_i \Delta w_i^2 + \lambda_j \Delta w_j^2) \right) d(\Delta w_i)\, d(\Delta w_j) = \left[ \int_{-\infty}^{\infty} g_i \Delta w_i\, e^{-\frac{\lambda_i \Delta w_i^2}{2}}\, d(\Delta w_i) \right] \left[ \int_{-\infty}^{\infty} g_j \Delta w_j\, e^{-\frac{\lambda_j \Delta w_j^2}{2}}\, d(\Delta w_j) \right] = 0 \]
because the integrands are odd functions and the integrals are done over origin-centred intervals. The second type is for $i = j$:
\[ \int_{-\infty}^{\infty} (g_i\, \Delta w_i)^2 \exp\left( -\frac{\lambda_i\, \Delta w_i^2}{2} \right) d(\Delta w_i) = \frac{g_i^2}{\lambda_i} \sqrt{\frac{2\pi}{\lambda_i}} \]
As the $H_S$ matrix may be diagonalized using its eigenvalues, then $s^2 = \sum_{i=1}^{N_W} \frac{g_i^2}{\lambda_i} = g^T H_S^{-1} g$.
The posterior probability $P(\mathcal{C}_1|x, T)$ may be written in terms of the total neuronal input $a$ as:
\[ P(\mathcal{C}_1|x, T) = \int_A P(\mathcal{C}_1|a)\, p(a|x, T)\, da = \int_A f(a)\, p(a|x, T)\, da \]
which does not generally have an analytic solution but may be approximated by:
\[ P(\mathcal{C}_1|x, T) \simeq f\big( a^*\, \varkappa(s) \big) \quad\text{where}\quad \varkappa(s) = \left( 1 + \frac{\pi s^2}{8} \right)^{-1/2} \]
✍ Remarks:
➥ For the simple two-class problem described above, the decision boundary is established at $P(\mathcal{C}_1|x, T) = 0.5$, which corresponds to $a^* = 0$. The same result is obtained using just the most probable weights $W^*$ and the network output alone: $y(x, W^*) = 0.5 \Rightarrow a^* = 0$. Thus the two methods give the same results unless some more complex rules are involved, e.g. a loss matrix.
➧ 15.4 The Evidence Approximation For $\alpha$ And $\beta$

As the parameters $\alpha$ and $\beta$, which appear in the expression of the posterior weight distribution (see equation (15.12)), are themselves not known, the Bayesian framework requires an integration over all their possible values:
\[ p(W|T) = \iint p(W|\alpha, \beta, T)\, p(\alpha, \beta|T)\, d\alpha\, d\beta \tag{15.23} \]
One possible way of dealing with the $\alpha$ and $\beta$ parameters is known as the evidence approximation and is discussed below.

15.4 See [Bis95] pp. 406-415.
The posterior distribution of $\alpha$ and $\beta$, i.e. $p(\alpha, \beta|T)$, is assumed to have a "sharp peak" around the most probable values $\alpha^*$ and $\beta^*$. Then $p(W|\alpha^*, \beta^*, T)$ may be taken out of the integral and, using also the normalization condition $\iint p(\alpha, \beta|T)\, d\alpha\, d\beta = 1$, the distribution (15.23) may be approximated as:
\[ p(W|T) \simeq p(W|\alpha^*, \beta^*, T) \iint p(\alpha, \beta|T)\, d\alpha\, d\beta = p(W|\alpha^*, \beta^*, T) \]
i.e. the integral over all possible values of $\alpha$ and $\beta$ is replaced by the use of the most probable values $\alpha^*$ and $\beta^*$.

To find the most probable $\alpha$ and $\beta$ values, it is necessary to estimate the posterior probability of $\alpha$ and $\beta$. This is achieved by using the Bayes theorem $p(\alpha, \beta|T) = \frac{p(T|\alpha, \beta)\, p(\alpha, \beta)}{p(T)}$, where $p(\alpha, \beta)$ is the prior probability density, also known as the hyperprior. If little is known about the model to be built then the hyperprior should give a relatively equal value to all possible $\alpha$ and $\beta$ parameters, the same way the $p(W)$ prior operates. The term $p(T|\alpha, \beta)$ is named the evidence. The term $p(T)$ is the normalization factor.
The evidence may be written in a form making explicit the dependencies on $\alpha$ and $\beta$ of the posterior distributions of targets and weights:
\[ p(T|\alpha, \beta) = \int_W p(T|W, \alpha, \beta)\, p(W|\alpha, \beta)\, dW = \int_W p(T|W, \beta)\, p(W|\alpha)\, dW \]
as the prior weight distribution is independent of $\beta$ (which is data related): $p(W|\alpha, \beta) = p(W|\alpha)$, and the likelihood function is independent of $\alpha$: $p(T|W, \alpha, \beta) = p(T|W, \beta)$ (see equations (15.2) and (15.7)).
From the same set of equations, (15.2) and (15.7):
\[ p(T|\alpha, \beta) = \frac{1}{Z_W(\alpha)\, Z_T(\beta)} \int_W \exp(-S(W))\, dW = \frac{Z_S(\alpha, \beta)}{Z_W(\alpha)\, Z_T(\beta)} \]
and, considering the values of $Z_S^*$, $Z_W$ and $Z_T$ from (15.14), (15.5) and (15.11) respectively, and as $S(W^*) = \alpha E_W^* + \beta E_T^*$, then:
\[ \ln p(T|\alpha, \beta) = -\alpha E_W^* - \beta E_T^* - \frac{1}{2} \ln |H_S| + \frac{N_W}{2} \ln \alpha + \frac{P}{2} \ln \beta - \frac{P}{2} \ln(2\pi) \tag{15.24} \]
Considering $E_W$ a quadratic form in the weights, then:
\[ H_S = (\nabla \nabla^T)(\alpha E_W + \beta E_T) = \alpha I + \beta\, (\nabla \nabla^T) E_T = \alpha I + H \]
$H = \beta\, (\nabla \nabla^T) E_T$ being the Hessian of the error term. As the Hessian is a symmetric matrix, it may be diagonalized15 using its eigenvalues $\lambda_i$ and, obviously, $H_S$ has the eigenvalues $\lambda_i + \alpha$ and $|H_S| = \prod_i (\lambda_i + \alpha)$. Finally:
\[ \frac{d}{d\alpha} \ln |H_S| = \frac{d}{d\alpha} \ln \left( \prod_{i=1}^{N_W} (\lambda_i + \alpha) \right) = \sum_{i=1}^{N_W} \frac{1}{\lambda_i + \alpha} = \text{Tr}\, H_S^{-1} \tag{15.25} \]
where the $1/(\lambda_i + \alpha)$ are the eigenvalues of $H_S^{-1}$ and the $\lambda_i$ eigenvalues were supposed to be $\alpha$-independent, i.e. $E_T$ is also a quadratic form in the weights.

15 See mathematical appendix.
The condition for the minimum of $\ln p(T|\alpha, \beta)$ with respect to $\alpha$ is $\frac{\partial \ln p}{\partial \alpha} = 0$, which, from (15.24), gives:
\[ 2 \alpha E_W^* = N_W - \sum_{i=1}^{N_W} \frac{\alpha}{\lambda_i + \alpha} = \sum_{i=1}^{N_W} \frac{\lambda_i}{\lambda_i + \alpha} \equiv \gamma \tag{15.26} \]
where $\gamma = \sum_{i=1}^{N_W} \frac{\lambda_i}{\lambda_i + \alpha}$ may be interpreted as follows:

The prior distribution of the weights is usually chosen to be centered in the origin and with a spherical symmetry so, in the absence of data, the most probable weight vector is $W^* = \widehat{0}$ and consequently $E_W^* = 0$.

When data become available, $E_W^*$ shifts to a position given by (15.26). Considering a system of coordinates rotated such that the axes are parallel to the eigenvectors of $H$, the quantities by which $W^*$ is shifted (from the origin) along each axis are given by $\gamma_i = \frac{\lambda_i}{\lambda_i + \alpha}$.

For $\lambda_i \ll \alpha$, $\gamma_i \searrow 0$, i.e. $w_i \searrow 0$: $w_i$ is not shifted much from the origin and the main contribution to this particular weight is given by the prior weight distribution. For $\lambda_i \gg \alpha$, $\gamma_i \nearrow 1$: the main contribution to this weight is given by the data (through $\lambda_i$). These kinds of $w_i$ weights are named well-determined parameters. Thus $\gamma$ measures the effective number of weights changed by the data present in the training set (all the others being given rather small values by the prior distribution).

The Hessian is $H = \beta\, (\nabla \nabla^T) E_T$; this means that $\lambda_i \propto \beta$ (as the Hessian may be diagonalized using its eigenvalues) and then:
\[ \frac{d\lambda_i}{d\beta} = \frac{\lambda_i}{\beta} \tag{15.27} \]
Using this result, and similarly to the previous derivative:
\[ \frac{d}{d\beta} \ln |H_S| = \frac{d}{d\beta} \ln \left( \prod_{i=1}^{N_W} (\lambda_i + \alpha) \right) = \frac{1}{\beta} \sum_{i=1}^{N_W} \frac{\lambda_i}{\lambda_i + \alpha} = \frac{\gamma}{\beta} \tag{15.28} \]
The condition for the minimum of $\ln p(T|\alpha, \beta)$ with respect to $\beta$ is $\frac{\partial \ln p}{\partial \beta} = 0$, which, from (15.24), gives:
\[ 2 \beta E_T^* = P - \sum_{i=1}^{N_W} \frac{\lambda_i}{\lambda_i + \alpha} = P - \gamma \tag{15.29} \]
and the same comments as above apply.

As $S = \alpha E_W + \beta E_T$ then, from (15.26) and (15.29), $S(W^*) = P/2$.

To find the $\alpha^*$ and $\beta^*$ parameters an iterative algorithm may be used. Starting with some initial values $\alpha_{(0)}$ and $\beta_{(0)}$, the values at step $t+1$ may be computed from (15.26) and (15.29):
\[ \alpha_{(t+1)} = \frac{\gamma_{(t)}}{2 E_{W(t)}^*} \qquad \beta_{(t+1)} = \frac{P - \gamma_{(t)}}{2 E_{T(t)}^*} \]
For large training sets, i.e. $P \gg N_W$, all weights may be approximated as well-determined parameters and then $\gamma_i \simeq 1$ and $\gamma \simeq N_W$, which leads to the much faster updating formulas:
\[ \alpha_{(t+1)} = \frac{N_W}{2 E_{W(t)}^*} \qquad \beta_{(t+1)} = \frac{P}{2 E_{T(t)}^*} \]
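A sketch of the re-estimation loop (assuming a `train` routine returning $E_W^*$, $E_T^*$ and the eigenvalues of $H = \beta(\nabla\nabla^T)E_T$ for the current hyper-parameters; all names are illustrative):

import numpy as np

def evidence_update(alpha, beta, P, train, n_iter=10):
    """Iterative re-estimation of alpha and beta, eqs. (15.26) and (15.29)."""
    for _ in range(n_iter):
        E_W, E_T, lam = train(alpha, beta)   # E_W*, E_T*, eigenvalues of H
        gamma = np.sum(lam / (lam + alpha))  # effective number of parameters
        alpha = gamma / (2 * E_W)            # eq. (15.26)
        beta = (P - gamma) / (2 * E_T)       # eq. (15.29)
    return alpha, beta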
Considering a Gaussian approximation for $p(T|\alpha, \beta)$ in the variables $\ln\alpha$ and $\ln\beta$, then:
\[ p(T|\ln\alpha, \ln\beta) = p(T|\ln\alpha^*, \ln\beta^*) \exp\left( -\frac{1}{2}\, \Delta^T A\, \Delta \right) \]
where $\Delta = \begin{pmatrix} \ln\alpha - \ln\alpha^* \\ \ln\beta - \ln\beta^* \end{pmatrix}$ and:
\[ A = \begin{pmatrix} \sigma_{\ln\alpha}^{-2} & \sigma_{\ln\alpha\ln\beta}^{-2} \\ \sigma_{\ln\alpha\ln\beta}^{-2} & \sigma_{\ln\beta}^{-2} \end{pmatrix} = -\begin{pmatrix} \frac{\partial^2 \ln p}{\partial (\ln\alpha)^2} & \frac{\partial^2 \ln p}{\partial \ln\alpha\, \partial \ln\beta} \\ \frac{\partial^2 \ln p}{\partial \ln\beta\, \partial \ln\alpha} & \frac{\partial^2 \ln p}{\partial (\ln\beta)^2} \end{pmatrix} \]
Note that the logarithms of $\alpha$ and $\beta$ have been considered (instead of $\alpha$, $\beta$ directly) as they are scaling hyper-parameters.

From the above equations, and because $\frac{\partial}{\partial \ln\alpha} = \alpha \frac{\partial}{\partial \alpha}$ (and the same for $\beta$), then:
\[ \sigma_{\ln\alpha}^{-2} = -\alpha \frac{\partial}{\partial \alpha} \left( \alpha\, \frac{\partial \ln p}{\partial \alpha} \right) \ ,\quad \sigma_{\ln\beta}^{-2} = -\beta \frac{\partial}{\partial \beta} \left( \beta\, \frac{\partial \ln p}{\partial \beta} \right) \ ,\quad \sigma_{\ln\alpha\ln\beta}^{-2} = -\alpha\beta\, \frac{\partial^2 \ln p}{\partial \alpha\, \partial \beta} \]
By using (15.24) when calculating this last set of derivatives (and using (15.25), (15.28) and (15.27) as well):
\[ \sigma_{\ln\alpha}^{-2} = \alpha E_W^* + \frac{1}{2} \sum_{i=1}^{N_W} \frac{\alpha \lambda_i}{(\lambda_i + \alpha)^2} \ ,\quad \sigma_{\ln\beta}^{-2} = \beta E_T^* + \frac{1}{2} \sum_{i=1}^{N_W} \frac{\alpha \lambda_i}{(\lambda_i + \alpha)^2} \ ,\quad \sigma_{\ln\alpha\ln\beta}^{-2} = -\frac{1}{2} \sum_{i=1}^{N_W} \frac{\alpha \lambda_i}{(\lambda_i + \alpha)^2} \]
and then, using (15.26) and (15.29):
\[ \sigma_{\ln\alpha}^{-2} = \frac{\gamma}{2} + \frac{1}{2} \sum_{i=1}^{N_W} \frac{\alpha \lambda_i}{(\lambda_i + \alpha)^2} \quad\text{and}\quad \sigma_{\ln\beta}^{-2} = \frac{P - \gamma}{2} + \frac{1}{2} \sum_{i=1}^{N_W} \frac{\alpha \lambda_i}{(\lambda_i + \alpha)^2} \]
The cross term $\sigma_{\ln\alpha\ln\beta}^{-2}$ is a sum of terms $\frac{\alpha \lambda_i}{(\lambda_i + \alpha)^2}$; for $\lambda_i \ll \alpha$ such a term reduces to $\lambda_i/\alpha \ll 1$ while for $\lambda_i \gg \alpha$ it reduces to $\alpha/\lambda_i \ll 1$; the only significant terms come from $\lambda_i \simeq \alpha$, which are usually small in number. Then the following approximations may be performed:
\[ \sigma_{\ln\alpha\ln\beta}^{-2} \ll \sigma_{\ln\alpha}^{-2},\ \sigma_{\ln\beta}^{-2} \ ;\qquad \sigma_{\ln\alpha}^{-2} \simeq \frac{\gamma}{2} \quad\text{and}\quad \sigma_{\ln\beta}^{-2} \simeq \frac{P - \gamma}{2} \tag{15.30} \]
and the $\alpha$ and $\beta$ parameters may be considered statistically independent, their distributions being of the form:
\[ p(T|\ln\alpha) = p(T|\ln\alpha^*) \exp\left( -\frac{(\ln\alpha - \ln\alpha^*)^2}{2 \sigma_{\ln\alpha}^2} \right) \ ,\qquad p(T|\ln\beta) = p(T|\ln\beta^*) \exp\left( -\frac{(\ln\beta - \ln\beta^*)^2}{2 \sigma_{\ln\beta}^2} \right) \tag{15.31} \]

➧ 15.5 Integration Over $\alpha$ And $\beta$
The Bayesian approach requires an integration over all possible values of the unknown parameters, i.e. the posterior probability density of the weights is:
\[ p(W|T) = \iint p(W, \alpha, \beta|T)\, d\alpha\, d\beta = \iint p(W|T, \alpha, \beta)\, p(\alpha, \beta|T)\, d\alpha\, d\beta \]
Using the Bayes theorem (15.1) and as $p(T|W, \alpha, \beta) = p(T|W, \beta)$ (independent of $\alpha$, see also section 15.1.4) and $p(W|\alpha, \beta) = p(W|\alpha)$ (independent of $\beta$, see also section 15.1.2), and considering also that $\alpha$ and $\beta$ are statistically independent, i.e. $p(\alpha, \beta) = p(\alpha)\, p(\beta)$ (see also section 15.4), then:
\[ p(W|T) = \frac{1}{p(T)} \iint p(T|W, \beta)\, p(W|\alpha)\, p(\alpha)\, p(\beta)\, d\alpha\, d\beta \]
Now, a form of $p(\alpha)$ and $p(\beta)$ has to be chosen. The best option would be to choose them such that $p(\ln\alpha)$ and $p(\ln\beta)$ have a relatively large breadth. One possibility, leading to easy integration, would be:
\[ p(\alpha) = \frac{1}{\alpha} \quad\text{and}\quad p(\beta) = \frac{1}{\beta} \]
and the integrals over $\alpha$ and $\beta$ may now be separated.

Using (15.2) and (15.5), the prior of the weights becomes16:
\[ p(W) = \int_0^{\infty} p(W|\alpha)\, p(\alpha)\, d\alpha = \frac{1}{(2\pi)^{N_W/2}} \int_0^{\infty} \exp(-\alpha E_W)\, \alpha^{\frac{N_W}{2} - 1}\, d\alpha = \frac{\Gamma_E(N_W/2)}{(2\pi E_W)^{N_W/2}} \]

15.5 See [Bis95] pp. 415-417.
16 See also mathematical appendix regarding Euler functions.
where $\Gamma_E$ is the Euler gamma function.

The $p(T|W)$ distribution is calculated in the same way, using (15.7) and (15.11):
\[ p(T|W) = \frac{\Gamma_E(P/2)}{(2\pi E_T)^{P/2}} \]
From the above equations, the negative logarithm of the posterior distribution of the weights is:
\[ -\ln p(W|T) = \frac{P}{2} \ln E_T + \frac{N_W}{2} \ln E_W + \text{const.} \]
and then its gradient may be written as:
\[ -\nabla \ln p(W|T) = \beta_c\, \nabla E_T + \alpha_c\, \nabla E_W \tag{15.32} \]
where:
\[ \alpha_c = \frac{N_W}{2 E_W} \quad\text{and}\quad \beta_c = \frac{P}{2 E_T} \tag{15.33} \]
are the current values of the parameters.

The minimum of $-\ln p(W|T)$ may be found by iteratively using (15.32) and (15.33).
✍ Remarks:
➥ While the direct integration over $\alpha$ and $\beta$ seems to be better than the evidence approximation (see section 15.4), in practice the approximations required after integration may give worse results.

➧ 15.6 Model Comparison
Let us consider that there are several models $\{\mathcal{M}_m\}$ for the same problem and data set $T$. The posterior probability of a particular model $\mathcal{M}_m$ is given by the Bayes theorem:
\[ P(\mathcal{M}_m|T) = \frac{p(T|\mathcal{M}_m)\, P(\mathcal{M}_m)}{p(T)} \]
where $P(\mathcal{M}_m)$ is the prior probability of model $\mathcal{M}_m$ and $p(T|\mathcal{M}_m)$ is the evidence for $\mathcal{M}_m$.

An interpretation of the model evidence may be given as follows. Let us consider first the weight dependency; in the Bayesian framework:
\[ p(T|\mathcal{M}_m) = \int_W p(T|W, \mathcal{M}_m)\, p(W|\mathcal{M}_m)\, dW \tag{15.34} \]
and let us consider a single weight: the prior distribution $p(W|\mathcal{M}_m)$ has a low maximum

15.6 See [Bis95] pp. 418-422.
Figure 15.4: The prior and posterior distributions of the weights. The prior distribution $p(w|\mathcal{M}_m)$ has a (relatively) low maximum and a wide width $\Delta w_{\text{prior}}$: all weights have approximately the same probability, denoting the absence of data on which to base a decision. The posterior probability density $p(w|T, \mathcal{M}_m)$ has a high maximum and a narrow width $\Delta w_{\text{post}}$.
Considering a sharp peak of the posterior weight distribution around some maximum w
the the integral (15.34) may be approximated as
p(T jMm) ' p(T jw ; Mm ) p(w jMm ) w
prior
post
post
and also the prior distribution (as is normated) should have a inverse dependency of its
widening p(W jMm ) / 1=w (the wider is the distribution, the smaller is the maximum
and subsequently all values) and then:
w
p(T jMm) / p(T jw ; Mm )
w
prior
post
prior
which represents the product between the likelihood p(T jw ; Mm ), estimated at the most
probable weight value, and a term named the Occam factor. A model with a good t will
have a high likelihood, however these models are usually complex and consequently have a
very high and narrow posterior distribution peak, i.e. a small Occam factor, and reciprocal.
Also for di erent models which make the same predictions, i.e. have the same p(T jW ) the
Occam factor advantages the simpler model, i.e. the one with a larger factor.
Let consider the and hyper-parameter dependency. The Bayesian framework require an
integration over all possible values:
p(T jMm) =
ZZ
p(T j ; ; Mm ) p( ;
jMm ) d d
where p(T j ; ; Mm ) represents the evidence for and | see section 15.4.
(15.35)
Occam factor
The $\alpha$ and $\beta$ parameters are considered statistically independent (see again section 15.4). By using the Gaussian approximation given by (15.31), and considering a uniform prior distribution $p(\ln\alpha) = p(\ln\beta) = 1/\ln\Omega$, where $\Omega$ is a region containing $\alpha^*$, respectively $\beta^*$, the integral (15.35) may be split in the form:
\[ p(T|\mathcal{M}_m) = \frac{p(T|\alpha^*, \beta^*, \mathcal{M}_m)}{(\ln\Omega)^2} \int_{-\infty}^{\infty} \exp\left( -\frac{(\ln\alpha - \ln\alpha^*)^2}{2\sigma_{\ln\alpha}^2} \right) d(\ln\alpha) \int_{-\infty}^{\infty} \exp\left( -\frac{(\ln\beta - \ln\beta^*)^2}{2\sigma_{\ln\beta}^2} \right) d(\ln\beta) = p(T|\alpha^*, \beta^*, \mathcal{M}_m)\, \frac{2\pi\, \sigma_{\ln\alpha}\, \sigma_{\ln\beta}}{(\ln\Omega)^2} \]
The above result was obtained by integrating over a single Gaussian. However, in networks with hidden neurons there is a symmetry and there are many equivalent maxima17, thus the model evidence $p(T|\mathcal{M}_m)$ has to be multiplied by a redundancy factor $R$; e.g. for a 2-layer network with $H$ hidden neurons there are $R = 2^H H!$ equivalent maxima, and the model evidence has to be multiplied by this factor.

On similar grounds as for (15.24), and using (15.30), the logarithm of the evidence becomes:
\[ \ln p(T|\mathcal{M}_m) = -\alpha^* E_W^* - \beta^* E_T^* - \frac{1}{2} \ln |H_S| + \frac{N_W}{2} \ln \alpha^* + \frac{P}{2} \ln \beta^* + \ln R - \frac{1}{2} \ln \frac{\gamma}{2} - \frac{1}{2} \ln \frac{P - \gamma}{2} + \text{const.} \]
where the additive constant is model independent.
By the above means it is possible to calculate the probabilities of various models. However, there are several comments to be made:

The model evidence is not particularly easy to calculate, due to the Hessian determinant $|H_S|$.

Choosing the model with the highest evidence is not necessarily the best option, as there may be several models with significant/comparable evidence.

➧ 15.7 Committee Of Networks

Usually the error has several local, non-equivalent minima, i.e. minima not due to the symmetry18. The posterior probability density may be written as a sum over all the posterior distributions corresponding to the local minima $m_m$:
\[ p(W|T) = \sum_m p(W, m_m|T) = \sum_m p(W|m_m, T)\, P(m_m|T) \]

17 See chapter "Multi Layer Neural Networks", section "Weight-Space Symmetry".
15.7 See [Bis95] pp. 422-424.
18 See footnote 17.
By using the above distribution decomposition, other parameters may be calculated by integration over the weight space, e.g. the averaged output is:
\[ \langle y \rangle = \int_W y(x, W)\, p(W|T)\, dW \simeq \sum_m P(m_m|T) \int_{W_m} y(x, W)\, p(W|m_m, T)\, dW = \sum_m P(m_m|T)\, \langle y_m \rangle \]
where $W_m$ is the portion of the weight space corresponding to the minimum $m_m$ and $\langle y_m \rangle$ is the output average corresponding to $m_m$. The above formula shows a weighted average of the outputs corresponding to the different local minima.

➧ 15.8 Monte Carlo Integration
Bayesian techniques often require integration over a large number of weights, i.e. the computation of integrals of the form:
\[ I = \int_W F(W)\, p(W|T)\, dW \tag{15.36} \]
where $F(W)$ is some integrand. As the number of weights is usually very big, the classical numerical methods of integration lead to a large computational task.

One way to approximate the above type of integrals is to use the Monte Carlo method, i.e. to select a sample set of weight vectors $\{W_i\}_{i=\overline{1,L}}$ from the distribution $p(W|T)$ (i.e. the weights are randomly chosen such that their distribution equals $p(W|T)$) and then approximate the integral by a finite sum:
\[ I \simeq \frac{1}{L} \sum_{i=1}^{L} F(W_i) \]
(the sample average replaces the weighted volume element $p(W|T)\, dW$).

While usually the posterior distribution $p(W|T)$ may be calculated relatively easily, the selection of the sample set $\{W_i\}$ may be difficult. An alternative is to draw the sample weight set from another distribution $q(W)$, in which case the integral (15.36) becomes:
\[ I = \int_W F(W)\, \frac{p(W|T)}{q(W)}\, q(W)\, dW \simeq \frac{1}{L} \sum_{i=1}^{L} F(W_i)\, \frac{p(W_i|T)}{q(W_i)} \]
As the normalization of $p(W|T)$ itself requires an integration of the type (15.36) with $F(W) = 1$ (e.g. see the calculation of $Z_W$, $Z_T$ in the previous sections), the integral may be approximated by using the non-normalized distribution $\widetilde{p}(W|T)$:

15.8 See [Bis95] pp. 425-429.
\[ I \simeq \frac{\sum\limits_{i=1}^{L} F(W_i)\, \frac{\widetilde{p}(W_i|T)}{q(W_i)}}{\sum\limits_{i=1}^{L} \frac{\widetilde{p}(W_i|T)}{q(W_i)}} \tag{15.37} \]
this procedure being called importance sampling.

The importance sampling method still has one problem requiring attention. In practice the posterior distribution is usually almost zero in all the weight space except some narrow areas (see section 15.1.3 and figure 15.2). In order to compute the integral (15.37) with enough precision it is necessary to choose an $L$ big enough such that the areas with significant posterior distribution $p(W|T)$ have adequate coverage.

To avoid the previous problem, the Metropolis algorithm was developed. The weight vectors in the sample set $\{W_i\}$ form a discrete time series of the form:
\[ W_{(t+1)} = W_{(t)} + \varepsilon \]
where $\varepsilon$ is a vector randomly chosen from a distribution (e.g. a Gaussian) with spherical symmetry; this kind of series is named a random walk. Then the new $W_{(t+1)}$ is accepted or rejected following the rules:
\[ \begin{cases} \text{accept} & \text{if } p(W_{(t+1)}|T) > p(W_{(t)}|T) \\ \text{accept with probability } \frac{p(W_{(t+1)}|T)}{p(W_{(t)}|T)} & \text{if } p(W_{(t+1)}|T) < p(W_{(t)}|T) \end{cases} \]
and, considering an error function of the form $E = -\ln p(W|T)$, the above rules may be rewritten in the form:
\[ \begin{cases} \text{accept} & \text{if } E_{(t+1)} < E_{(t)} \\ \text{accept with probability } \exp[-(E_{(t+1)} - E_{(t)})] & \text{if } E_{(t+1)} > E_{(t)} \end{cases} \tag{15.38} \]
The Metropolis algorithm still leaves a problem with respect to local minima. Assuming that the weights are strongly correlated, around a (local) minimum the hypersurfaces corresponding to constant distribution $p(W|T) = \text{const.}$ are highly elongated hyperellipses19. As the distribution $p(\varepsilon)$ of $\varepsilon$ has spherical symmetry, this will lead, according to the rules (15.38), to many rejected $W_{(t+1)}$, and the algorithm has a tendency to slow down around the local minima. See figure 15.5 on the next page.

To correct this problem, the rules (15.38) may be changed to:
\[ \begin{cases} \text{accept} & \text{if } E_{(t+1)} < E_{(t)} \\ \text{accept with probability } \exp\left[ -\frac{E_{(t+1)} - E_{(t)}}{T_{(t+1)}} \right] & \text{if } E_{(t+1)} > E_{(t)} \end{cases} \]
leading to the algorithm named simulated annealing. The $T_{(t+1)}$ parameter, named the "temperature", is chosen to have a large starting value $T_{(0)} \gg 1$ and to decrease in time; this way the algorithm jumps quickly over local minima found near the starting point. For $T = 1$, simulated annealing is equivalent to the Metropolis algorithm.

19 See chapter "Parameter Optimization", section "Local quadratic approximation".
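A compact sketch of the Metropolis sampler over the weights (Gaussian spherical proposals; `neg_log_post` is a placeholder for $-\ln \widetilde{p}(W|T)$; at fixed temperature `T > 1` it behaves like one step-schedule of simulated annealing):

import numpy as np

def metropolis(neg_log_post, w0, n_samples, step=0.1, T=1.0, seed=0):
    """Random-walk Metropolis sampling of p(W|T), rules (15.38)."""
    rng = np.random.default_rng(seed)
    w, E = w0.copy(), neg_log_post(w0)
    samples = []
    for _ in range(n_samples):
        w_new = w + step * rng.normal(size=w.shape)   # spherical proposal
        E_new = neg_log_post(w_new)
        if E_new < E or rng.random() < np.exp(-(E_new - E) / T):
            w, E = w_new, E_new                       # accept the move
        samples.append(w.copy())                      # otherwise keep old W
    return np.array(samples)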
Figure 15.5: The Metropolis algorithm. Only the $W_{(t+1)}$ which fall in the "area" which may be represented (sort of) by the intersection of the dotted circle ($p(\varepsilon) = \text{const.}$) and the ellipse $E = \text{const.}$ are certainly accepted. Only some of the other $W_{(t+1)}$, from the rest of the circled area, are accepted. As the ratio between the (hyper)volumes of the two areas (of certain acceptance, respectively partial acceptance) shrinks with the weight-space dimensionality, the algorithm slows down more around local minima in highly dimensional spaces, i.e. the problem worsens as the number of weights increases.

➧ 15.9 Minimum Description Length
Let us consider that a "sender" wants to transmit some data $D$ to a "receiver" such that the message has the shortest possible length. Beyond the simple method of sending the data itself, if the quantity of data is sufficiently large then there is a possibility to shorten the message by sending a model $\mathcal{M}$ of the data plus some information regarding the difference between the actual data set and the data set generated by the model. In this situation the message length will be the sum between the length of the model description $L(\mathcal{M})$ and the length of the difference $L(D|\mathcal{M})$, which is model dependent:
\[ \text{message length} = L(\mathcal{M}) + L(D|\mathcal{M}) \]
The $L(\mathcal{M})$ quantity may also be seen as a measure of model complexity (as the more complex a model is, the bigger its "description" is) and $L(D|\mathcal{M})$ may also be seen as the error of the model (the difference between the model output and the actual data targets). Then the message length may be written as:
\[ \text{message length} = \text{model complexity} + \text{error} \tag{15.39} \]
The more complex the model is, i.e. the bigger $L(\mathcal{M})$ is, the more accurate its predictions are, and thus the error $L(D|\mathcal{M})$ is small. Reciprocally, a simple model, i.e. $L(\mathcal{M})$ small, will

15.9 See [Bis95] pp. 429-431.
generate many errors, leading to a large $L(D|\mathcal{M})$. This reasoning implies that there should be an optimal balance (tradeoff) between the two, resulting in a minimal message length.

For some variable $x$, the information needed to be transmitted is $-\ln p(x)$, where $p(x)$ is the probability density20. Then:
\[ \text{message length} = -\ln p(\mathcal{M}) - \ln p(D|\mathcal{M}) \]
and, by using the Bayes theorem $p(\mathcal{M}|D) = \frac{p(D|\mathcal{M})\, p(\mathcal{M})}{p(D)}$ (where $p(\mathcal{M}|D)$ is the probability density of the model $\mathcal{M}$, given the data $D$), it becomes:
\[ \text{message length} = -\ln p(\mathcal{M}|D) - \ln p(D) \]
Let us consider that the model $\mathcal{M}$ represents a neural network. Then the message length becomes the length of the weight vector and data, given the specified model, $L(W, D|\mathcal{M})$. The model complexity is measured by the probability of the weight vector given the model, $-\ln p(W|\mathcal{M})$, and the error is calculated given the weight vector and the model, $-\ln p(D|W, \mathcal{M})$. Equation (15.39) becomes:
\[ L(W, D|\mathcal{M}) = -\ln p(W|\mathcal{M}) - \ln p(D|W, \mathcal{M}) \tag{15.40} \]
To transmit the distributions, both the sender and the receiver must agree upon the general form of the distributions. Let us consider the weight distribution as a Gaussian with zero mean and $1/\alpha$ variance:
\[ p(W|\mathcal{M}) = \left( \frac{\alpha}{2\pi} \right)^{N_W/2} \exp\left( -\frac{\alpha}{2}\, \|W\|^2 \right) \]
and the error distribution as a Gaussian centered around the data to be transmitted. Assuming one network output $y$ and $P$ targets $\{t_p\}$ to be transmitted:
\[ p(D|W, \mathcal{M}) = \left( \frac{\beta}{2\pi} \right)^{P/2} \exp\left( -\frac{\beta}{2} \sum_{p=1}^{P} [y(x_p) - t_p]^2 \right) \]
The message length then becomes a sum-of-squares error function with a weight decay regularization term:
\[ L(W, D|\mathcal{M}) = \frac{\beta}{2} \sum_{p=1}^{P} [y(x_p) - t_p]^2 + \frac{\alpha}{2}\, \|W\|^2 + \text{const.} \]

20 See chapter "Error Function", section "Entropy".
➧ 15.10 Performance Of Models

15.10.1 Risk Averaging

Given an input vector $x$, a classifier $M$ will categorize it as being of the class $\mathcal{C}_k$ for which the posterior probability is maximum (according to the Bayes rule21): $P(\mathcal{C}_k|x) = \max\limits_\ell P(\mathcal{C}_\ell|x)$. Then the

15.10.1 See [Rip96] pg. 68.
21 See chapter "Pattern Recognition".
15.10. PERFORMANCE OF MODELS
299
probability of the model to make a correct classi cation equals to the probability of the class
chosen (according to the Bayes rule and for given x) to be the correct one, i.e. its posterior
probability:
Pcorrect (x) = P (Ck jx) = Efmax P (C` jx)g
`
(as the posterior probability is also W -dependent, and W have a distribution associated,
the expected value of maximum was used).
The probability of misclassi cation is the complement of probability of correct classi cation:
Pmc (x) = 1 ; Pcorrect (x) = 1 ; Efmax P (C` jx)g
`
The above formula for misclassi cation does not depend on correct classi cation of x and
then it may be used successfully in the situations where gathering raw data is easy but the
number of classi ed patterns is low. This procedure is known as risk averaging .
K
P
As the probabilities are normated, i.e. P (Ck jx) = 1 then the worst value for max
` P (C` jx)
k=1
is the one for which P (Ck jx) = 1=K , 8k = 1; K , and:
Ef1 ; max
` (C` jx)g
2
6 1 ; 1 Ef1 ; max
` (C` jx)g
P
P
K
The variance22 of max
` P (C` jx) is:
2
Vfmax
i (Ci jx)g = Ef[(1 ; max
` (C` jx)) ; Ef1 ; max
` (C` jx)g] g
P
P
P
2
= Ef(1 ; max
` (C` jx)) g ; Ef1 ; max
` (C`jx)g
P
6 1; 1
K
✍
P
P
mc (x) ; Pmc (x)
2
Remarks:
➥
22
2
In the process of estimating the probability of misclassi cation is better to use
the posterior class probability P (Ck jx) given by the Bayesian inference, rather
than the one given the most probable W set of parameters, because it takes
into account the variability of W and gives superior results especially for small
probabilities.
Variance of a random variable x being de ned as Vf(x ; Efxg)2 g.
risk averaging
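Risk averaging is trivial to implement once the classifier outputs posterior probabilities; a minimal Python sketch (the posterior values below are hypothetical):

    import numpy as np

    def misclassification_estimate(posteriors):
        """Estimate P_mc = 1 - E{max_l P(C_l|x)} from unlabeled data:
        no class labels are needed, only the model's posteriors."""
        return 1.0 - np.mean(np.max(posteriors, axis=1))

    # hypothetical posteriors P(C_l|x) for 4 unlabeled patterns, 3 classes
    posteriors = np.array([[0.70, 0.20, 0.10],
                           [0.40, 0.40, 0.20],
                           [0.90, 0.05, 0.05],
                           [0.34, 0.33, 0.33]])
    print(misclassification_estimate(posteriors))   # 0.415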
CHAPTER 16
Tree Based Classifiers

➧ 16.1 Tree Classifiers

See [Rip96] pp. 213–216.

Decision trees are usually built from top to bottom. At each (nonterminal) node a decision is made, until a terminal node, also named a leaf, is reached. Each leaf should contain a class label; each nonterminal node should contain a decisional question. See figure 16.1 on the following page.

The main problem is to build the tree classifier using a training set (obviously of limited size). This problem may be complicated by overlapping between class areas, in which case noise is present.

The tree is built by transforming a leaf into a decision-making node and growing the tree further down; this process is named splitting. In the presence of noise the resulting tree may be overfitted (on the training set), so some pruning may be required.

✍ Remarks:
➥ As the pattern space is separated into decision areas (by the decision boundaries), tree based classifiers may be seen as a hierarchical way of describing the partition of the input space.
➥ Usually there are many possibilities to build a tree structured classifier for the same classification problem, so an exhaustive search for the best one is not possible.
Figure 16.1: The tree based classifier. [Diagram: a decisional question at each decision-making node, with paths descending from the previous level down to the leaves, which carry the class labels.]

➧ 16.2 Splitting

See [Rip96] pp. 216–221.
In general the tree is built considering one feature (i.e. one component of the input vector) at a time. For binary features the choice is obvious, but for continuous ones the problem is more difficult, especially if a small subset of features may greatly simplify the emerging tree.

16.2.1 Impurity based method

One method of deciding over the splitting (feature selection and decision boundaries among features) is to increase "purity", i.e. the pattern vectors which pass through a newly built path should be, with greater probability, from some class(es) rather than from others. Alternatively the target is to decrease "impurity", which is easier to define in quantitative terms.

The impurity i(n) at the output of node n should be defined such that it is zero if all P(C_k|x) are zero except one (which will have the value 1, due to normalization) and maximum if all P(C_k|x) are equal. Two definitions of impurity are widely used; the probabilities refer to the current node n:

• Entropy: i(n) = −Σ_{k=1}^K P(C_k|x) ln P(C_k|x).
  Because lim_{P→0} P ln P = 0 (by L'Hospital's rule), ln 1 = 0 and P ≤ 1 ⇒ ln P ≤ 0, the defining conditions for impurity are met.

• Gini index: i(n) = Σ_{k,ℓ=1; ℓ≠k}^K P(C_k|x) P(C_ℓ|x) = 1 − Σ_{k=1}^K P²(C_k|x)
  (the last equality is derived from the normalization condition, squared, i.e. (Σ_{k=1}^K P(C_k|x))² = 1).
The average decrease in impurity after splitting by feature x is:

    Δi(n) = i(n) − ∫_X p(x) i(n(x)) dx

(where n(x) denotes the child node reached by a pattern with feature value x), such that usually the algorithm building the tree classifier will try to choose the feature which maximizes the above expression.

The average impurity of the whole tree may be defined as:

    i_tree = Σ_{k=1}^K q_k i(k)    (16.1)

where q_k is the probability of a pattern vector reaching leaf k (assuming that the final tree has a leaf for each class).
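Both impurity measures are immediate to compute from the class probabilities at a node; a small Python sketch, for illustration only:

    import numpy as np

    def entropy_impurity(p):
        """i(n) = -sum_k P(C_k|x) ln P(C_k|x); terms with P = 0 contribute 0."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log(nz))

    def gini_impurity(p):
        """i(n) = 1 - sum_k P^2(C_k|x)."""
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    print(entropy_impurity([1.0, 0.0, 0.0]))   # 0.0  (pure node)
    print(gini_impurity([1/3, 1/3, 1/3]))      # 2/3  (maximum for K = 3)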
16.2.2 Deviance based method

Another approach to the building of the decisional tree is to consider it as a probabilistic model. Each leaf k may be associated with a distribution which shows the probability that a pattern reaching the node is of some particular class, i.e. P(C_ℓ, k).

Considering P(n) the probability of a pattern reaching node n, the conditional probability P(C_ℓ|n) is:

    P(C_ℓ; n) = P(C_ℓ|n) P(n)  ⇒  P(C_ℓ|n) = P(C_ℓ; n)/P(n)    (16.2)

Also, taking the number of patterns (from the training set) arriving at leaf k and of class C_ℓ as P_kℓ, the likelihood of the training set is:

    L = Π_{k=1}^K Π_{ℓ=1}^K [P(C_ℓ|k)]^{P_kℓ}

The deviance (see chapter "Pattern Recognition" for the definition) is:

    D_tree = 2(ln L)_{perfect model} − 2 ln L = Σ_{k=1}^K D_k, where D_k = −2 Σ_{ℓ=1}^K P_kℓ ln P(C_ℓ|k)    (16.3)

because for the perfect model p(C_ℓ|k) = 1 for P_kℓ > 0 and equals zero otherwise (and lim_{x→0} x ln x = 0), and thus the deviance term associated with the perfect model cancels.

If the total number of patterns arriving at leaf k is P_k, then an estimate of p(C_ℓ|k) would be p̂(C_ℓ|k) = P_kℓ/P_k (also, from (16.2): p(C_ℓ, k) ∝ P_kℓ and p(k) ∝ P_k; note also that the training set is assumed to be an unbiased sample from the true distribution).
From (16.3), and using Σ_{ℓ=1}^K P_kℓ = P_k:

    D_tree = 2 (Σ_{k=1}^K P_k ln P_k − Σ_{k,ℓ=1}^K P_kℓ ln P_kℓ)

Considering the tree impurity (16.1), then q_k would be q_k = P_k/P and, for the entropy impurity:

    i(k) = −Σ_{ℓ=1}^K (P_kℓ/P_k) ln(P_kℓ/P_k)  ⇒  i_tree = −Σ_{k,ℓ=1}^K (P_k/P)(P_kℓ/P_k) ln(P_kℓ/P_k) = D_tree/(2P)

and so the entropy and deviance based splitting methods are equivalent.
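The equivalence D_tree = 2P·i_tree may be checked numerically from a table of counts P_kℓ; a Python sketch with hypothetical counts:

    import numpy as np

    # hypothetical counts P_kl: rows = leaves k, columns = classes l
    counts = np.array([[8.0, 2.0],
                       [1.0, 9.0]])

    P_k = counts.sum(axis=1)          # patterns reaching each leaf
    P = counts.sum()                  # total number of patterns
    probs = counts / P_k[:, None]     # estimates p(C_l|k) = P_kl / P_k

    mask = counts > 0                 # convention: x ln x -> 0 for x = 0
    logp = np.zeros_like(probs)
    logp[mask] = np.log(probs[mask])

    D_tree = -2.0 * np.sum(counts * logp)   # deviance (16.3)
    i_tree = -np.sum((counts / P) * logp)   # entropy impurity of the tree
    print(D_tree, 2.0 * P * i_tree)         # identical: D_tree = 2 P i_tree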
✍ Remarks:
➥ In practice it may well happen that the training set is biased, e.g. it contains a larger number of examples from rare classes than would have occurred in a random sample. In this case the probabilities P(C_ℓ|k) should be estimated separately. One way to do this is to attach "weights" to the training patterns and consider P_k, P_kℓ and P as the sums of "weights" rather than the actual counts of patterns.
➥ If there are "costs" associated with misclassification then these may be inserted directly into the Gini index, in the form:

    i(n) = Σ_{k,ℓ=1; k≠ℓ}^K C_kℓ P(C_k|x) P(C_ℓ|x)

where C_kℓ is the cost of misclassification between classes C_k and C_ℓ. Note that this leads to completely symmetrical "costs", as the total coefficient of P(C_k|x) P(C_ℓ|x) (in the above sum) is C_kℓ + C_ℓk. Thus this approach is ineffective for two-class problems.
➧ 16.3 Pruning

See [Rip96] pp. 221–225.

Let R(T) be a measure of the tree T such that the better T is, the smaller R(T) becomes, and such that it has contributions from (and only from) all (its) leaves. Possible choices are the total number of misclassifications over a training/test set, the entropy, or the deviance. Let S(T) be the size of the tree T, proportional to the number of leaves.

A good criterion for characterizing the tree is:

    R_α(T) = R(T) + α S(T)    (16.4)

which is minimal for a good one. α is a positive parameter which penalizes the tree size; for α = 0 the tree is chosen based only on the error rate.

For a given R_α(T) there are several possible trees; let T(α) be the optimal one.
Consider a tree T and, for a non-leaf node n, the subtree T_n having as root the node n. Let g(n, T) be a measure of the reduction in R obtained by adding T_n at node n, relative to the increase in size:

    g(n, T) = [R(n) − R(T_n)] / [S(T_n) − S(n)]    (16.5)

S(n) is the size corresponding to the single node n; it is assumed that T_n represents a subtree with at least two nodes (it doesn't make sense to add just one leaf to another one) and thus S(T_n) > S(n). R(n) is the measure R of node n, considering it a leaf.

From (16.4): R_α(n) = R(n) + α S(n) and R_α(T_n) = R(T_n) + α S(T_n), and using (16.5), g(n, T) may be written as:

    g(n, T) = [R_α(n) − R_α(T_n)] / [S(T_n) − S(n)] + α    (16.6)

As the denominator is always positive:

    g(n, T) ≥ α  ⇔  R_α(n) ≥ R_α(T_n)    (16.7)
Proposition 16.3.1. Consider a tree T and number its nodes from the bottom up, such that each child node has a number label smaller than its parent node. Visit each node in its number order (i.e. from the bottom up) and prune at the current node n if R_α(n) ≤ R_α(T′_n), where T′ is the current tree. After visiting all nodes the result is T(α).

Proof. It is demonstrated by induction. For the unpruned tree T all leaves are optimally pruned. Consider a current node n. This one is either pruned, with the value R_α(n), or it is not, in which case

    R_α(T′_n) = Σ_{branches B} R_α(T′_B) < R_α(n)

the sum being done over all branches B of node n. But if it is not pruned then it is not possible to have a tree T″_n with a smaller R_α, because in that case there would be (at least) one branch B such that R_α(T″_B) < R_α(T′_B), and thus T′_n would not be optimally pruned; i.e. if the tree is not pruned at node n then the whole subtree (from node n downwards) is optimally pruned. After analyzing the last node, which (according to the numbering scheme) is the root of the whole tree, the tree is optimally pruned. ∎

Proposition 16.3.2. Let α₁ be the smallest value of g(n, T) over the non-leaf nodes of T. The optimally pruned tree is either T, for α < α₁, or T(α₁), obtained by pruning all nodes with g(n, T) = α₁.

Proof. α₁ is chosen such that α₁ = min_n g(n, T) ≤ g(n, T), ∀n non-leaf node. Consider first the case α < α₁. Then α < g(n, T) for all non-leaf n and thus, from (16.7), it follows that R_α(n) > R_α(T_n) for all non-leaf nodes; according to the previous proposition no pruning is performed and the tree is T(α). Considering the second case, after pruning all nodes with g(n, T) = α₁ = min_n g(n, T), for all non-terminal nodes left in the tree g(n, T) > α₁. This means that, according to (16.7), R_{α₁}(n) > R_{α₁}(T_n) and, by using the previous proposition, no node pruning takes place and the tree is T(α₁). ∎

Proposition 16.3.3. For β > α, T(β) is a subtree of T(α).
Proof. It will be shown by induction that T_n(β) is a subtree of T_n(α) for any node n, and thus for the root node as well. The proposition is true for all terminal nodes (leaves). It has to be shown that if R_α(n) ≤ R_α(T_n) is true then R_β(n) ≤ R_β(T_n) is also true, and thus, when pruning by the method described in the first proposition (above), T(α) will contain T(β). From (16.4):

    R_α(n) = R(n) + α S(n) and R_β(n) = R(n) + β S(n)  ⇒  R_β(n) = R_α(n) + (β − α) S(n)

and also (in the same way):

    R_β(T_n) = R_α(T_n) + (β − α) S(T_n)

and by subtracting the above two equations:

    R_β(n) − R_β(T_n) = [R_α(n) − R_α(T_n)] + (β − α)[S(n) − S(T_n)]    (16.8)

Considering R_α(n) ≤ R_α(T_n) and, as β > α and S(n) < S(T_n), then R_β(n) − R_β(T_n) ≤ 0. ∎

The last two propositions show the following:

• There is a series of parameters α₁ < α₂ < ... found by ordering all g(n, T). For each interval (α_{i−1}, α_i] there is only one optimal tree T(α_i).
• For j > i, T(α_j) is embedded in T(α_i) (as α_j > α_i, by applying the last proposition).
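A minimal Python sketch of the g(n, T) computation behind this weakest-link pruning (the tree representation and the R values are hypothetical; R is taken here as a misclassification count and S as the number of leaves):

    def leaves(tree):
        """Number of leaves S(T_n) of a subtree."""
        if 'children' not in tree:
            return 1
        return sum(leaves(c) for c in tree['children'])

    def subtree_R(tree):
        """R(T_n): sum of the leaf measures (misclassification counts)."""
        if 'children' not in tree:
            return tree['R']
        return sum(subtree_R(c) for c in tree['children'])

    def g(node):
        """g(n,T) = [R(n) - R(T_n)] / [S(T_n) - S(n)], eq. (16.5)."""
        return (node['R'] - subtree_R(node)) / (leaves(node) - 1)

    # hypothetical tree: 'R' at internal nodes = error if pruned to a leaf
    tree = {'R': 10, 'children': [
        {'R': 4, 'children': [{'R': 1}, {'R': 2}]},
        {'R': 3, 'children': [{'R': 0}, {'R': 1}]},
    ]}
    # alpha_1 = min over non-leaf nodes of g(n,T); prune where g(n,T) = alpha_1
    print(g(tree), g(tree['children'][0]), g(tree['children'][1]))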
➧ 16.4 Missing Data

See [Rip96] pp. 231–233.

One of the advantages of tree classifiers is the ease with which missing data may be treated. The alternatives, when data is partially missing from a pattern vector, are:

• "Drop". Work with the available data, "pushing" the pattern down the tree as far as possible, and then use the distribution at the reached node (if not a leaf, of course) to make a prediction.
• Surrogate splits. Create a set of surrogate split rules at non-terminal nodes, to be used if the real data is missing.
• Missing feature. Consider "missing" as a possible/supplemental value of the feature and create a separate branch/split/sub-tree for it.
• Split examples. When a pattern vector (from the training set) reaches a node where a split should occur over one of its missing features, it is split in fractions over the branch nodes. Theoretically this should be done using a posterior probability conditioned on the available data, but this is seldom available. However, it is possible to use the probability of going along the node's branches, estimated from the complete (without missing data) pattern vectors.
CHAPTER 17
Belief Networks

➧ 17.1 Graphs

See [Rip96] pp. 243–249.

Definition 17.1.1. A graph is a collection of vertices and edges. The vertices represent (in the context of belief networks) a set of random variables (each drawn from some distribution). An edge represents a pair of distinct vertices. If the pair is ordered then the graph is directed, otherwise the graph is undirected. For ordered edges, the first vertex is named parent and the second child.

Graphs are represented visually by using nodes for vertices and connecting lines for edges; ordered edges are shown using arrows. See figure 17.1.

Definition 17.1.2. A path is a list of vertices, each of them connected through an edge. A cycle is a path which ends on the same vertex where it started and does not go through any vertex more than once. A subgraph is a subset of vertices (from the same graph) together with all the edges connecting them. A (sub)graph is connected if there is a path for every possible pair of vertices. A (sub)graph is complete if every possible edge is present. A maximal complete subgraph (of a graph) is named clique.

Definition 17.1.3. A tree is a connected graph with no cycles. A directed tree is a connected directed acyclic graph, abbreviated DAG. A directed tree has the property that there is a vertex, named root, such that there is a directed path from the root to any other vertex, and any vertex except the root has exactly one incoming edge (arrow), i.e. just one parent.

Definition 17.1.4. An ancestral subgraph of a directed graph contains all the parents of its vertices, i.e. it also contains the root (which has no parent). A polytree is a singly connected graph, i.e. there is only one path between any two vertices.

Figure 17.1: A graph. The one represented here is unconnected, has a cycle and one ordered edge (shown by the arrow).
17.1.1 Markov Properties

See [Rip96] pp. 249–252.

Definition 17.1.5. Consider three subsets of vertices A, B, C ⊆ V, where V is the whole set of vertices of a graph G. Then C separates A and B in G if any path from any vertex in A to any vertex in B has to pass through a vertex from C. Let x_A be the set of (random) variables associated with A, and similarly x_B, x_C. The conditional independence of x_A and x_B, given x_C, is written as:

    x_A ⊥ x_B | x_C, or (in short) A ⊥ B | C

Definition 17.1.6. Let A^C be the complementary set of A, i.e. A^C = V \ A. The boundary of A, notated ∂A, is the set of all vertices from A^C which are connected to a vertex in A through an edge.

Definition 17.1.7. The following Markov properties are defined:

1. Global: if, for any subsets A, B and C of vertices such that C separates A and B, it is true that x_A ⊥ x_B | x_C.
2. Local: if, for any subset A, the conditional distribution of x_A given x_{V\{A∪∂A}} depends only on x_∂A, i.e. x_A ⊥ x_{V\{A∪∂A}} | x_∂A. Otherwise said, the x_A variables and those not directly connected to them are conditionally independent, given the boundary.
3. Pairwise: if, for any subsets A and B with no edge from A to B, x_A and x_B are conditionally independent given all the other (stochastic) variables.

Proposition 17.1.1. 1. Consider a set of (random) variables defined on the vertices of a graph G. If the graph has the pairwise Markov property then there exists a set of positive functions φ_C(x_C), defined over the cliques of G, symmetric in their arguments, such that

    P(x_V) ∝ Π_{cliques C} φ_C(x_C)    (17.1)
i.e. the probability of the graph's random variables having the values x_V (all the values being collected together into a vector) is proportional to the product (over its cliques) of the functions φ_C (the values of the components of x_C are the same as the corresponding components of x_V, vertex-wise).

2. If P(x_V) may be represented in the form (17.1) then the graph has the global Markov property.
Proof. 1. It is assumed that for any vertex s of G the associated variable may take the value 0 (this can be arranged by some "re-labeling" procedure if it is not true initially). It will be shown by induction over the size of the subgraph A = {vertex s | x_s ≠ 0} ⊆ V_G. The desired probability is built in the form:

    P(x_V) = P(x_V = 0̂) Π_{C⊆A} φ_C(x_C)    (17.2)

where φ_C is defined recursively as:

    φ_C(x_C) = P(x_C, x_{C^C} = 0̂) / [P(x_V = 0̂) Π_{D⊂C} φ_D(x_D)]    (17.3)

and for the empty set D = ∅, φ_D(0̂) ≡ 1 (the product over D being done over all strict subsets of C). Also φ_C ≡ 1 for C non-complete.

Now, for the cases when A = ∅ and when A contains just one vertex, equations (17.2) and (17.3) give the identities P(0̂) = P(0̂), respectively P(x_A) = P(x_A, x_{A^C} = 0̂), so (17.1) holds. The (17.2) and (17.3) equations are condensed to:

    P(x_A) = P(0̂) Π_{C⊆A} P(x_C, x_{C^C} = 0̂) / [P(x_V = 0̂) Π_{D⊂C} φ_D(x_D)]

If A is complete then any of its subgraphs is also part of one of its cliques, so the above equation may be written in the form (17.1). So the proposition also holds for A complete.

Assume now that (17.1) holds for a non-complete A having i vertices, and study a new A with i + 1 vertices. As A is not complete, it is possible to write A = B ∪ s ∪ t, where B is a subgraph of A having i − 1 vertices and s and t are not neighbors (there is no edge (s, t)), i.e. x_s ⊥ x_t | x_B, x_{A^C}. Also:

    P(x_V) = P(x_B, x_s, x_t, x_{A^C} = 0̂)
           = P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) P(x_t | x_B, x_s, x_{A^C} = 0̂) / P(x_t = 0 | x_B, x_s, x_{A^C} = 0̂)

and, considering the conditional independence of s and t:

    P(x_V) = P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) P(x_t | x_B, x_s = 0, x_{A^C} = 0̂) / P(x_t = 0 | x_B, x_s = 0, x_{A^C} = 0̂)
           = P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) P(x_t, x_B, x_s = 0, x_{A^C} = 0̂) / P(x_t = 0, x_B, x_s = 0, x_{A^C} = 0̂)

By using (17.2) (supposed true by induction):

    P(x_B, x_s = 0, x_t = 0, x_{A^C} = 0̂) = P(x_V = 0̂) Π_{C⊆B} φ_C(x_C), for x_{B^C} = 0̂
    P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) = P(x_V = 0̂) Π_{C⊆{B∪s}} φ_C(x′_C), for x_{{B∪s}^C} = 0̂
    P(x_B, x_s = 0, x_t, x_{A^C} = 0̂) = P(x_V = 0̂) Π_{C⊆{B∪t}} φ_C(x″_C), for x_{{B∪t}^C} = 0̂

and then:

    P(x_V) = P(x_V = 0̂) [Π_{C⊆{B∪s}} φ_C(x′_C)] [Π_{C⊆{B∪t}} φ_C(x″_C)] / Π_{C⊆B} φ_C(x_C)

As a complete subgraph C of A cannot contain both vertices s and t (because there is no edge between them), P(x_V) can finally be written as:

    P(x_V) = P(x_V = 0̂) Π_{C⊆{B∪s∪t}} φ_C(x_C)

which shows that the A subgraph of size i + 1 may be written in the same form as the A subgraph of size i.

2. It is assumed that (17.1) is true. For a subgraph A ⊆ G:

    P(x_V) = P(x_A | x_{A^C}) P(x_{A^C})  ⇒  P(x_A | x_{A^C}) = P(x_V)/P(x_{A^C})

For P(x_V) the formula (17.1) may be used directly; for P(x_{A^C}), as only x_{A^C} is fixed and x_A may take any value, a sum over all x_A values is to be performed:

    P(x_A | x_{A^C}) = Π_{C⊆G} φ_C(x_C) / Σ_{x_A} Π_{C⊆G} φ_C(x_C) = Π_{C∩A≠∅} φ_C(x_C) / Σ_{x_A} Π_{C∩A≠∅} φ_C(x_C)

because, in the denominator, the terms corresponding to cliques C disjoint from A (i.e. C ∩ A = ∅) may be extracted as a common multiplicative factor (of the sum) and then canceled with the corresponding terms of the numerator.

Let C be a clique with some of its vertices in A. Then C ⊆ {A ∪ ∂A} (assume by absurd that there is a vertex s ∈ A, s ∈ C, and another one t ∉ {A ∪ ∂A}, t ∈ C; as s, t ∈ C and C is complete, the edge (s, t) exists, so t ∈ {A ∪ ∂A}, contradiction). This means that the right-hand side of the above equation is in fact just a function of x_{A∪∂A}, i.e. P(x_A | x_{A^C}) = P(x_A | x_∂A).

Let A, B, C be such that C separates A and B. Let B′ be the set of vertices which may be reached from B using a path not going through C, thus B ⊆ B′, and let D = {B′ ∪ C}^C. Then B′, C and D are disjoint by construction (i.e. B′ ∩ C = ∅, B′ ∩ D = ∅ and C ∩ D = ∅). As A is separated from B by C, while B′ is not, no vertex from A may be in either C or B′, thus A ⊆ D. By construction B′ ∪ C ∪ D = V_G (and they are disjoint). Also, as B′ is formed by all the vertices which may be connected to B through a path not passing through C, B′ and D are separated by C, i.e. B′ ⊥ D | C and, as A ⊆ D and B ⊆ B′, then A ⊥ B | C, and thus the global Markov property holds. ∎

17.1.2 Markov Trees
See [Rip96] pp. 252–254.

Considering a tree, there is a unique path between any two nodes; an undirected tree may be transformed into a directed one by choosing any vertex as root and then orienting the edges so as to have the same direction as the paths leading from the root to the other vertices. The simplest tree is the chain, where each vertex has just one parent and one child, and each vertex splits the graph into two conditionally independent subgraphs.

✍ Remarks:
➥ Markov chains may be used in time series; conditional independence in this case may be expressed as "past and future are independent, given the present".

Consider that, for directed chains, each vertex is labeled with a number such that for each i its parent is i − 1, the root having label 1. Then the probability of the whole tree is:

    P(x_V) = P(x₁) Π_{i; i≠1} P(x_i | x_{i−1})

(the root doesn't have a parent and thus its probability is unconditioned). For directed trees the above formula is slightly modified to:

    P(x_V) = P(x₁) Π_i P(x_i | x_j), j being the parent of i

Consider a vertex t and its associated random variable x_t. Consider that the parent (ancestor) of t is s and that the associated x_s may take the values x_s ∈ {x_{s1}, ...}. Then the distribution p_t of x_t is expressed by means of the distribution p_s at vertex s:

    p_t(x_t) = Σ_i p_s(x_{si}) P(x_t | x_{si})

(the sum being done over the possible values of x_s). Thus, given the distribution of the root, the other distributions may be calculated recursively, from the root to the leaves.
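For a chain, the recursive computation above reduces to repeated matrix-vector products; a Python sketch (the root distribution and the transition matrices are hypothetical):

    import numpy as np

    # hypothetical root distribution p_1 and transition matrices:
    # T[i, j] = P(x_child = j | x_parent = i) for each edge of the chain
    p = np.array([0.6, 0.4])
    transitions = [np.array([[0.9, 0.1],
                             [0.2, 0.8]]),
                   np.array([[0.7, 0.3],
                             [0.4, 0.6]])]

    marginals = [p]
    for T in transitions:           # p_t(x_t) = sum_i p_s(x_si) P(x_t|x_si)
        p = p @ T
        marginals.append(p)
    print(marginals)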
Let E_s^− be the events on the descendants of s, E_s^+ all the other events, and E_s = E_s^− ∪ E_s^+. The distribution p_s(x_s) = P(x_s|E_s) is to be found (i.e. given the values of some random variables at the vertices of the graph, the problem is to find the probability of a random variable taking a particular value at a given vertex). Then (using the Bayes theorem):

    P(x_s|E_s) = P(x_s|E_s^−, E_s^+) ∝ P(E_s^−|x_s, E_s^+) P(x_s|E_s^+)

and, as E_s^− ⊥ E_s^+ | x_s, then P(E_s^−|x_s, E_s^+) = P(E_s^−|x_s) and:

    P(x_s|E_s) ∝ P(E_s^−|x_s) P(x_s|E_s^+)

The first term is:

    P(E_s^−|x_s) = 1 if s has no children, and P(E_s^−|x_s) = Π_{children t of s} P(E_t^−|x_s) otherwise,
    where P(E_t^−|x_s) = Σ_{x_t} P(E_t^−|x_t) P(x_t|x_s).

For the other term, x_s is conditionally separated from E_s^+ by its parent r:

    P(x_s|E_s^+) = Σ_{x_r} P(x_s|x_r) P(x_r|E_s^+)

(the sum being done over all possible values of x_r). The restriction over the possible values of x_r is given by E_s^+ through E_r^+ and E_q^−, where q is any child of r except s:

    P(x_r|E_s^+) = P(x_r|E_r^+) Π_{q≠s} P(E_q^−|x_r)

and finally p_s(x_s) may be calculated by using the above formulas recursively.
17.1.3 Decomposable Trees

See [Rip96] pp. 258–261.

Definition 17.1.8. A graph is named triangulated (or chordal) if every cycle of four vertices/edges or more has a chord, i.e. an edge connecting two non-consecutive vertices.

Definition 17.1.9. A join tree is a tree associated with a graph, having the graph's cliques as vertices. The vertices of the tree are connected in such a way that, for any vertex of the graph, the cliques containing it form a connected part of the join tree (i.e. if all the cliques not containing that vertex are removed, the remaining join tree stays connected).

As a direct consequence, considering two cliques containing the same vertex (of the graph), all the cliques sitting on the path between the two in the join tree contain that vertex.

For a triangulated graph, the join tree may be built according to the following procedure:

1. Number the graph vertices: starting at any point, number the vertices of the graph by maximum cardinality search, i.e. always number next the vertex with the maximum number of already numbered neighbors.
2. Starting with the highest numbered vertex, check whether all its lower numbered neighbors are also neighbors among themselves. If they are, the original graph is triangulated; if they are not, the graph may be triangulated by adding the missing edges.
3. Identify all cliques and order them by the highest numbered vertex in each clique.
4. Build the join tree by connecting each clique with a predecessor sharing the most common vertices.

The above building procedure also ensures that the (unique) path (in the join tree) from clique C₁ to some C_i passes through cliques having increasing order numbers.

For i ≥ 2 let C_{j(i)} be the parent of C_i in the join tree. Let S_i = C_i ∩ (C₁ ∪ ... ∪ C_{i−1}). Then S_i ⊆ C_{j(i)}.

Proof. As j(i) is one of 1, ..., i − 1, C_{j(i)} contains all the vertices of C_i ∩ C_{j(i)}. It is not possible to have a vertex s such that s ∈ C_i ∩ C_k, k ≠ j(i), and s ∉ C_{j(i)}, because of the way the join tree is built (steps 3 and 4). Alternatively: as s ∈ C_i ∩ C_k, s is contained by each clique on the path between C_i and C_k (by direct consequence of the definition, see above), and this path must go through C_{j(i)}, the parent of C_i, as it is a path in a tree and, by the way the cliques have been numbered, C_k lies somewhere above C_i (on the way to the root) or on another branch. ∎
Let H_i = C₁ ∪ ... ∪ C_{i−1} and R_i = C_i \ S_i. Then ∂R_i ∩ H_i is a complete subgraph.

Proof. Consider a vertex from ∂R_i: it belongs either to S_i or to H_i^C. Now consider an s ∈ ∂R_i ∩ H_i. From the previous reasoning, either s ∈ C_i ∩ H_i = S_i, or s ∈ ∂C_i ∩ H_i ⊆ S_i. In either case s ∈ S_i ⊆ C_{j(i)}, which is complete. ∎

Proposition 17.1.2. S_i separates R_i from H_i \ S_i, i.e. R_i ⊥ (H_i \ S_i) | S_i.

Proof. Consider a path P from R_i to H_i \ S_i which contains a vertex s ∈ R_j for some j > i, with s ∉ R_k for k > j (it may happen that there is no such k > j).

Let r and t be two vertices before and after s, in the order of numbering given by the join tree building procedure, such that r, t ∉ R_j but they are on P, with r ∈ R_i and t ∈ H_i (P starts somewhere in R_i and ends in H_i).

By the way r and t were selected, and also from the numbering procedure, r, t ∈ ∂R_j. Being in the vicinity of R_j, r and t are either in S_j ⊆ H_j or in ∂C_j ⊆ H_j. Thus r, t ∈ ∂R_j ∩ H_j. As ∂R_j ∩ H_j is complete, the edge (r, t) exists.

If the (r, t) edge exists then r and t are members of the same clique C_k. As r ∈ R_i then k ≥ i, and as r, t ∈ H_j then k < j. If k > i, as H_k ⊆ H_j, then r, t ∈ C_k ∩ H_k ⊆ C_ℓ, where ℓ < k. Repeat this procedure as necessary, considering C_ℓ instead of C_k, till ℓ = i and thus t ∈ C_i ∩ H_i = S_i, i.e. S_i separates R_i and H_i \ S_i. ∎
Proposition 17.1.3. A distribution which is decomposable with respect to a graph G can be written as the product of the distributions of the cliques C_i of G divided by the product of the distributions of their intersections S_i:

    P(x_V) = Π_i P(x_{C_i}) / Π_i P(x_{S_i})

known as the set-chain/marginal representation. If any denominator term is zero then the whole expression is considered zero.

Proof. For a given j: ∪_{i<j} R_i = ∪_{i<j} C_i = H_j. Then, by the chain rule, P(x_V) = Π_i P(x_{R_i} | x_{R_1}, ..., x_{R_{i−1}}) = Π_i P(x_{R_i} | x_{H_i}) and, as x_{R_i} ⊥ x_{H_i \ S_i} | x_{S_i} (see proposition 17.1.2), then P(x_V) = Π_i P(x_{R_i} | x_{S_i}).

As C_i = R_i ∪ S_i, then P(x_{C_i}) = P(x_{R_i}, x_{S_i}) = P(x_{R_i} | x_{S_i}) P(x_{S_i}), and the final result is obtained by replacing back into P(x_V). ∎
Proposition 17.1.4. Consider the sets of cliques C̃_A and C̃_B, in the join tree, separated by C̃_C. Considering that C̃_A contains the set of vertices A, C̃_B contains B and C̃_C contains the set of vertices C, then x_A ⊥ x_B | x_C, i.e. A and B are separated by C.

Proof. It is first proven, by contradiction, that C separates A \ C and B \ C on G. Assume that there is a path from v₀ ∈ A to v_n ∈ B, passing through vertices v₁, ..., v_{n−1} ∉ C. Assume that v₀ ∈ C₀ ∈ C̃_A (where C₀ is some clique containing v₀). Consider some vertex v_j such that v_{j−1}, v_j ∈ C_j (some C_j; note that (v_{j−1}, v_j) is an edge, so there is some clique containing it). As v_{j−1}, v_j ∉ C then C_{j−1}, C_j ∉ C̃_C. Then, in the join tree, the path from C_{j−1} to C_j does not pass through C̃_C, as all its cliques contain the vertex v_{j−1} (by the way the join tree was built).

In this way, by repeating the procedure (with v_{j−2}, v_{j−1}, etc.), it is possible to build a path from C̃_A to C̃_B not passing through C̃_C. Contradiction. Finally, by using the global Markov property, the proposition is demonstrated. ∎
✍ Remarks:
➥ Instead of working on the original graph G, it is possible to work on the triangulated G^M (obtained from G by the procedure previously explained). As G^M has all the edges of G plus (maybe) some more, the separation properties on G^M hold on the original G.

➧ 17.2 Causal Networks

See [Rip96] pp. 261–265.
Basically, causal networks are represented by DAGs. The vertices are numbered according to a topological sort order, i.e. each vertex has associated a number i greater than the order number of its parent j(i). Then, considering the graph, the probability of x_V is:

    P(x_V) = P(x₁) Π_{i>1} P(x_i | x_{j(i)})    (17.4)

(of course, the root having no parent, its distribution is unconditioned). The directions being from the root to the leaves, this also means that:

    x_k ⊥ x_i | x_{j(i)} for k < i

a DAG having the above property being also named a recursive model.

Definition 17.2.1. The moral graph of a DAG is built following the procedure:

1. All directed edges are replaced by undirected ones.
2. All the (common) parents of a vertex are joined by edges (by adding them if necessary).

Proposition 17.2.1. A recursive model on a DAG has the global Markov property and a potential representation on its associated moral graph, of the form (17.1).

Proof. The potential representation is built as follows:

1. The potential of each clique is set to 1.
2. For each vertex i, select a clique (any) which contains both i and its parent j(i), and multiply its potential by P(x_i | x_{j(i)}).

In this way the expression (17.4) is transformed into a potential representation of the form (17.1). By using proposition 17.1.1 it follows that the graph has the global Markov property. ∎
➧ 17.3 The Boltzmann Machine

See [Rip96] pp. 279–281.

The Boltzmann machine has binary random variables associated with the vertices of a completely connected graph. The probability of a variable x_i associated to vertex i is obtained considering a regression over all the other vertices. The joint distribution is defined as:

    P(x_V) = (1/Z) exp(Σ_{i,j; i<j} w_ij x_i x_j), where Z = Σ_{x_V} exp(Σ_{i,j; i<j} w_ij x_i x_j)

The parameters w_ij are the "connection weights", symmetric (w_ij = w_ji); Z is the normalization constant (obtained from Σ_{x_V} P(x_V) = 1).

The Boltzmann machine learns the joint distribution of some inputs x_I and outputs x_Y; some vertices x_H are "hidden". The joint probability over the (given) input and output x_S = x_I ∪ x_Y is obtained by summation over all possibilities for the hidden vertices:

    P(x_S) = Σ_{x_H} P(x_V)
The problem is to find the weights w_ij given the training set. This is achieved by a gradient ascent method applied to the log-likelihood function:

    L = Σ_{training set} ln P(x_S) = Σ_{training set} [ln(Σ_{x_H} exp(Σ_{i,j; i<j} w_ij x_i x_j)) − ln Z]

The derivative of ln Z is:

    ∂(ln Z)/∂w_ij = (1/Z) Σ_{x_V} x_i x_j exp(Σ_{i,j; i<j} w_ij x_i x_j) = P(x_i = 1, x_j = 1)

as all the terms for which x_i = 0 or x_j = 0 cancel.

Considering L₁, the contribution of just one training pattern to the log-likelihood function, then:

    ∂L₁/∂w_ij = Σ_{x_H} x_i x_j exp(Σ_{i,j; i<j} w_ij x_i x_j) / Σ_{x_H} exp(Σ_{i,j; i<j} w_ij x_i x_j) − ∂(ln Z)/∂w_ij
              = P(x_i = 1, x_j = 1 | x_S) − P(x_i = 1, x_j = 1)

and for the whole log-likelihood:

    ∂L/∂w_ij = Σ_{training set} [P(x_i = 1, x_j = 1 | x_S) − P(x_i = 1, x_j = 1)]

✍ Remarks:
➥ To evaluate the above expression two simulations are necessary: one for P(x_i = 1, x_j = 1) and one for P(x_i = 1, x_j = 1 | x_S) (with "clamped" inputs and outputs). The resulting algorithm is very slow.
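For a very small network both probabilities may be obtained by exhaustive enumeration rather than simulation; a Python sketch (the weights and the clamped pattern are hypothetical):

    import numpy as np
    from itertools import product

    W = np.array([[0.0, 0.5, -0.3],
                  [0.5, 0.0, 0.8],
                  [-0.3, 0.8, 0.0]])   # symmetric weights, hypothetical
    n = W.shape[0]

    def unnorm(x):
        """exp(sum_{i<j} w_ij x_i x_j) for a binary state x."""
        x = np.array(x, dtype=float)
        return np.exp(0.5 * x @ W @ x)   # 0.5 corrects the double count

    states = list(product([0, 1], repeat=n))
    Z = sum(unnorm(s) for s in states)

    # free-running term P(x_i = 1, x_j = 1), here for i = 0, j = 1
    i, j = 0, 1
    p_free = sum(unnorm(s) for s in states if s[i] == 1 and s[j] == 1) / Z

    # clamped term: visible unit 0 fixed to 1 (unit 2 hidden)
    clamped = [s for s in states if s[0] == 1]
    Zc = sum(unnorm(s) for s in clamped)
    p_clamp = sum(unnorm(s) for s in clamped if s[j] == 1) / Zc

    print(p_clamp - p_free)   # gradient contribution dL1/dw_01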
Advanced Topics
CHAPTER 18
Matrix Operations on ANN

➧ 18.1 New Matrix Operations

As already seen, ANN involve heavy manipulation of large sets of numbers. It is most convenient to manipulate them as matrices or vectors (column matrices), and it is possible to write fully matrix formulas for many ANN algorithms. However, some operations are the same across several algorithms and it makes sense to introduce new matrix operations for them, in order to avoid unnecessary operations and waste of storage space in digital simulations.

Definition 18.1.1. The addition/subtraction on rows/columns between a constant and a vector, or a vector and a matrix, is defined as follows (1̂ being the vector with all elements equal to 1):

a. Addition/subtraction between a constant and a vector:

    a ⊕_R x^T ≜ a1̂^T + x^T,  a ⊖_R x^T ≜ a1̂^T − x^T,  a ⊕_C x ≜ a1̂ + x,  a ⊖_C x ≜ a1̂ − x

b. Addition/subtraction between a vector and a matrix:

    x^T ⊕_R A ≜ 1̂x^T + A,  x^T ⊖_R A ≜ 1̂x^T − A,  x ⊕_C A ≜ x1̂^T + A,  x ⊖_C A ≜ x1̂^T − A

✍ Remarks:
➥ These operations avoid an unnecessary expansion of a constant/vector to a vector/matrix before doing an addition/subtraction.
➥ The operations defined above are commutative.
➥ When the operation involves a constant, it represents in fact an addition/subtraction over all the elements of the vector/matrix. In this situation ⊕_R is practically equivalent to ⊕_C and ⊖_R is equivalent to ⊖_C (and they could be replaced with something simpler, e.g. ⊕ and ⊖). However, it seems that not introducing separate operations for this case keeps the formulas simpler and easier to follow.

Definition 18.1.2. The Hadamard product between a matrix and a vector (row/column matrix) is defined as follows:

    x^T ⊙_R A ≜ (1̂x^T) ⊙ A and x ⊙_C A ≜ (x1̂^T) ⊙ A

✍ Remarks:
➥ These operations avoid the expansion of vectors to matrices before doing the Hadamard product.
➥ They seem to fill a gap between the operation of multiplication between a constant and a matrix and the Hadamard product.

Definition 18.1.3. The (meta)operator H takes as arguments two matrices of the same size and three operators. Depending on the sign of the elements of the first matrix, it applies one of the three operators to the corresponding elements of the second matrix. It returns the second matrix, updated. Assuming that the two matrices are A and B, and the operators are f, g and h, then H{A; B; f; g; h} = B′, the elements of B′ being:

    b′_ij = f(b_ij) if a_ij > 0,  b′_ij = g(b_ij) if a_ij = 0,  b′_ij = h(b_ij) if a_ij < 0

where a_ij, b_ij are the elements of A, respectively B.

✍ Remarks:
➥ While H could be replaced by some operations with the sign function, it wouldn't be as efficient when used in simulations, and it may be used in several ANN algorithms.
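In a matrix language with broadcasting, the new operations map directly onto existing primitives; a NumPy sketch of the ⊖_C, ⊙_C and H operations (illustrative only; as noted above, a dedicated implementation may be more efficient than going through the sign of every element):

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = np.array([[10.0], [20.0]])        # column vector

    # x (-)_C A = x 1^T - A : broadcasting expands x across columns
    print(x - A)

    # x (.)_C A = (x 1^T) Hadamard A : scale each row of A by x
    print(x * A)

    # H{A, B, f, g, h}: apply f, g or h to b_ij depending on sign(a_ij)
    def H(A, B, f, g, h):
        return np.where(A > 0, f(B), np.where(A == 0, g(B), h(B)))

    S = np.array([[1.0, -2.0], [0.0, 3.0]])
    print(H(S, A, lambda b: b + 1, lambda b: b, lambda b: -b))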
➧ 18.2 Algorithms

18.2.1 Backpropagation

Plain backpropagation. Using definition 18.1.1, formulas (2.7b) and (2.7c) are written as:

    ∇_{z_ℓ}E = cW_{ℓ+1}^T [∇_{z_{ℓ+1}}E ⊙ z_{ℓ+1} ⊙ (1 ⊖_C z_{ℓ+1})], for ℓ = 1, L−1
    (∇E)_ℓ = c[∇_{z_ℓ}E ⊙ z_ℓ ⊙ (1 ⊖_C z_ℓ)] z_{ℓ−1}^T, for ℓ = 1, L

and equations (2.10b) and (2.10c) become:

    ∇_{z_ℓ}E = cW̃_{ℓ+1}^T [∇_{z_{ℓ+1}}E ⊙ z_{ℓ+1} ⊙ (1 ⊖_C z_{ℓ+1})], for ℓ = 1, L−1
    (∇E)_ℓ = c[∇_{z_ℓ}E ⊙ z_ℓ ⊙ (1 ⊖_C z_ℓ)] z̃_{ℓ−1}^T, for ℓ = 1, L

Backpropagation with momentum. Formulas (2.12a), (2.12b) and (2.12c) change to:

    (∇E)_{ℓ,pseudo} = c[∇_{z_ℓ}E ⊙ (f′(a_ℓ) ⊕_C c_f)] z_{ℓ−1}^T
    (∇E)_{ℓ,pseudo} = c[∇_{z_ℓ}E ⊙ (z_ℓ ⊙ (1 ⊖_C z_ℓ) ⊕_C c_f)] z_{ℓ−1}^T
    (∇E)_{ℓ,pseudo} = c[∇_{z_ℓ}E ⊙ (z_ℓ ⊙ (1 ⊖_C z_ℓ) ⊕_C c_f)] z̃_{ℓ−1}^T

Adaptive Backpropagation. From (2.13), and using definition 18.1.3, (2.14) is replaced by:

    η(t) = H{ΔW(t) ⊙ ΔW(t−1); η(t−1); I; I; D}

SuperSAB. Using the H operator, the SuperSAB rules may be written as:

    η(t) = H{ΔW(t) ⊙ ΔW(t−1); η(t−1); I; I; D}
    ΔW(t+1) = −η(t) ⊙ ∇E − H{ΔW(t) ⊙ ΔW(t−1); ΔW(t); ≡0; ≡0; −}

(note that the product ΔW(t) ⊙ ΔW(t−1) is used twice, and it is probably better to calculate it just once, before applying these formulas).
18.2.2
SOM/Kohonen Networks
The algorithms heavily depend over the dW=dt equation chosen to model the learning
process.
The trivial equation (3.1) is changed to:
The Riccati equation (3.5) becomes:
dW
dt
=
b T; a
1x
C
W ).
dW
dt
dW
dt
=
xT
=
xT
R
R
W.
(W x) C W
(see proof of (3.5),
322
CHAPTER 18. MATRIX OPERATIONS ON ANN
The more general equations (3.2.1) and (3.2.2):
R
C
dW
dW
T
= x
(W x)
W
and
dt
dt
W xxT ; (W x) W
C
=
The trivial model with a neuronal neighborhood and a stop condition (3.22) will be written as:
W (t + 1) = W (t) + (t)h(jx(K )
✍
C
x(K )k j)
C
T
x
R
W
Remarks:
➥
The above equations are just examples. There are many possible variations in
SOM/Kohonen networks and it's very easy to build own equations, according to
the model chosen.
18.2.3 BAM/Hopfield Networks

BAM networks. The (4.3) formulas change to:

    x(t+1) = H{W^T y(t); x(t); ≡+1; =; ≡−1}
    y(t+1) = H{W x(t+1); y(t); ≡+1; =; ≡−1}

(here the three operators set the element to +1, leave it unchanged, respectively set it to −1). For working in reverse, (4.4) become:

    y(t+1) = H{W x(t); y(t); ≡+1; =; ≡−1}
    x(t+1) = H{W^T y(t+1); x(t); ≡+1; =; ≡−1}

The final algorithm changes accordingly.

Discrete Hopfield memory. Formula (4.6) transforms to:

    y(t+1) = H{W y(t) + x − t; y(t); ≡+1; =; ≡0}

Continuous Hopfield memory. (4.10) may be written as:

    y(t+1) = y(t) + [Wy + x − (1/τ) ln(y(t) ⊘ (1 ⊖_C y(t)))] ⊙ y(t) ⊙ (1 ⊖_C y(t))

Here ⊘ signifies the element-wise division between y(t) and 1 ⊖_C y(t). The ln function follows the convention used in this book for scalar functions applied to vectors: it applies to each vector element in turn.
➧ 18.3 Conclusions

The new matrix operations seem to be justified by their usage across several very different network architectures. Table 18.1 shows their usage.

    Operation   ANN architectures
    ⊕_R         (none)
    ⊖_R         SOM
    ⊕_C         momentum
    ⊖_C         backpropagation, momentum, SOM, continuous Hopfield
    ⊙_R         (none)
    ⊙_C         SOM
    H           adaptive backpropagation, SuperSAB, BAM, discrete Hopfield

    Table 18.1: The usage of the new matrix operations across ANN architectures.

It may be seen that two operations, ⊕_R and ⊙_R, were not used in the ANN algorithms studied here. However, they were defined because:

• there are many other, yet unchecked, algorithms where they may find a usage;
• together with the rest of the operations, they form a complete (symmetrical) system allowing for a large variety of matrix/vector manipulations.

The fact that such different ANN architectures could be expressed in terms of fully matrix equations strongly suggests that many other algorithms (if not all) may be converted to full matrix formulas.

One other operator, the element-wise "Hadamard division" ⊘, also seems to be useful; it represents the "opposite" of the Hadamard product, possibly filling a gap in matrix operations.

The usage of matrix formulas in numerical simulations has the following advantages:

• it splits the difficulty of implementation into two levels: a lower one, involving matrix operations, and a higher one, involving the ANNs themselves;
• it leads to code reuse, with respect to matrix operations;
• it makes the implementation of new ANNs easier, once the basic matrix operations have been implemented;
• ANN algorithms expressed through the new matrix operations do not lead to unnecessary operations or waste of memory;
• it makes heavy optimization of the basic matrix operations more desirable, as the resulting code is reusable; see [Mos97] and [McC97] for some ideas regarding optimizations;
• it makes debugging of ANN implementations easier.

In order to take full advantage of the matrix formulas, some supplemental support may be necessary:

• scalar functions, when applied to vectors, do in fact apply to each element in turn;
• some algorithms use the summation over all elements of a vector, i.e. an operation of the type x^T 1̂.
APPENDIX A
Mathematical Sidelines

➧ A.1 Distances

A.1.1 Euclidean Distance

Let x, y ∈ R^n be two real vectors of dimension n ∈ N. The Euclidean distance d between the vectors x and y is defined as:

    d = ‖x − y‖ = √(Σ_{i=1}^n (x_i − y_i)²)    (A.1)

Also, considering the vectors as column matrices, d = √((x − y)^T (x − y)).

A.1.2 Hamming Distance

The Hamming space of dimension n is defined as:

    H^n = {x^T = (x₁ ... x_n) ∈ R^n | x_i ∈ {−1, 1}}

so in a Euclidean space the Hamming space can be represented as a set of 2^n points at equal distance from the origin (the corners of a hyper-cube).

The Hamming distance between 2 (Hamming) points x and y is defined as:

    h = ‖x − y‖_H = Σ_{i=1}^n δ(x_i, y_i), where δ(x_i, y_i) = 0 if x_i = y_i and 1 if x_i ≠ y_i

i.e. h represents the number of mismatched components of x and y.

✍ Remarks:
➥ For 2 Euclidean vectors x and y subject to the restriction x_i, y_i ∈ {−1, 1}, then (see (A.1)): (x_i − y_i)² = 0 if x_i = y_i and 4 if x_i ≠ y_i, so h = d²/4.
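Both distances, and the h = d²/4 relation, can be checked in a few lines of Python (illustrative):

    import numpy as np

    x = np.array([1, -1, 1, 1, -1])
    y = np.array([1, 1, 1, -1, -1])

    d = np.linalg.norm(x - y)      # Euclidean distance (A.1)
    h = np.sum(x != y)             # Hamming distance: mismatched components
    print(h, d ** 2 / 4)           # equal for +/-1 components: h = d^2/4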
➧ A.2 Generalized Spherical Coordinates

Figure A.1: The generalized spherical coordinates. The angles φ_i are measured from the position vector of the point to the axes of a Cartesian system of coordinates.

Considering an n-dimensional space, it is possible to define the position of an arbitrary point by means of n angles and a distance: {φ₁, ..., φ_n, r}. r represents the distance from the (current) point to the origin of the coordinate system, while the angles φ_i are measured between the position vector and the axes of a Cartesian orthogonal system. See figure A.1.

✍ Remarks:
➥ Note that the system {r, φ₁, ..., φ_n} has n + 1 elements and thus the coordinates are not independent.

By using the Pythagorean theorem repeatedly:

    |r|² = (r cos φ₁)² + ... + (r cos φ_n)²  ⇒  Σ_{i=1}^n cos² φ_i = 1
➧ A.3 Properties of Symmetric Matrices

See [Bis95] pp. 440–443.

A matrix A is called symmetric if it is square and equal to its transpose, A = A^T, i.e. a_ij = a_ji.

Proposition A.3.1. The inverse of a symmetric matrix, if it exists, is also symmetric.

Proof. It is assumed that the inverse A^{−1} exists. Then, by definition, A^{−1}A = I, where I is the unit matrix. For any two matrices A and B it is true that (AB)^T = B^T A^T. Applying this result gives A^T (A^{−1})^T = I. Finally, multiplying to the left by A^{−1}, and knowing that A^T = A, it follows that (A^{−1})^T = A^{−1}. ∎

A.3.1 Eigenvectors and Eigenvalues

Assume that there are a set of eigenvectors {u_i}_{i=1,n} and a corresponding set of eigenvalues {λ_i}_{i=1,n}, such that:

    A u_i = λ_i u_i, i = 1, n    (A.2)

The eigenvalues are found from the general equation:

    A x = λx  ⇔  (A − λI) x = 0̂

If the matrix A − λI had an inverse then, by multiplying the above equation by this inverse, it would give x = 0̂, i.e. all eigenvectors would be zero. To avoid this situation it is necessary to impose the condition that the matrix A − λI is not invertible, i.e. its determinant is null:

    |A − λI| = 0

and this leads to the characteristic polynomial of A, of the n-th degree in λ, whose roots are the eigenvalues. The set {λ_i}_{i=1,n} is also named the spectrum of A.

Proposition A.3.2. If two eigenvectors are parallel then they represent the same eigenvalue (assuming that they are non-zero).

Proof. Assume, by absurd, that the above proposition is not true, i.e. there are two parallel eigenvectors u₁ ∥ u₂, u₁ = αu₂ (where α is any non-zero constant), such that Au₁ = λ₁u₁, Au₂ = λ₂u₂ and λ₁ ≠ λ₂. But then the following is also true: Au₁ = λ₁u₁ and Au₁ = αAu₂ = αλ₂u₂ = λ₂u₁ and, by subtracting the equations, it follows that (λ₁ − λ₂)u₁ = 0̂. Finally, as u₁ ≠ 0̂, the two eigenvalues are equal, contradiction. ∎

The eigenvectors are defined up to a multiplicative constant: if a vector u is an eigenvector then γu is also an eigenvector for the same eigenvalue (where γ is some constant).

Consider two arbitrarily chosen eigenvectors u_i and u_j. Multiplying (A.2) by u_j^T, and the equation Au_j = λ_j u_j by u_i^T:

    u_j^T A u_i = λ_i u_j^T u_i and u_i^T A u_j = λ_j u_i^T u_j

Considering that A is symmetric, then u_j^T A u_i = u_i^T A u_j (using (AB)^T = B^T A^T) and u_j^T u_i = u_i^T u_j (whatever u_i and u_j). By subtracting the two above equations:

    (λ_i − λ_j) u_i^T u_j = 0

Two situations arise:

• λ_i ≠ λ_j: then u_i^T u_j = 0, i.e. u_i ⊥ u_j: the vectors are orthogonal.
• λ_i = λ_j: by substituting λ_i and respectively λ_j in (A.2) and adding the two equations obtained, it follows that a linear combination αu_i + βu_j of the two eigenvectors is also an eigenvector. Because the two vectors are not parallel (u_i ∦ u_j), they define a plane, and a pair of orthogonal vectors may be chosen as linear combinations of the two.

The same reasoning may be applied for more than 2 equal eigenvalues.

Considering the above discussion, the eigenvector set {u_i} may be easily normalized such that u_i^T u_j = δ_ij, ∀i, j. Also, the associated matrix U, built using the {u_i} as columns, is orthogonal: U^T U = UU^T = I, i.e. U^T = U^{−1} (by multiplying the true relation U^T U = I, to the right, by U^{−1}).

Proposition A.3.3. The inverse of the matrix A has the same eigenvectors and the 1/λ_i eigenvalues, i.e. A^{−1} u_i = (1/λ_i) u_i.

Proof. By multiplying (A.2) with A^{−1} to the left, and using A^{−1}A = I, it gives u_i = λ_i A^{−1} u_i. ∎

✍ Remarks:
➥ The A matrix may be diagonalized. From Au_i = λ_i u_i and, by multiplying by u_j^T to the left: u_j^T A u_i = λ_i δ_ij, so, in matrix notation, U^T A U = Λ, where Λ = diag(λ₁, ..., λ_n). Then:

    |U^T A U| = |U^T| |A| |U| = |U^T U| |A| = |I| |A| = |A| = |Λ| = Π_{i=1}^n λ_i
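These properties are easy to verify numerically on a small symmetric matrix; a NumPy sketch (the matrix is arbitrary):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])        # symmetric matrix

    lam, U = np.linalg.eigh(A)        # eigenvalues, orthonormal eigenvectors

    print(np.allclose(U.T @ U, np.eye(2)))              # U^T U = I
    print(np.allclose(U.T @ A @ U, np.diag(lam)))       # U^T A U = Lambda
    print(np.allclose(np.prod(lam), np.linalg.det(A)))  # |A| = prod lambda_i
    print(np.allclose(np.linalg.inv(A) @ U, U / lam))   # A^-1 u_i = u_i/lambda_i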
Proposition A.3.4. Rayleigh Quotient. For A a square symmetric matrix and any vector a, it is true that:

    (a^T A a)/‖a‖² ≤ λ_max

where λ_max = max_i λ_i is the maximum eigenvalue; the Euclidean metric is used here.

Proof. The above relation should be unaffected by a coordinate transformation. Then use the coordinate transformation defined by the matrix U. The new representation of the vector a is a′ = Ua and, respectively, a′^T = a^T U^T (use (AB)^T = B^T A^T). The (Euclidean) norm remains unchanged: ‖a′‖² = a′^T a′ = a^T a = ‖a‖² (as U^T U = I). Then:

    (a^T A a)/‖a‖² ≤ λ_max  ⇔  (a′^T A a′)/‖a′‖² ≤ λ_max  ⇔  (a^T U^T A U a)/‖a‖² ≤ λ_max

and, as U^T A U = Λ, the above inequality is equivalent to:

    a^T Λ a = Σ_i λ_i a_i² ≤ λ_max ‖a‖² = λ_max Σ_i a_i²

which, obviously, is true. ∎
Definition A.3.1. A matrix A is positive definite if x^T A x > 0, ∀x ≠ 0̂.

From (A.2), considering an orthonormal set of eigenvectors and multiplying to the left by u_i^T, then λ_i = u_i^T A u_i, so the eigenvalues of a positive definite matrix are positive.

A.3.2 Rotation

Proposition A.3.5. If the matrix U is used for a coordinate transformation, the result is a rotation. U is supposed orthonormal.

Proof. Let x̃ = U^T x. Then ‖x̃‖² = x̃^T x̃ = x^T U U^T x = ‖x‖² (use (AB)^T = B^T A^T), i.e. the length of the vector is not changed. Let x̃₁ = U^T x₁ and x̃₂ = U^T x₂ be two vectors. Then x̃₁^T x̃₂ = x₁^T U U^T x₂ = x₁^T x₂, i.e. the angle between two vectors is preserved. The only transformation preserving lengths and angles is the rotation. ∎

A.3.3 Quadratic Forms

A quadratic form is of the type F(x) = x^T A x, where A is an arbitrary square matrix. By using the eigenvectors of A, the function F(x) becomes:

    F(x) = x^T A x = x^T U U^T A U U^T x = x̃^T (U^T A U) x̃ = x̃^T Λ x̃ = Σ_{i=1}^n λ_i x̃_i²

(because U U^T = I, x̃ = U^T x and U^T A U = Λ).
➧ A.4 The Gaussian Integrals

See [Bis95] pp. 444–447.

A.4.1 The Unidimensional Case

Let I = ∫_{−∞}^∞ e^{−x²} dx; then:

    I² = ∫_{−∞}^∞ e^{−x²} dx ∫_{−∞}^∞ e^{−y²} dy = ∬_{R²} e^{−(x²+y²)} dS

where dS = dx dy is the elementary surface.

By switching to polar coordinates (see figure A.2), x² + y² = r² and dS = r dr dφ; then the integral becomes:

    I² = ∬_{R²} e^{−r²} dS = ∫₀^∞ ∫₀^{2π} e^{−r²} r dr dφ = 2π ∫₀^∞ r e^{−r²} dr = 2π (−e^{−r²}/2)|₀^∞ = π

Figure A.2: Polar coordinates: the surface element dS.

Finally, ∫_{−∞}^∞ e^{−x²} dx = √π and ∫₀^∞ e^{−x²} dx = √π/2, because e^{−x²} is an even function (same value for x and −x).
A.4.2 The Multidimensional Case

Let I = ∫_{R^n} exp(−x^T A x/2) dx, where A is an n×n square and symmetric matrix and x ∈ R^n (dx = dx₁ dx₂ ... dx_n).

Since A is symmetric, it is possible to build an orthonormal set of eigenvectors {u_i}_{i=1,n}, such that u_i^T u_j = δ_ij (see section A.3), and then the x vector may be written as x = Σ_{i=1}^n α_i u_i.

The change of variables from {x_i}_{i=1,n} to {α_i}_{i=1,n} is done. Then x^T A x = Σ_{i=1}^n λ_i α_i², where {λ_i}_{i=1,n} are the eigenvalues of A, and dx = |J| dα₁ ... dα_n.

∂x_i/∂α_j = u_ij (where u_ij is the i-th element of the vector u_j) and, because the set {u_j}_{j=1,n} is orthonormal, the associated matrix U satisfies U^T U = I (the matrix is orthogonal, see section A.3), and the Jacobian determinant |J| = |{∂x_i/∂α_j}_ij| becomes:

    |J|² = |U|² = |U^T| |U| = |U^T U| = |I| = 1  ⇒  |J| = 1

(the integrand exp(−x^T A x/2) is positive over all the space R^n, so the integral is positive and the solution |J| = −1 is not acceptable). Finally, the integral becomes:

    I = Π_{i=1}^n ∫_{−∞}^∞ exp(−λ_i α_i²/2) dα_i = Π_{i=1}^n √(2π/λ_i)

Because |A| = Π_{i=1}^n λ_i, then I = (2π)^{n/2}/√|A|.
A.4.3 The Multidimensional Gaussian Integral with a Linear Term

Let I = ∫_{R^n} exp(−x^T A x/2 + c^T x) dx, where A is an n×n square and symmetric matrix, x ∈ R^n and c ∈ R^n is a constant vector (dx = dx₁ dx₂ ... dx_n).

Let {u_i}_{i=1,n} be the set of orthonormated eigenvectors of A. The c vector may be written by means of the eigenvector set as c = Σ_{i=1}^n c_i u_i, where c_i = c^T u_i (as u_i^T u_j = δ_ij) are called the projections of c on u_i.

Similarly to the multidimensional Gaussian integral (above), the integral may be transformed into a product of independent integrals:

    I = Π_{i=1}^n ∫_{−∞}^∞ exp(−λ_i α_i²/2 + c_i α_i) dα_i

A square is forced in the exponent: −(λ_i/2)α_i² + c_i α_i = −(λ_i/2)(α_i − c_i/λ_i)² + c_i²/(2λ_i), such that the integral becomes:

    I = Π_{i=1}^n exp(c_i²/(2λ_i)) ∫_{−∞}^∞ exp[−(λ_i/2)(α_i − c_i/λ_i)²] dα_i

A new change of variable is done: α̃_i = α_i − c_i/λ_i; then dα̃_i = dα_i, the integral limits remain the same, and:

    I = exp(Σ_{i=1}^n c_i²/(2λ_i)) Π_{i=1}^n ∫_{−∞}^∞ exp(−λ_i α̃_i²/2) dα̃_i

Similarly as for the multidimensional Gaussian integral: I = ((2π)^{n/2}/√|A|) exp(Σ_{i=1}^n c_i²/(2λ_i)). Because A^{−1}u_i = (1/λ_i)u_i (see section A.3), then c^T A^{−1} c = Σ_{i=1}^n c_i²/λ_i and, finally:

    I = ((2π)^{n/2}/√|A|) exp(c^T A^{−1} c / 2)
➧ A.5 The Euler Functions

A.5.1 The Euler Function

The Euler function Γ_E(x) is defined as:

    Γ_E(x) = ∫₀^∞ e^{−t} t^{x−1} dt    (A.3)

and is convergent for x > 0.

Proposition A.5.1. For the Euler function it is true that Γ_E(x + 1) = xΓ_E(x).

Proof. Integrating by parts:

    Γ_E(x) = ∫₀^∞ e^{−t} t^{x−1} dt = e^{−t} (t^x/x)|₀^∞ + (1/x) ∫₀^∞ e^{−t} t^x dt = Γ_E(x + 1)/x ∎

Proposition A.5.2. If n ∈ N then n! = Γ_E(n + 1), where 0! = 1 by definition.

Proof. It is demonstrated by mathematical induction. For n = 0: Γ_E(1) = ∫₀^∞ e^{−t} dt = −e^{−t}|₀^∞ = 1 = 0!, and for n = 1, by using proposition A.5.1: Γ_E(2) = 1 · Γ_E(1) = 1 = 1!. It is assumed that n! = Γ_E(n + 1) is true, and then:

    (n + 1)! = (n + 1) n! = (n + 1)Γ_E(n + 1) = Γ_E(n + 2)

i.e. the equation (n + 1)! = Γ_E(n + 2) is also true. ∎
A.5.2 The Sphere Volume in the n-Dimensional Space

It is assumed that the volume V_n of a sphere of radius r in an n-dimensional space is proportional to the n-th power of the radius:

    V_n = C_n r^n, C_n = const.

where C_n is to be found.

The integral:

    I_n = ∫_{−∞}^∞ ... ∫_{−∞}^∞ exp[−a(x₁² + ... + x_n²)] dx₁ ... dx_n, a = const.

is calculated in two ways:

1. The integrals from I_n are decoupled, such that I_n = (∫_{−∞}^∞ e^{−ax²} dx)^n = (π/a)^{n/2}.

2. The change of variables from Cartesian coordinates to generalized spherical coordinates is performed:

    x₁² + ... + x_n² = r², dx₁ ... dx_n = dV_n = nC_n r^{n−1} dr

where the elementary volume in spherical coordinates may be taken as an infinitesimal spherical layer, due to the symmetry of the integrand relative to the origin. Then I_n becomes: I_n = nC_n ∫₀^∞ r^{n−1} e^{−ar²} dr.

A new change of variable is performed: ar² = x ⇒ dr = (1/(2√a)) x^{−1/2} dx and r^{n−1} = x^{(n−1)/2}/a^{(n−1)/2}, and the integral becomes:

    I_n = (nC_n/(2a^{n/2})) ∫₀^∞ x^{n/2−1} e^{−x} dx = (nC_n/(2a^{n/2})) Γ_E(n/2) = (C_n/a^{n/2}) Γ_E(n/2 + 1)

By comparing the two results, C_n = π^{n/2}/Γ_E(n/2 + 1) and, finally:

    V_n = π^{n/2} r^n / Γ_E(n/2 + 1)
➧ A.6 The Lagrange Multipliers

See [Bis95] pp. 448–450.

The problem is to find the stationary points of a function f(x), subject to a relation between the components of the vector x, given as g(x) = 0.

Geometrically, g(x) = 0 represents a surface in the X^n space, where n is the dimension of that space. At each point ∇g represents a vector perpendicular to that surface, and ∇f may be expressed as ∇f = ∇∥f + ∇⊥f, where ∇∥f is the component parallel to the surface g(x) = 0 and ∇⊥f is the component perpendicular to it. See figure A.3.

Figure A.3: The gradient vectors ∇f and ∇g in a bidimensional space.

✍ Remarks:
➥ Considering a point in the vicinity of x, on the g(x) surface, such that it is defined by the vector x + ε, where ε lies within the surface defined by g(x) = 0, then the Taylor development around x is:

    g(x + ε) = g(x) + ε^T ∇g(x)

and, on the other hand, g(x + ε) = g(x) = 0 because of the choice of ε. Then ε^T ∇g(x) = 0, i.e. ∇g(x) is perpendicular to the surface g(x) = 0.

As ∇⊥f ∥ ∇g (see above), it is possible to write ∇⊥f = −λ∇g, where λ is called the Lagrange multiplier or undetermined multiplier, and ∇∥f = ∇f + λ∇g.

The following Lagrange function is defined:

    L(x, λ) = f(x) + λ g(x)

such that ∇L = ∇∥f, and the condition for the stationary points of f is ∇L = 0̂.

For the n-dimensional space, the ∇L = 0̂ condition gives n + 1 equations:

    ∂L/∂x_i = 0, i = 1, n, and ∂L/∂λ = g(x) = 0

and so the constraint g(x) = 0 is also met.

More generally, for a set of constraints g_i(x) = 0, i = 1, m, the Lagrange function has the form:

    L(x, λ₁, ..., λ_m) = f(x) + Σ_{i=1}^m λ_i g_i(x)
➧ A.7 Useful Mathematical Equations

See [Str81] pp. 200–201.

A.7.1 Combinatorics

Consider N different objects. The number of ways it is possible to choose n objects out of the set of N is the binomial coefficient:

    C(N, n) = N! / [(N − n)! n!]

Considering the above expression, then:

    C(N−1, n) = (N−1)!/[(N−1−n)! n!] = ((N−n)/N) C(N, n) and C(N−1, n−1) = (n/N) C(N, n)

and, adding the two, representing the recurrence formula: C(N, n) = C(N−1, n) + C(N−1, n−1).
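The recurrence is the basis of the Pascal-triangle computation of the binomial coefficients; a short Python sketch:

    def binom(N, n):
        """C(N, n) via the recurrence C(N,n) = C(N-1,n) + C(N-1,n-1)."""
        if n in (0, N):
            return 1
        if n < 0 or n > N:
            return 0
        return binom(N - 1, n) + binom(N - 1, n - 1)

    print(binom(5, 2))   # 10 = 5!/(3! 2!)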
A.7.2 The Jensen's inequality
Let consider a convex function, i.e. a function for which all points from a chord, between
any two points of the graph, are \below" the graph of the function. See gure A.4 on the
facing page.
A.7 See
[Str81] pp. 200{201.
A.7. USEFUL MATHEMATICAL EQUATIONS
335
f (x)
b; f (b)
f (b)
f (x)
d(x)
f (a)
a; f (a)
a
x
x
b
Figure A.4: A convex function f . A chord between arbitrary points a
and b is under the graph of the function.
Proposition A.7.1. Considering a convex function f . a set of N > 2 points fx g =1
and a set of N numbers f g =1
i
i
P
such that
i
N
;N
f
=1
=1
i
i
X
i
=1
! X
f (x )
x >
i
;N
> 0, 8i then it is true that:
N
N
i
and
i
i
i
i
i
=1
which is called Jensen's inequality.
Proof. Let first consider two points a and b and two numbers t and 1 − t, 0 ≤ t ≤ 1, such that they respect the conditions of the theorem.

The points (a, f(a)) and (b, f(b)) define a chord whose equation is (the equation of a straight line passing through 2 points):

    d(x) = [b f(a) − a f(b)] / (b − a) + [(f(b) − f(a)) / (b − a)] x

then, for any x ∈ [a, b] it will be true that d(x) ≤ f(x). See also figure A.4.

By expressing x in the form x = a + t(b − a), t ∈ [0, 1], and replacing in the expression of d(x), it gives:

    f(a) + t[f(b) − f(a)] ≤ f[a + t(b − a)]  ⇔  f[(1 − t)a + t b] ≥ (1 − t) f(a) + t f(b)

i.e. the Jensen's inequality holds for two numbers (t and 1 − t).

Let c be a point inside the [a, b] interval, f′_−(c) the derivative to the left of f(x) at c and f′_+(c) the derivative to the right of f(x) at the same point c. For a continuous derivative in c they are equal: f′_−(c) = f′_+(c) = f′(c).

The expression [f(x) − f(c)]/(x − c) represents the tangent of the angle between the chord, passing through the points (x, f(x)) and (c, f(c)), and the Ox axis. Similarly f′(c) represents the tangent of the angle made by the tangent line in c.

Let m be a number f′_+(c) ≤ m ≤ f′_−(c) (for a convex function, in the sense above, the left derivative is not smaller than the right one). Because f is convex it is true that:

    [f(x) − f(c)]/(x − c) ≥ m for x < c   and   [f(x) − f(c)]/(x − c) ≤ m for x > c

see also figure A.5 on the following page.

Finally, from the above equations, it is true that f(x) ≤ m(x − c) + f(c), ∀x ∈ [a, b].
Figure A.5: A convex function f and its derivatives in the point c: to the left, f′_−(c), and to the right, f′_+(c). The chords for x < c and respectively x > c are drawn with dashed lines. The quantities f′_−(c), f′_+(c) and [f(x) − f(c)]/(x − c) are the tangents of the angles between the tangents in c, respectively the chords, and the Ox axis.
Considering now a set of numbers {x_i}_{i=1,N} ∈ [a, b] and a set of parameters {λ_i}_{i=1,N} ∈ [0, 1] such that Σ_{i=1}^N λ_i = 1, then from a ≤ x_i ≤ b it follows that λ_i a ≤ λ_i x_i ≤ λ_i b and, after a summation over i: a ≤ Σ_{i=1}^N λ_i x_i ≤ b.

Let c = Σ_{i=1}^N λ_i x_i ∈ [a, b]; then, for x_i ∈ [a, b]:

    f(x_i) ≤ m(x_i − c) + f(c)  ⇒  λ_i f(x_i) ≤ m(λ_i x_i − λ_i c) + λ_i f(c)

and the Jensen's inequality is obtained by summation over i = 1,N (the terms in m cancel, since Σ_i λ_i x_i = c, and f(c) = f(Σ_i λ_i x_i) remains). ∎
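A quick numerical check of the inequality (a Python/NumPy sketch; ln plays the role of a "convex" function in the book's sense, since all its chords lie below its graph, and the points and weights are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.log                            # every chord of ln lies below its graph
    x = rng.uniform(1.0, 10.0, size=50)   # arbitrary points x_i
    lam = rng.random(50)
    lam /= lam.sum()                      # lambda_i > 0, sum_i lambda_i = 1

    lhs = f(np.dot(lam, x))               # f(sum_i lambda_i x_i)
    rhs = np.dot(lam, f(x))               # sum_i lambda_i f(x_i)
    print(lhs, rhs, lhs >= rhs)           # lhs >= rhs: Jensen's inequality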
A.7.3 The Stirling Formula

Proposition A.7.2. For n ∈ ℕ, n ≫ 1, it is true that:

    ln n! ≃ n ln n − n = n ln(n/e)

Proof. The Euler function Γ_E(x + 1) = ∫_0^∞ e^{−t} t^x dt (see (A.3)) is estimated for x → ∞ by the method of the saddle point: the integrand is developed in series around its maximum and then the superior order terms are neglected.

The derivative of the integrand e^{−t} t^x = exp(−t + x ln t) is zero at the maximum:

    d/dt [exp(−t + x ln t)] = 0  ⇔  (−1 + x/t) exp(−t + x ln t) = 0

i.e. the maximum is at the point t = x (because the exp is never 0).

The exponent is developed in series around t = x:

    −t + x ln t = −x + x ln x + [(t − x)²/2!] d²/dt²(−t + x ln t)|_{t=x} + ... ≃ −x + x ln x − (t − x)²/(2x)

because d/dt(−t + x ln t)|_{t=x} = 0, and just the first terms from the series development are kept. Then:

    Γ_E(x + 1) ≃ exp(x ln x − x) ∫_0^∞ exp[ −(t − x)²/(2x) ] dt

The change of variable t → t − x is performed, and then:

    Γ_E(x + 1) ≃ exp(x ln x − x) ∫_{−x}^∞ exp[ −t²/(2x) ] dt

and, by using the limit x → ∞:

    Γ_E(x + 1) ≃ exp(x ln x − x) ∫_{−∞}^∞ exp[ −t²/(2x) ] dt

and finally another change of variable, s = t/√(2x), is made (see also section A.4):

    Γ_E(x + 1) ≃ √(2x) exp(x ln x − x) ∫_{−∞}^∞ e^{−s²} ds = √(2πx) exp(x ln x − x)

For large x = n ∈ ℕ: ln Γ_E(n + 1) = ln n! ≃ n ln n − n + ln √(2πn) ≃ n ln n − n. ∎
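The quality of the approximation improves with n, which can be verified with a few lines of Python (math.lgamma computes ln Γ_E(n + 1) = ln n! directly):

    import math

    # relative accuracy of ln n! ~ n ln n - n for growing n
    for n in (10, 100, 1000, 10000):
        exact = math.lgamma(n + 1)            # ln n!
        approx = n * math.log(n) - n
        print(n, exact, approx, (exact - approx) / exact)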
➧ A.8 Calculus of Variations

The change in a function f(x) when the variable x changes by a small amount δx is:

    δf = (df/dx) δx + O(δx²)

where O(δx²) represents the superior terms (depending at most as δx²).

For a function of several variables f(x_1, ..., x_n) the above expression becomes:

    δf = Σ_{i=1}^n (∂f/∂x_i) δx_i + O(δx²)

A functional E[f] is a form which takes a function as variable and returns a value. Considering an arbitrary function δf(x), which has small values everywhere, then the variation of E is (by similarity with δf, with Σ → ∫):

    δE = E[f + δf] − E[f] = ∫_X [δE/δf(x)] δf(x) dx + O(δf²)        (A.4)

X being the space of x.

Proposition A.8.1. The fundamental lemma of the calculus of variations. The condition of stationarity for E, to the lowest order in δf, involves the requirement δE/δf = 0, assuming the continuity of δE/δf.
Proof. Stationarity, to the lowest order, involves δE = 0 and O(δf²) ≃ 0. (A.4) gives ∫_X (δE/δf) δf dx = 0.

Let assume that there is an x̃ for which δE/δf|_{x̃} ≠ 0. Then the continuity condition implies that there is a whole vicinity [x̃_1, x̃_2] of x̃ such that δE/δf ≠ 0 for x ∈ [x̃_1, x̃_2], where it keeps its sign.

As δf is arbitrary, it is chosen such that δf ≠ 0 and keeps its sign in x ∈ [x̃_1, x̃_2], and δf = 0 in rest. Then it follows that:

    ∫_X (δE/δf) δf dx = ∫_{x̃_1}^{x̃_2} (δE/δf) δf dx ≠ 0

and the lemma assumptions are contradicted. ∎
✍ Remarks:
➥ A functional form, used in regularization theory, is:

    E[f] = ∫_X [ f² + (df/dx)² ] dx        (A.5)

Then, by replacing f with f + δf in the above equation:

    E[f + δf] = E[f] + 2 ∫_X [ f δf + (df/dx)(d(δf)/dx) ] dx + O(δf²)

The term ∫_X (df/dx)(d(δf)/dx) dx, integrated by parts, gives:

    ∫_X (df/dx)(d(δf)/dx) dx = (df/dx) δf |_{X boundaries} − ∫_X (d²f/dx²) δf dx

Considering the boundary term equal to 0 ((df/dx) δf |_{X boundaries} = 0), then (A.5) becomes:

    δE = 2 ∫_X [ f − d²f/dx² ] δf dx + O(δf²)

By comparing the above equation with (A.4) it follows that δE/δf = 2f − 2 d²f/dx².

Defining the operator D = d/dx, the functional and its derivative may be written as:

    E = ∫_X [ f² + (Df)² ] dx   and   δE/δf = 2f + 2 D̂ D f

where D̂ = −d/dx is the adjoint operator of D.
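The result δE/δf = 2f − 2 d²f/dx² can be checked numerically, as in the Python/NumPy sketch below: a small variation δf is added to f and the direct change of the functional is compared with ∫_X (δE/δf) δf dx. The periodic choice of f and δf (an assumption made here so that the boundary term vanishes) is illustrative.

    import numpy as np

    x = np.linspace(0.0, 2*np.pi, 2001)
    dx = x[1] - x[0]
    f  = np.sin(x)                                   # periodic: boundary term vanishes
    df = 1e-4 * (np.sin(x) + 0.5*np.sin(2*x))        # small periodic variation delta f

    def E(g):                                        # E[g] = int (g^2 + g'^2) dx
        return np.trapz(g**2 + np.gradient(g, dx)**2, dx=dx)

    dE_direct  = E(f + df) - E(f)
    func_deriv = 2*f - 2*np.gradient(np.gradient(f, dx), dx)   # delta E / delta f
    dE_formula = np.trapz(func_deriv * df, dx=dx)
    print(dE_direct, dE_formula)                     # agree up to O(delta f^2)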
➧ A.9 Principal Components

Let consider, in the space X of dimensionality n, a set of orthonormated vectors {u_i}_{i=1,n}:

    u_i^T u_j = δ_ij        (A.6)

and a set of vectors {x_i}_{i=1,N}, having the mean ⟨x⟩ = (1/N) Σ_{i=1}^N x_i.

The residual error E is defined as:

    E = (1/2) Σ_{i=K+1}^n Σ_{j=1}^N [ u_i^T (x_j − ⟨x⟩) ]²,   K = const., K ∈ {0, 1, ..., n − 1}

and may be written as (use the (AB)^T = B^T A^T matrix property):

    E = (1/2) Σ_{i=K+1}^n u_i^T Σ u_i   where   Σ = Σ_{i=1}^N (x_i − ⟨x⟩)(x_i − ⟨x⟩)^T

Σ being the covariance matrix of the {x_i} set.

The problem is to find the minima of E, with respect to {u_i}, subject to the constraints (A.6). This is done by using a set of Lagrange multipliers {λ_ij}. Then the Lagrange function to minimize is:

    L = E − (1/2) Σ_{i=K+1}^n Σ_{j=K+1}^n λ_ij (u_i^T u_j − δ_ij)

(because of the symmetry u_i^T u_j = u_j^T u_i, then λ_ij = λ_ji and each term under the sums appears twice; thus the factor 1/2 is inserted to count each different term once).

Let consider the matrices:

    U = ( u_{K+1} ... u_n )   and   M = ( λ_{K+1,K+1} ... λ_{K+1,n} ; ... ; λ_{n,K+1} ... λ_{n,n} )

U being formed using the u_i as columns and M being a symmetrical matrix. Then the Lagrange function becomes:

    L = (1/2) Tr(U^T Σ U) − (1/2) Tr[ M (U^T U − I) ]

Minimizing L with respect to the u_i means the set of conditions ∂L/∂u_ij = 0, i.e. in matrix format (use the (AB)^T = B^T A^T matrix property):

    (Σ + Σ^T) U − U (M + M^T) = 0̂  ⇒  Σ U = U M

and, by using the property of orthogonality of {u_i}, i.e. U^T U = I, it becomes:

    U^T Σ U = M        (A.7)

One particular solution of the above equation is to choose the {u_i} to be the eigenvectors of Σ (as Σ is symmetrical it is possible to build an orthogonal system of eigenvectors) and to choose M as the diagonal matrix of eigenvalues (i.e. λ_ij = δ_ij λ_i where λ_i are the eigenvalues of Σ).

An alternative is to consider the eigenvectors {ψ_i} of M and the matrix Ψ built by using them as columns. Let Λ be the diagonal matrix of eigenvalues of M, i.e. {Λ}_ij = δ_ij λ_i. As M is symmetric, it is possible to choose an orthogonal set {ψ_i}, i.e. Ψ^T Ψ = I.

From the eigenvector equation M Ψ = Ψ Λ, by multiplying to the left with Ψ^T: Λ = Ψ^T M Ψ.

By replacing M from (A.7):

    Λ = Ψ^T U^T Σ U Ψ = (U Ψ)^T Σ (U Ψ) = Ũ^T Σ Ũ

where Ũ = U Ψ.

This means that if there is a particular solution U to (A.7), then Ũ = U Ψ is also a solution:

    Ũ^T Σ Ũ = Λ

and the residual error may be written as:

    E = (1/2) Tr(U^T Σ U) = (1/2) Tr(M) = (1/2) Tr(Ũ^T Σ Ũ)

✍ Remarks:
➥ There is an invariance to the orthogonal transformation defined by Ψ.
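In practice the particular solution built from the eigenvectors of Σ is the one computed; a short Python/NumPy sketch (the data set is invented, for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # N = 500 vectors, n = 5
    Xc = X - X.mean(axis=0)                                   # subtract <x>
    Sigma = Xc.T @ Xc                                         # covariance matrix as in the text

    lam, U = np.linalg.eigh(Sigma)        # eigenvalues in ascending order, U orthonormal
    K = 3                                 # keep the K components with largest eigenvalues
    E = 0.5 * lam[:5 - K].sum()           # residual error: half the sum of discarded eigenvalues

    # direct check against E = (1/2) sum_{i>K} sum_j [u_i^T (x_j - <x>)]^2
    proj = Xc @ U[:, :5 - K]
    print(E, 0.5 * (proj**2).sum())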
APPENDIX B

Statistical Sidelines

➧ B.1 Probabilities

B.1.1 Probabilities and Bayes Theorem

Let consider some pattern vectors {x_p} and some classes {C_k} these patterns have to be classified into.

Definition B.1.1. The prior probability P(C_k) represents the probability of a pattern as being of class k while belonging to a very large set of samples:

    P(C_k) = (number of patterns of class C_k) / (total number of patterns)  ∈ [0, 1]        (B.1)

when "total number of patterns" → ∞.

Definition B.1.2. The joint probability P(C_k, X_ℓ) represents the probability of a pattern as being of class k and, at the same time, the pattern vector being in the pattern subspace X_ℓ ⊆ X; the pattern belonging to a very large set of samples:

    P(C_k, X_ℓ) = (number of patterns of class C_k with x ∈ X_ℓ) / (total number of patterns)  ∈ [0, 1]        (B.2)

when "total number of patterns" → ∞.

✍ Remarks:
➥ For discrete pattern spaces, x ∈ X_ℓ may be replaced with x ∈ {X_ℓ1, ...}, where X_ℓ1, ... are also pattern vectors.

B.1.1 See [Bis95] pp. 17-28 and [Rip96] pp. 19-20, 75.
➥ For continuous pattern spaces, either X_ℓ defines a volume in the pattern space, in which the point x should be, or a point, in which case x ∈ X_ℓ is replaced with x = X_ℓ; but, in this case, P(C_k, X_ℓ) represents an infinitesimal quantity.

Definition B.1.3. The class-conditional probability P(X_ℓ | C_k) represents the probability for a pattern of class C_k to have its pattern vector in the pattern subspace area defined by X_ℓ:

    P(X_ℓ | C_k) = (number of patterns of class C_k with x ∈ X_ℓ) / (total number of patterns of class C_k)  ∈ [0, 1]        (B.3)

when "total number of patterns of class C_k" → ∞.

Definition B.1.4. The distribution probability P(X_ℓ) represents the probability of a pattern to have its associated vector x in the subspace X_ℓ:

    P(X_ℓ) = (number of patterns with x ∈ X_ℓ) / (total number of patterns)  ∈ [0, 1]        (B.4)

when "total number of patterns" → ∞.

Definition B.1.5. The posterior probability P(C_k | X_ℓ) represents the probability for a pattern which has its associated vector in the subspace X_ℓ to be of class C_k:

    P(C_k | X_ℓ) = (number of patterns with x ∈ X_ℓ and of class C_k) / (total number of patterns with x ∈ X_ℓ)  ∈ [0, 1]        (B.5)

when "total number of patterns with x ∈ X_ℓ" → ∞.
✍ Remarks:
➥ Regarding X_ℓ and the probabilities, the same previous remarks apply.
➥ The prior probability refers to knowledge available before the pattern vector is known, while the posterior probability refers to knowledge available after the pattern vector is known.
➥ By assigning a pattern to the class for which the posterior probability is maximum, the errors of misclassification are minimized.

Theorem B.1.1. Bayes. The posterior probability is the normalized product between the prior and the class-conditional probabilities:

    P(C_k | X_ℓ) = P(X_ℓ | C_k) P(C_k) / P(X_ℓ)        (B.6)

P(X_ℓ) being the normalization factor.

Proof. By multiplying (B.3) and (B.1) and comparing the result with (B.2) it follows that:

    P(C_k, X_ℓ) = P(X_ℓ | C_k) P(C_k)        (B.7)

similarly, from (B.5) and (B.4):

    P(C_k, X_ℓ) = P(C_k | X_ℓ) P(X_ℓ)        (B.8)

The final result is obtained by comparing (B.7) and (B.8). ∎
✍ Remarks:
➥ When working with classification, all patterns belong to a class: if a pattern cannot be classified into a "normal" class then there may be an outliers class containing all the patterns not classifiable in any other class.
➥ P(X_ℓ) represents the normalization factor of P(X_ℓ | C_k) P(C_k).

Proof. Because each pattern should be classified into a class, then:

    Σ_{k=1}^K P(C_k | X_ℓ) = 1        (B.9)

By using the Bayes theorem (B.6) in (B.9), the distribution probability may be expressed as:

    P(X_ℓ) = Σ_{k=1}^K P(X_ℓ | C_k) P(C_k)

and then P(C_k | X_ℓ) is normalized, i.e. Σ_{k=1}^K P(C_k | X_ℓ) = 1. ∎
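For finite samples the definitions reduce to simple ratios of counts, so the Bayes theorem can be verified directly on a toy table (a Python/NumPy sketch with made-up counts):

    import numpy as np

    # counts[k, l] = number of patterns of class C_k with x in subspace X_l
    counts = np.array([[30., 10.],
                       [ 5., 55.]])
    total = counts.sum()

    P_Ck    = counts.sum(axis=1) / total                  # prior P(C_k), (B.1)
    P_Xl    = counts.sum(axis=0) / total                  # distribution P(X_l), (B.4)
    P_Xl_Ck = counts / counts.sum(axis=1, keepdims=True)  # class-conditional (B.3)

    posterior = P_Xl_Ck * P_Ck[:, None] / P_Xl[None, :]   # Bayes theorem (B.6)
    print(posterior)
    print(counts / counts.sum(axis=0))                    # posterior (B.5), directly from counts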
B.1.2 Probability Density, Expectation and Variance

Definition B.1.6. The probability density function p(x) is the function for which:

    P(X_ℓ) = ∫_{X_ℓ} p(x) dx        (B.10)

where X_ℓ is a pattern subspace.

Similarly, the following probability densities may be defined:

Joint probability density p(C_k, x):   P(C_k, X_ℓ) = ∫_{X_ℓ} p(C_k, x) dx

Class-conditional probability density p(x | C_k):   P(X_ℓ | C_k) = ∫_{X_ℓ} p(x | C_k) dx

Posterior probability density p(C_k | x):   P(C_k | X_ℓ) = ∫_{X_ℓ} p(C_k | x) dx

Definition B.1.7. The expectation (expected value) E{Q} of a function Q(x) is:

    E{Q} = ∫_X Q(x) p(x) dx

The variance V{Q} of a function Q(x) is:

    V{Q} = √( ∫_X [Q(x) − E{Q}]² p(x) dx )

(with this definition V{Q} is, strictly, the standard deviation; compare property 3 of the Gaussian (B.12) below).
Proposition B.1.1. Using probability densities, the Bayes theorem B.1.1 may be written as:

    P(C_k | x) = p(x | C_k) P(C_k) / p(x)        (B.11)

Proof. For X_ℓ being a point in the pattern space, (B.10) may be rewritten as dP(x) = p(x) dx, and similarly with the other types of probabilities; the final formula is obtained by doing the replacement into (B.6). ∎

As in the Bayes theorem, p(x) is a normalization factor for P(C_k | x).

Proof. p(x) represents the probability density for the pattern vector of being x no matter what the class; it represents the sum of the joint probability densities, for the pattern vector of being x and the pattern being of class C_k, over all classes:

    p(x) = Σ_{k=1}^K p(C_k, x) = Σ_{k=1}^K p(x | C_k) P(C_k)

and comparing to (B.11) it shows that P(C_k | x) is normalized. ∎

✍ Remarks:
➥ The p(x | C_k) probability density may be seen as the likelihood probability that a pattern of class C_k will have its pattern vector x. p(x) represents a normalization factor such that all the posterior probabilities sum to one. Then the Bayes theorem may be expressed as:

    posterior probability = (likelihood × prior probability) / (normalization factor)
➧ B.2 Modeling the Density of Probability

Let be a training set of classified patterns {x_p}_{p=1,P}. The problem is to find a good approximation for the probability density starting from the training set. Knowing it, from the Bayes theorem and the Bayes rule¹ it is possible to build a device able to classify new input patterns.

There are several approaches to this problem:

The parametric method: A specific functional form is assumed. There are a small number of tunable parameters which are optimized such that the model fits the training set. Disadvantages: there are limits to the capability of generalization; the functional forms chosen may not generalize well.

The non-parametric method: The form of the probability density is determined from the data (no functional form is assumed). Disadvantages: the number of parameters grows with the size of the training set.

The semi-parametric method: Tries to combine the above 2 methods by using a very general class of functional forms and by allowing the number of parameters to vary independently from the size of the training set. Feed-forward ANN are of this type.

¹ See the "Pattern Recognition" chapter.
B.2.1 The Parametric Method

The parametric method uses functions with few tunable parameters to model the probability density. These functions are named distributions. The most widely used is the Gaussian, due to its properties and good approximation of many real world processes.

Gaussian Unidimensional

For a unidimensional space the Gaussian distribution is defined as:

    p(x) = [1/(√(2π) σ)] exp[ −(x − μ)² / (2σ²) ],   μ, σ = const.        (B.12)

This function has the following properties:

1. p(x) is normalized, i.e. ∫_{−∞}^∞ p(x) dx = 1.
2. The expected value of x is μ, i.e. E{x} = ∫_{−∞}^∞ x p(x) dx = μ.
3. The variance (standard deviation) of x is σ, i.e. V{x} = √( ∫_{−∞}^∞ [x − E{x}]² p(x) dx ) = σ.

Proof.
1. By making the change of variable y = (x − μ)/(√2 σ), dy = dx/(√2 σ), and because ∫_{−∞}^∞ e^{−x²} dx = √π (see the mathematical appendix), then ∫_{−∞}^∞ p(x) dx = 1, i.e. the probability density is normalized (this is the role of the 1/(√(2π) σ) factor) as it should be, because the probability of finding x in the whole space is 1 (certainty).

2. The mean value of x is the expectation (see definition B.1.7) E{x} = ∫_{−∞}^∞ x p(x) dx and, by making the same change of variable as above (x = √2 σ y + μ):

    E{x} = (1/√π) ∫_{−∞}^∞ (√2 σ y + μ) e^{−y²} dy = σ √(2/π) ∫_{−∞}^∞ y e^{−y²} dy + (μ/√π) ∫_{−∞}^∞ e^{−y²} dy = μ

because the first integral ∫ y e^{−y²} dy is 0 (the integrand is an odd function: the value for −y is minus the value for y, and the integration interval is symmetric relatively to the origin) and the second integral is √π (see the mathematical appendixes).

3. The variance is (see definition B.1.7):

    V{x}² = ∫_{−∞}^∞ (x − μ)² p(x) dx

(as E{x} = μ) and the same change of variable leads to an integral solvable by parts:

    V{x}² = [1/(√(2π) σ)] ∫_{−∞}^∞ (x − μ)² exp[ −(x − μ)²/(2σ²) ] dx = (2σ²/√π) ∫_{−∞}^∞ y² e^{−y²} dy
          = −(σ²/√π) ∫_{−∞}^∞ y d(e^{−y²}) = −(σ²/√π) [ y e^{−y²} ]_{−∞}^∞ + (σ²/√π) ∫_{−∞}^∞ e^{−y²} dy = σ²  ∎

✍ Remarks:
➥ To apply to a stochastic process: calculate the mean and the variance of the training set, then replace them in (B.12) to find an approximation of the real probability density by a Gaussian.

B.2.1 See [Bis95] pp. 34-49 and [Rip96] pp. 21, 30-31.
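The remark above is a two-line program; a Python/NumPy sketch with synthetic data (the true parameters are assumptions made for the demonstration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=2.0, size=10000)   # the "training set"

    mu, sigma = x.mean(), x.std()                    # estimates of mu and sigma

    def p(t):                                        # the fitted Gaussian (B.12)
        return np.exp(-(t - mu)**2 / (2*sigma**2)) / (np.sqrt(2*np.pi) * sigma)

    print(mu, sigma, p(3.0))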
Gaussian Multidimensional

In general, in an N-dimensional space, the Gaussian distribution is defined as:

    p(x) = [1/((2π)^{N/2} √|Σ|)] exp[ −(x − μ)^T Σ^{−1} (x − μ) / 2 ]        (B.13)

where μ is an N-dimensional vector and Σ is an N × N matrix, symmetric and invertible.

This function has the following properties:

1. p(x) is normalized, i.e. ∫_{ℝ^N} p(x) dx = 1, where dx = dx_1 ... dx_N.
2. The expected value of x is μ, i.e. E{x} = ∫_{ℝ^N} x p(x) dx = μ.
3. The expected value of (x − μ)(x − μ)^T is Σ, i.e. E{(x − μ)(x − μ)^T} = Σ.

Proof.
1. Because det(Σ^{−1}) det(Σ) = det(Σ^{−1} Σ) = det(I) = 1, then det(Σ^{−1}) = 1/det(Σ) and, by making the change of variable x̃ = x − μ, the integral becomes:

    ∫_{ℝ^N} p(x) dx = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} exp( −x̃^T Σ^{−1} x̃ / 2 ) dx̃ = 1

(see also the mathematical appendix regarding Gaussian integrals).

2. The mean value of x is the expectation:

    E{x} = ∫_{ℝ^N} x p(x) dx = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} x exp[ −(x − μ)^T Σ^{−1} (x − μ)/2 ] dx

and, by making the same change of variable as above, x̃ = x − μ, it becomes:

    E{x} = [√|Σ^{−1}| / (2π)^{N/2}] [ ∫_{ℝ^N} x̃ exp( −x̃^T Σ^{−1} x̃/2 ) dx̃ + μ ∫_{ℝ^N} exp( −x̃^T Σ^{−1} x̃/2 ) dx̃ ]

The function exp(−x̃^T Σ^{−1} x̃/2) is even (same value for x̃ and −x̃), such that the integrand of the first integral is an odd function and, the interval of integration being symmetric to the origin, the integral is zero. The second integral has the value (2π)^{N/2} √|Σ| and, as |Σ^{−1}| = 1/|Σ|, then finally E{x} = μ.

3. The expectation of that matrix is:

    E{(x − μ)(x − μ)^T} = ∫_{ℝ^N} (x − μ)(x − μ)^T p(x) dx
        = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} (x − μ)(x − μ)^T exp[ −(x − μ)^T Σ^{−1} (x − μ)/2 ] dx

The same change of variable as above, x̃ = x − μ, gives:

    E{(x − μ)(x − μ)^T} = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} x̃ x̃^T exp( −x̃^T Σ^{−1} x̃/2 ) dx̃

Let {u_i}_{i=1,N} be the eigenvectors and {λ_i}_{i=1,N} the eigenvalues of Σ, such that Σ u_i = λ_i u_i. Then consider the set of eigenvectors chosen such that it is orthonormated (see the mathematical appendix regarding the properties of symmetrical matrices).

Let U be the matrix built using the eigenvectors of Σ as columns and Λ the matrix of eigenvalues, Λ_ij = δ_ij λ_i (δ_ij being the Kronecker symbol), i.e. Λ is a diagonal matrix with the eigenvalues on the main diagonal and 0 in rest.

By multiplying E{(x − μ)(x − μ)^T} with U^T to the left and with U to the right, and because the set of eigenvectors is orthonormated, then U^T U = I ⇒ U^{−1} = U^T ⇒ U U^T = I, and it gets that:

    U^T E{(x − μ)(x − μ)^T} U = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} U^T x̃ x̃^T U exp( −x̃^T U U^T Σ^{−1} U U^T x̃ / 2 ) dx̃

A new change of variable is performed: y = U^T x̃, and then y^T = x̃^T U and dy = dx̃ (because this transformation conserves the distances and angles). Also Σ^{−1} has the same eigenvectors as Σ, the eigenvalues being {1/λ_i}_{i=1,N}, and respectively the eigenmatrix Λ^{−1} is defined by Λ^{−1}_ij = δ_ij/λ_i. Then:

    U^T E{(x − μ)(x − μ)^T} U = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} y y^T exp( −y^T Λ^{−1} y / 2 ) dy
        = [√|Σ^{−1}| / (2π)^{N/2}] ∫_{ℝ^N} y y^T exp( −Σ_{i=1}^N y_i²/(2λ_i) ) dy

i.e. the integrals now may be decoupled. First, y y^T is the matrix with the elements {y y^T}_{ij} = y_i y_j.

Each element of the matrix {U^T E{(x − μ)(x − μ)^T} U}_{i,j} is computed separately. There are two cases: non-diagonal and diagonal elements.

The non-diagonal elements (i ≠ j) are:

    {U^T E{(x − μ)(x − μ)^T} U}_{i,j} = [√|Σ^{−1}| / (2π)^{N/2}]
        [ Π_{k=1, k≠i,j}^N ∫_{−∞}^∞ exp( −y_k²/(2λ_k) ) dy_k ]
        [ ∫_{−∞}^∞ y_i exp( −y_i²/(2λ_i) ) dy_i ] [ ∫_{−∞}^∞ y_j exp( −y_j²/(2λ_j) ) dy_j ]

and, because the function y_i exp(−y_i²/(2λ_i)) is odd, the corresponding integrals are 0, such that {U^T E{(x − μ)(x − μ)^T} U}_{i,j} = 0 for i ≠ j.

The diagonal elements are:

    {U^T E{(x − μ)(x − μ)^T} U}_{i,i} = [√|Σ^{−1}| / (2π)^{N/2}]
        [ Π_{k=1, k≠i}^N ∫_{−∞}^∞ exp( −y_k²/(2λ_k) ) dy_k ] [ ∫_{−∞}^∞ y_i² exp( −y_i²/(2λ_i) ) dy_i ]

and the individual integrals appearing above are:

    ∫_{−∞}^∞ exp( −y_k²/(2λ_k) ) dy_k = √(2πλ_k)   and   ∫_{−∞}^∞ y_i² exp( −y_i²/(2λ_i) ) dy_i = λ_i √(2πλ_i)

(calculated the same way as for the unidimensional case) and |Σ| = Π_{i=1}^N λ_i; so finally:

    U^T E{(x − μ)(x − μ)^T} U = Λ  ⇒  E{(x − μ)(x − μ)^T} = U Λ U^T = Σ  ∎
✍ Remarks:
➥ By applying the transformation (equivalent to a rotation):

    x̃ = U^T (x − μ)

the probability distribution p(x) becomes (similar to the above calculations):

    p(x) = [√|Λ^{−1}| / (2π)^{N/2}] exp( −x̃^T Λ^{−1} x̃ / 2 ) = [√|Λ^{−1}| / (2π)^{N/2}] exp( −Σ_{i=1}^N x̃_i²/(2λ_i) ) = Π_{i=1}^N p_i

where p_i is:

    p_i = [1/√(2πλ_i)] exp( −x̃_i²/(2λ_i) )

and then the probabilities are decoupled, i.e. the components of x̃ are statistically independent.

➥ The expression:

    Δ = √( (x − μ)^T Σ^{−1} (x − μ) )

is called the Mahalanobis distance between the vectors x and μ.

➥ For Δ = const. the probability density is constant, so Δ represents surfaces of equal probability for x.
equal probability for x.
B.2. MODELING THE DENSITY OF PROBABILITY
x2
u2
349
p
2
p
u1
1
x1
Figure B.1:
➥
Equal probability density for Gauss probability density in
two dimensions. The ellipse represents all points where
p(x1 ; x2 ) = e;1=2 . The vector points to the center of
the ellipse.
➥ By applying the transformation x̃ = U^T (x − μ), the Mahalanobis distance becomes:

    Δ² = Σ_{i=1}^N x̃_i² / λ_i

i.e. the surfaces of equal probability are hyper-ellipsoids. The main axes of the ellipsoid are proportional with √λ_i. The vector μ points to the location of highest probability density. See figure B.1.

The transformation x̃ = U^T (x − μ) is from {x_1, x_2} to {u_1, u_2}, i.e. a translation by μ then a rotation such that {u_1, u_2} becomes the new set of versors.

The probability density for a two dimensional pattern space is shown in figure B.2 on the following page.

➥ The number of parameters defining the Gaussian distribution is 1 + ... + N = N(N+1)/2 for Σ (a symmetrical matrix) plus N parameters for μ, so the Gaussian distribution is completely defined by N(N+3)/2 parameters.
Definition B.2.1. An N-dimensional vector x is said to be normal, i.e. x ∼ N_N{μ, Σ}, if it has a Gaussian distribution of the form (B.13) with mean μ and covariance matrix Σ.

✍ Remarks:
➥ Let consider a set of several classes C_1, ..., C_K, such that each may be modelled using a multidimensional Gaussian, each with its own μ_k and Σ_k. Considering that the prior probabilities P(C_k) are equal, then the biggest posterior probability for a given vector x is the one corresponding to the minimum of the Mahalanobis distance: Δ_k(x) = min_ℓ Δ_ℓ(x). This type of classification is named linear discriminant analysis.
Figure B.2: The Gaussian probability density function in a two dimensional pattern space, for μ = 0̂ and Σ = ( 1 0 ; 0 2 ): a) p(x_1, x_2); b) level curves for p. The function was sampled in x_{1,2} = 0.5 steps.
➥ Considering the covariance matrix equal to the identity matrix, i.e. Σ = I, the Mahalanobis distance reduces to the simple Euclidean distance (see the mathematical appendix) and then the pattern vectors x are simply classified to the class C_k with the closest mean μ_k.
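A minimal sketch of such a classifier in Python/NumPy (the class parameters are invented; equal priors are assumed, as in the remark on linear discriminant analysis):

    import numpy as np

    def mahalanobis(x, mu, Sigma):
        d = x - mu
        return np.sqrt(d @ np.linalg.solve(Sigma, d))

    # one (mu_k, Sigma_k) pair per class, equal priors assumed
    params = [(np.array([0.0, 0.0]), np.eye(2)),
              (np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]))]

    x = np.array([2.0, 2.5])
    k = int(np.argmin([mahalanobis(x, mu, S) for mu, S in params]))
    print("class", k)                     # the smallest Mahalanobis distance wins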
B.2.2 The Non-Parametric Method

Non-parametric methods try to solve the problem of finding the probability distribution of the data using a training set, without making any assumption about the form of the distribution function.

Histograms

In the histogram method the pattern space is divided in subspaces (areas) X_k and the probability density is estimated from the number of patterns in each area. See figure B.3 on the next page.

The size of X_k determines the model complexity: if it is too large then it fits the data poorly, if it is too small then it overfits the exceptions/noise which may be present in the training set, i.e. the size of X_k controls the model complexity. See figure B.3 on the facing page.

Let take one area X_k and let K be the number of patterns from the training set which are in the area X_k.

Assuming a sufficiently large number of patterns in the training set (such that the training set is statistically significant), then the probability that a pattern will fall in the area X_k is approximately:

    P(X_k) ≃ K / P

B.2.2 See [Bis95] pp. 49-59 and [Rip96] pp. 190, 201-206.
Figure B.3: Histograms: probability density p versus the pattern space X, for a) a simple model, b) a complex model and c) a good model. The true probability density is shown with a dotted line. The estimated probability density is shown with a thick line. The rectangles represent the number of patterns from the training set which fall in the designated regions X_k. The small circles represent the patterns.
On the other hand, assuming that the probability density is approximately constant, equal to p̃(x), in all the points from X_k, then:

    P(X_k) = ∫_{X_k} p(x) dx ≃ p̃(x) ∫_{X_k} dx = p̃(x) V_{X_k}

where V_{X_k} is the volume of the pattern area X_k. Finally:

    p(x) ≃ p̃(x) = K / (P V_{X_k})        (B.14)

where p̃(x) is the estimated probability density.

To solve (B.14) further, two approaches may be taken:

X_k, respectively V_{X_k}, is fixed and K is counted: this represents the kernel based method.

K is fixed and V_{X_k} is calculated: this represents the K-nearest-neighbors method.
Figure B.4: The kernel-based method. The dashed line is the real probability density p(x); the continuous line is the estimated probability density p̃(x), based on counting the patterns in the hypercube surrounding the current point, represented by a dashed rectangle.
The kernel-based method

Let X_k be a hypercube having the side of length ℓ and being centered in x. Then its volume is V_{X_k} = ℓ^N (N being the dimension of the pattern space).

The following kernel function² H(x) is defined:

    H(x) = 1 if |x_i| < 1/2 for i = 1,N;   H(x) = 0 otherwise

such that H(x) is 1 if the point x is inside the unit hypercube centered in the origin and 0 otherwise. Then H((x − x_p)/ℓ) will indicate if the x_p point from the training set is in the hypercube X_k or not. The total number of patterns falling in X_k is:

    K = Σ_{p=1}^P H( (x − x_p)/ℓ )

and then the estimate for the probability density is:

    p̃(x) = (1/P) Σ_{p=1}^P (1/ℓ^N) H( (x − x_p)/ℓ )

This may be visualized as a sliding hypercube in the pattern space, centered in the current point x. While moving it, some of the x_p points will enter it while others will leave it such that, unless the total number remains constant, p̃(x) will have step jumps. See figure B.4.

The function p̃(x) may be "smoothened" by replacing the kernel function with a continuous one:

    H(x) = [1/(2π)^{N/2}] exp( −‖x‖²/2 )

and then the estimation of the probability density becomes:

    p̃(x) = (1/P) Σ_{p=1}^P [1/(2πℓ²)^{N/2}] exp( −‖x − x_p‖²/(2ℓ²) )

In general, any function bounded by the conditions:

    H(x) ≥ 0   and   ∫_X H(x) dx = 1

is suitable to be used as a kernel function.

Let examine the expectation of the estimated probability density, considering that P → ∞:

    E{p̃(x)} = (1/P) Σ_{p=1}^P E{ (1/ℓ^N) H( (x − x_p)/ℓ ) } = ∫_X (1/ℓ^N) H( (x − x′)/ℓ ) p(x′) dx′

(where x′ represents the integration variable).

This formula shows that the expectation of the estimated probability density is a convolution of the (true) probability density with the kernel function.

For ℓ → 0 and P → ∞ the estimated probability density approaches the true one, while the kernel function approaches the δ-Dirac function.

² Known also as the Parzen window.
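A one-dimensional sketch of the smoothed estimator in Python/NumPy (the training data and the width ℓ are arbitrary assumptions):

    import numpy as np

    def parzen_gauss(x, train, ell):
        """Estimated density p~(x) with Gaussian kernels of width ell (N = 1)."""
        P = len(train)
        return np.exp(-(x - train)**2 / (2*ell**2)).sum() / (P * np.sqrt(2*np.pi) * ell)

    rng = np.random.default_rng(0)
    train = rng.normal(size=200)             # P = 200 samples from N(0, 1)
    for x in (-1.0, 0.0, 1.0):
        print(x, parzen_gauss(x, train, ell=0.3))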
The K-nearest-neighbors method

Let K be fixed and X_k a hyper-sphere centered in x and with variable radius, such that it will always contain the same number K of vectors from the training set. See figure B.5 on the following page.

The estimation of the probability density is found from:

    p̃(x) = K / (P V_{X_k})

The volume V(N) of a sphere of radius r in the N-dimensional space is:

    V(N) = π^{N/2} r^N / Γ_E(N/2 + 1)   where   Γ_E(x) = ∫_0^∞ e^{−t} t^{x−1} dt

Γ_E being the Euler function, see the mathematical appendix.

Let consider a set of classes and a training set. Let P_k be the number of patterns of class C_k in the training set, such that Σ_k P_k = P. Let K_k be the number of patterns of class C_k in the hyper-sphere of volume V. Then:

    p(x | C_k) = K_k / (P_k V),   p(x) = K / (P V)   and   P(C_k) = P_k / P
APPENDIX B. STATISTICAL SIDELINES
p( )
x
X
Figure B.5:
The K{nearest{neighbors based method. The dashed line
is the real probability density p(x); the continuous line is
the estimated probability density pe(x) based on estimating
the volume of the hyper-sphere with variable radius and
represented by a dashed circle (the hyper-sphere is de ned
by xing the K number).
From the Bayes theorem:

    P(C_k | x) = p(x | C_k) P(C_k) / p(x) = K_k / K

which is known as the K-nearest-neighbors classification rule. This means that once the volume of the hyper-sphere was established (by fixing K), a new pattern x is classified as being of that class which has most representatives (K_k) included into the hyper-sphere, i.e. of that class C_k for which:

    P(C_k | x) = max_ℓ P(C_ℓ | x)

(according to the Bayes rule, see the "Pattern Recognition" chapter).

✍ Remarks:
➥ The parameters governing the smoothness of the histogram method are V for the kernel based procedure and K for the K-nearest-neighbors procedure. If this tunable parameter is too large then an excessive smoothness occurs and the resulting model is too simple. If the parameter is chosen too small then the variance of the model is too large: the model will approximate well the probability for the training set but will have poor generalization; the model will be too complex.
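The classification rule itself needs only the K_k counts among the K closest training vectors, as in this Python/NumPy sketch (the two-class data set is invented):

    import numpy as np

    def knn_classify(x, train, labels, K):
        """K-nearest-neighbors rule: P(C_k|x) ~ K_k / K."""
        idx = np.argsort(np.linalg.norm(train - x, axis=1))[:K]  # the K closest patterns
        votes = np.bincount(labels[idx], minlength=labels.max() + 1)
        return int(votes.argmax()), votes / K                    # winning class, posteriors

    rng = np.random.default_rng(0)
    train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    labels = np.array([0]*50 + [1]*50)
    print(knn_classify(np.array([2.0, 2.0]), train, labels, K=7))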
B.2.3 The Semi-Parametric Method

The mixture model

The mixture model considers that the probability density is a superposition of probability densities, each of them contributing to the total with a different weight.

The procedure below is repeated for each class C_k in turn.

Considering a superposition of M probability densities, then:

    p(x) = Σ_{m=1}^M p(x|m) P(m)        (B.15)

where p(x|m) represents the probability density of the pattern being x, among all the patterns generated by the component m of the superposition, and the weight is P(m), the prior probability of the pattern x having been created by the component m of the superposition. M becomes also a parameter of the model.

All these probabilities have to be normalized:

    Σ_{m=1}^M P(m) = 1,   P(m) ∈ [0, 1]   and   ∫_X p(x|m) dx = 1

✍ Remarks:
➥ The training set has the patterns classified in classes but does not have the patterns classified by the superposition components m, i.e. the training set is incomplete and the complete model has to provide a means to determine this.
➥ The posterior probability is given by the Bayesian theorem and is normalized:

    P(m|x) = p(x|m) P(m) / p(x)   and   Σ_{m=1}^M P(m|x) = 1        (B.16)

The problem is to determine the components of the superposition of probability densities.

✍ Remarks:
➥ One possibility is to model the conditional probability densities p(x|m) as Gaussians, defined by the parameters μ_m and Σ_m = σ_m² I:

    p(x|m) = [1/(2πσ_m²)^{N/2}] exp( −‖x − μ_m‖² / (2σ_m²) )        (B.17)

and then a search for the optimal parameters μ and σ may be done.
➥ To avoid singularities the conditions:

    μ_m ≠ x_p, m = 1,M, p = 1,P   and   σ_m ≠ 0, m = 1,M

have to be imposed (x_p are the training vectors).

The maximum likelihood method

The parameters are searched by maximizing the likelihood function, defined as³:

    L = Π_{p=1}^P p(x_p | W)        (B.18)

equivalent to minimizing the negative log-likelihood function E = −ln L, which may act as an error function:

    E = −ln L = −Σ_{p=1}^P ln p(x_p) = −Σ_{p=1}^P ln [ Σ_{m=1}^M p(x_p|m) P(m) ]        (B.19)

³ See the "Pattern Recognition" chapter.
B.2.3 See [Bis95] pp. 59-73.
✍ Remarks:
➥ Considering the Gaussian model, then from (B.19), with p(x_p|m) given by (B.17):

    E = −Σ_{p=1}^P ln [ Σ_{m=1}^M [1/(2πσ_m²)^{N/2}] exp( −‖x_p − μ_m‖²/(2σ_m²) ) P(m) ]

The minimum of E is found by searching for the roots of its derivative, i.e. E is minimum for those values of μ_m and σ_m for which:

    ∇_{μ_m} E = 0  ⇔  Σ_{p=1}^P P(m|x_p) (x_p − μ_m)/σ_m² = 0
    ∂E/∂σ_m = 0  ⇔  Σ_{p=1}^P P(m|x_p) [ N − ‖x_p − μ_m‖²/σ_m² ] = 0

(the denominator of the derivative shouldn't be 0 anyway, also σ_m ≠ 0, see the previous remarks).

The above equations give the following estimates for the μ_m and σ_m parameters:

    μ̃_m = Σ_{p=1}^P P(m|x_p) x_p / Σ_{p=1}^P P(m|x_p)   and   σ̃_m² = Σ_{p=1}^P P(m|x_p) ‖x_p − μ̃_m‖² / [ N Σ_{p=1}^P P(m|x_p) ]        (B.20)

In order to automatically ensure normalization, the P(m) parameters may be expressed by means of M parameters γ_m as follows:

    P(m) = exp(γ_m) / Σ_{q=1}^M exp(γ_q),   γ_m ∈ ℝ, m = 1,M

These expressions are called softmax functions. Then the γ_m are also parameters of the model and E, which depends upon them, has to be minimized with respect to them, i.e. its derivative with respect to γ_m should be 0 at the minimum.

From the softmax expression: ∂P(q)/∂γ_m = δ_{mq} P(m) − P(m) P(q); also ∂E/∂γ_m = Σ_{q=1}^M [∂E/∂P(q)] [∂P(q)/∂γ_m] (δ_{mq} being the Kronecker symbol), and then from (B.19):

    ∂E/∂γ_m = −Σ_{q=1}^M Σ_{p=1}^P [ p(x_p|q) / Σ_{ℓ=1}^M p(x_p|ℓ) P(ℓ) ] [ δ_{mq} P(m) − P(m) P(q) ]

By applying the Bayes theorem and considering the normalization of P(m|x), see (B.16), the first term becomes:

    Σ_{p=1}^P p(x_p|m) P(m) / Σ_{ℓ=1}^M p(x_p|ℓ) P(ℓ) = Σ_{p=1}^P P(m|x_p)

while the second term is:

    Σ_{q=1}^M Σ_{p=1}^P [ p(x_p|q) P(q) / Σ_{ℓ=1}^M p(x_p|ℓ) P(ℓ) ] P(m) = P(m) Σ_{p=1}^P Σ_{q=1}^M P(q|x_p) = P P(m)

so finally ∂E/∂γ_m = P P(m) − Σ_{p=1}^P P(m|x_p), and:

    ∂E/∂γ_m = 0  ⇔  P(m) = (1/P) Σ_{p=1}^P P(m|x_p)        (B.21)
The EM (expectation-maximization) algorithm

The algorithm works with the formulas (B.20) and (B.21) iteratively, in steps, starting with some initial values for the parameters at the first step t = 1 and then recalculating μ̃_{(t+1)m}, σ̃_{(t+1)m} and the estimated P̃_{(t+1)}(m) by using the old values μ̃_{(t)m}, σ̃_{(t)m} and P̃_{(t)}(m) from the previous step. It is supposed that the E function gets smaller at each step till it reaches the minimum.

The variation in the error function E (given by (B.19)), from one step to the next, is:

    ΔE = E_{(t+1)} − E_{(t)} = −Σ_{p=1}^P ln [ p_{(t+1)}(x_p) / p_{(t)}(x_p) ]        (B.22)

and, using (B.15) for p_{(t+1)}(x_p):

    ΔE = −Σ_{p=1}^P ln [ Σ_{m=1}^M ( P_{(t+1)}(m) p_{(t+1)}(x_p|m) / p_{(t)}(x_p) ) ( P_{(t)}(m|x_p) / P_{(t)}(m|x_p) ) ]        (B.23)

✍ Remarks:
➥ The Jensen's inequality states that given a function f, convex down on an interval [a, b], a set of P points in that interval {x_p}_{p=1,P} ∈ [a, b] and a set of numbers {λ_p}_{p=1,P} ∈ [0, 1] such that Σ_{p=1}^P λ_p = 1, then:

    f( Σ_{p=1}^P λ_p x_p ) ≥ Σ_{p=1}^P λ_p f(x_p)

see the mathematical appendix.
By applying the Jensen's inequality to (B.23), with f ↦ ln and λ_p ↦ P_{(t)}(m|x_p) (the P_{(t)}(m|x_p) are normated), then:

    ΔE ≤ −Σ_{p=1}^P Σ_{m=1}^M P_{(t)}(m|x_p) ln [ P_{(t+1)}(m) p_{(t+1)}(x_p|m) / ( p_{(t)}(x_p) P_{(t)}(m|x_p) ) ] ≡ Q        (B.24)

and E_{(t+1)} ≤ E_{(t)} + Q, i.e. E_{(t+1)} is bounded above and it may be minimized by minimizing Q. Generally Q = Q(W_{(t+1)}), the old parameters W_{(t)} being already established at the previous step t. Eventually, minimizing Q is equivalent to minimizing:

    Q̃ = −Σ_{p=1}^P Σ_{m=1}^M P_{(t)}(m|x_p) ln [ P_{(t+1)}(m) p_{(t+1)}(x_p|m) ]        (B.25)

For the Gaussian distribution, see (B.17), Q̃ becomes:

    Q̃ = −Σ_{p=1}^P Σ_{m=1}^M P_{(t)}(m|x_p) [ ln P_{(t+1)}(m) − N ln σ_{(t+1)m} − ‖x_p − μ_{(t+1)m}‖²/(2σ_{(t+1)m}²) ] + const.
The problem is to minimize Q̃ with respect to the (t+1) parameters, i.e. to find the parameters at step t + 1, such that the condition of normalization for P_{(t+1)}(m), Σ_{m=1}^M P_{(t+1)}(m) = 1, is met. The Lagrange multiplier method⁴ is used here. The Lagrange function is:

    L = Q̃ + λ [ Σ_{m=1}^M P_{(t+1)}(m) − 1 ]

The value of the λ parameter is found by putting the set of conditions ∂L/∂P_{(t+1)}(m) = 0, which gives:

    −Σ_{p=1}^P P_{(t)}(m|x_p) / P_{(t+1)}(m) + λ = 0,   m = 1,M

and, by summation over m, and because both P_{(t)}(m|x_p) and P_{(t+1)}(m) are normated:

    Σ_{m=1}^M P_{(t)}(m|x_p) = 1   and   Σ_{m=1}^M P_{(t+1)}(m) = 1

then λ = P.

The required parameters are found by setting the conditions:

    ∇_{μ_{(t+1)m}} L = 0,   ∂L/∂σ_{(t+1)m} = 0   and   ∂L/∂P_{(t+1)}(m) = 0

which give the solutions:

    μ_{(t+1)m} = Σ_{p=1}^P P_{(t)}(m|x_p) x_p / Σ_{p=1}^P P_{(t)}(m|x_p)        (B.26a)

    σ_{(t+1)m} = √( Σ_{p=1}^P P_{(t)}(m|x_p) ‖x_p − μ_{(t+1)m}‖² / [ N Σ_{p=1}^P P_{(t)}(m|x_p) ] )        (B.26b)

    P_{(t+1)}(m) = (1/P) Σ_{p=1}^P P_{(t)}(m|x_p)        (B.26c)

⁴ See the mathematical appendix.
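One full EM step, (B.26a)-(B.26c), for the spherical Gaussian model (B.17), written as a Python/NumPy sketch (the two-component data set and the initialization are illustrative assumptions):

    import numpy as np

    def em_step(X, mu, sigma, Pm):
        """One EM step for a mixture of spherical Gaussians, eqs. (B.26a-c)."""
        P, N = X.shape
        d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(-1)     # ||x_p - mu_m||^2
        pxm = np.exp(-d2 / (2*sigma**2)) / (2*np.pi*sigma**2)**(N/2)   # (B.17)
        post = pxm * Pm
        post /= post.sum(axis=1, keepdims=True)                # P_(t)(m|x_p), (B.16)
        s = post.sum(axis=0)                                   # sum_p P(m|x_p)
        mu = (post.T @ X) / s[:, None]                         # (B.26a)
        d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(-1)     # recomputed with the new mu
        sigma = np.sqrt((post * d2).sum(axis=0) / (N * s))     # (B.26b)
        Pm = s / P                                             # (B.26c)
        return mu, sigma, Pm

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 0.5, (100, 2))])
    mu = X[rng.choice(len(X), 2, replace=False)]               # initial values, step t = 1
    sigma, Pm = np.array([1.0, 1.0]), np.array([0.5, 0.5])
    for _ in range(50):
        mu, sigma, Pm = em_step(X, mu, sigma, Pm)
    print(mu, sigma, Pm)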
As discussed earlier (see the remarks regarding the incomplete training set), usually the data available is not classified in terms of the probability components m, m = 1,M. A variable z_p, z_p ∈ {1, ..., M}, may be associated with each training vector x_p, to hold the probability component. The error function then becomes:

    E = −ln L = −Σ_{p=1}^P ln p_{(t+1)}(x_p, z_p) = −Σ_{p=1}^P ln [ P_{(t+1)}(z_p) p_{(t+1)}(x_p|z_p) ]
      = −Σ_{p=1}^P Σ_{m=1}^M δ_{m z_p} ln [ P_{(t+1)}(m) p_{(t+1)}(x_p|m) ]

P_{(t)}(z_p|x_p) represents the probability of z_p for a given x_p, at step t. The probability of E for a given set of {z_p}_{p=1,P} is the product of all P_{(t)}(z_p|x_p), i.e. Π_{p=1}^P P_{(t)}(z_p|x_p). The expectation E{E} is the sum of E over all the values of {z_p}_{p=1,P}, weighted by the probability of the {z_p}_{p=1,P} set:

    E{E} = Σ_{z_1=1}^M ... Σ_{z_P=1}^M E Π_{p=1}^P P_{(t)}(z_p|x_p)
         = −Σ_{p=1}^P Σ_{m=1}^M [ Σ_{z_1=1}^M ... Σ_{z_P=1}^M δ_{m z_p} Π_{q=1}^P P_{(t)}(z_q|x_q) ] ln [ P_{(t+1)}(m) p_{(t+1)}(x_p|m) ]

On similar grounds as for E{E}, the expression in the square parenthesis from the above equation represents the expectation E{δ_{m z_p}}:

    E{δ_{m z_p}} = Σ_{z_1=1}^M ... Σ_{z_P=1}^M δ_{m z_p} Π_{q=1}^P P_{(t)}(z_q|x_q) = P_{(t)}(m|x_p)

which represents exactly the probability P_{(t)}(m|x_p). Finally:

    E{E} = −Σ_{p=1}^P Σ_{m=1}^M P_{(t)}(m|x_p) ln [ P_{(t+1)}(m) p_{(t+1)}(x_p|m) ]

which is identical with the expression of Q̃, see (B.25); thus minimizing Q̃ is equivalent to minimizing E{E} at the same time.
Stochastic estimation

Let consider that the training vectors come one at a time. For a set of P training vectors the μ_{(P)m} parameter from the Gaussian distribution is (see (B.26a)):

    μ_{(P)m} = Σ_{p=1}^P P(m|x_p) x_p / Σ_{p=1}^P P(m|x_p)

and, after the P + 1 training vector has arrived:

    μ_{(P+1)m} = Σ_{p=1}^{P+1} P(m|x_p) x_p / Σ_{p=1}^{P+1} P(m|x_p) = μ_{(P)m} + η_{(P+1)m} [ x_{P+1} − μ_{(P)m} ]

where:

    η_{(P+1)m} = P(m|x_{P+1}) / Σ_{p=1}^{P+1} P(m|x_p)

To avoid keeping all the old {x_p}_{p=1,P} (to calculate Σ_{p=1}^{P+1} P(m|x_p)), use either (B.21), such that:

    η_{(P+1)m} = P(m|x_{P+1}) / [ (P + 1) P(m) ]

or, directly from the expression of η_{(P+1)m}:

    η_{(P+1)m} = 1 / [ 1 + P(m|x_P) / ( η_{(P)m} P(m|x_{P+1}) ) ]
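The recursion can be exercised for a single component m, with stand-in values for P(m|x_p), to confirm that it reproduces the batch formula (a Python/NumPy sketch; the data and the posteriors are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    xs    = rng.normal(2.0, 1.0, 500)        # training vectors arriving one at a time
    posts = rng.uniform(0.5, 1.0, 500)       # stand-in values for P(m|x_p)

    mu, s = xs[0], posts[0]                  # s accumulates sum_p P(m|x_p)
    for x, pm in zip(xs[1:], posts[1:]):
        s  += pm
        eta = pm / s                         # eta_(P+1)m
        mu += eta * (x - mu)                 # mu_(P+1)m = mu_(P)m + eta (x_(P+1) - mu_(P)m)

    print(mu, np.dot(posts, xs) / posts.sum())   # recursive and batch values coincide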
➧ B.3 The Bayesian Inference

Unlike other techniques, where the probabilities are built by finding the best set of W parameters, the Bayesian inference assumes a probability density for W itself.

The following procedure is repeated for each class C_k in turn. First a prior probability density p(W) is chosen, with a large coverage for the unknown parameters; then, using the training set {x_p}_P, the posterior probability density p(W|{x_p}_P) is found through the Bayes theorem. The process of finding p(W|{x_p}_P), from p(W) and {x_p}_P, is named Bayesian learning.

Let p(x|{x_p}_P) be the probability density for a pattern from {x_p}_P to have its pattern vector x, and let p(x, W|{x_p}_P) be the joint probability density that a pattern from {x_p}_P has its pattern vector x and the parameters of the probability density are defined through W. Then:

    p(x|{x_p}_P) = ∫_W p(x, W|{x_p}_P) dW        (B.27)

the integral being done over all the possible values of W, i.e. in the W space.

p(x, W|{x_p}_P) represents the ratio of pattern vectors being x, with their probability density characterized by W, and being in the training set {x_p}_P, relative to the total number of training vectors:

    p(x, W|{x_p}_P) = (no. patterns being x, with W, in {x_p}_P) / (no. patterns in {x_p}_P)

p(x|W, {x_p}_P) represents the ratio of pattern vectors being x, with their probability density characterized by W, and being in the training set {x_p}_P, relative to the number of training vectors with their probability density characterized by W from the training set {x_p}_P:

    p(x|W, {x_p}_P) = (no. patterns being x, with W, in {x_p}_P) / (no. patterns with W, in {x_p}_P)

p(W|{x_p}_P) represents the ratio of pattern vectors with their probability density characterized by W and being in the training set {x_p}_P, relative to the total number of training vectors:

    p(W|{x_p}_P) = (no. patterns with W, in {x_p}_P) / (no. patterns in {x_p}_P)

Then, from the above equations, it gets that:

    p(x, W|{x_p}_P) = p(x|W, {x_p}_P) p(W|{x_p}_P)

The probability density p(x|W, {x_p}_P) has to be independent of the choice of the statistically valid training set (the same ratio should be found in any training set) and consequently it reduces to p(x|W). Finally:

    p(x|{x_p}_P) = ∫_W p(x|W) p(W|{x_p}_P) dW        (B.28)

✍ Remarks:
➥ By contrast with other statistical methods, which try to find the best set of parameters W, the Bayesian inference method performs a weighted average over the W space, using all the possible sets of W parameters according to their probability of being the right choice.

The probability density of the whole set {x_p}_P is the product of the densities for each x_p (assuming that the set is statistically significant):

    p({x_p}_P|W) = Π_{p=1}^P p(x_p|W)

From the Bayes theorem and using the above equation:

    p(W|{x_p}_P) = p({x_p}_P|W) p(W) / p({x_p}_P) = [ p(W) / p({x_p}_P) ] Π_{p=1}^P p(x_p|W)        (B.29)

p({x_p}_P) plays the role of a normalization factor and, from the condition of normalization for p(W|{x_p}_P), ∫_W p(W|{x_p}_P) dW = 1, it follows that:

    p({x_p}_P) = ∫_W p(W) Π_{p=1}^P p(x_p|W) dW        (B.30)
✍ Remarks:
➥ Let consider a unidimensional Gaussian distribution with the standard deviation σ known, and try to find the μ parameter from a training set {x_p}_P.

The probability density of μ will be modeled also as a Gaussian, characterized by the parameters μ_0 and σ_0:

    p(μ) = [1/(√(2π) σ_0)] exp[ −(μ − μ_0)²/(2σ_0²) ]        (B.31)

where this form of p(μ) expresses the prior knowledge of the probability density for μ, and then a large value for σ_0 should be chosen (large variance).

Using the training set, the posterior probability p(μ|{x_p}_P) is calculated for μ:

    p(μ|{x_p}_P) = [ p(μ) / p({x_p}_P) ] Π_{p=1}^P p(x_p|μ)        (B.32)

Assuming a Gaussian distribution for p(x_p|μ), then:

    p(x_p|μ) = [1/(√(2π) σ)] exp[ −(x_p − μ)²/(2σ²) ]        (B.33)

From (B.30) and (B.33):

    p({x_p}_P) = ∫_{−∞}^∞ p(μ) Π_{p=1}^P p(x_p|μ) dμ
               = [1/((2π)^{(P+1)/2} σ_0 σ^P)] ∫_{−∞}^∞ exp[ −(μ − μ_0)²/(2σ_0²) − (1/(2σ²)) Σ_{p=1}^P (μ − x_p)² ] dμ        (B.34)

Let ⟨x⟩ = (1/P) Σ_{p=1}^P x_p be the mean of the training set.

Replacing (B.31), (B.33) and (B.34) back into (B.32) gives:

    p(μ|{x_p}_P) ∝ exp[ −( P/(2σ²) + 1/(2σ_0²) ) μ² + ( P⟨x⟩/σ² + μ_0/σ_0² ) μ ]

i.e.:

    p(μ|{x_p}_P) = const. exp[ −(1/2) ( 1/σ_0² + P/σ² ) ( μ − ( P⟨x⟩/σ² + μ_0/σ_0² ) / ( 1/σ_0² + P/σ² ) )² ]

(const. being the normalization factor). This expression shows that p(μ|{x_p}_P) is also a Gaussian distribution, characterized by the parameters:

    μ̃ = ( P σ_0² ⟨x⟩ + σ² μ_0 ) / ( P σ_0² + σ² )   and   σ̃ = √( σ_0² σ² / ( P σ_0² + σ² ) )

For P → ∞: μ̃ → ⟨x⟩ and σ̃ → 0. lim_{P→∞} σ̃ = 0 shows that, for P → ∞, μ itself will assume the limit value ⟨x⟩.

➥ There is a relationship between the Bayesian inference and the maximum likelihood methods. The likelihood function is defined as:

    p({x_p}_P|W) = Π_{p=1}^P p(x_p|W) ≡ L(W)

then, from (B.29):

    p(W|{x_p}_P) = L(W) p(W) / p({x_p}_P)

p(W) represents the prior knowledge about W, which is low, and so p(W) should be relatively flat, i.e. all W have (relatively) the same probability (chance). Also, by construction, L(W) is maximum for the most probable value of W; let W̃ be that one. Then p(W|{x_p}_P) has a maximum around W̃.

If p(W|{x_p}_P), maximum in W̃, is relatively sharp, then for P → ∞ the integral (B.28) is dominated by the area around W̃ and:

    p(x|{x_p}_P) = ∫_W p(x|W) p(W|{x_p}_P) dW ≃ p(x|W̃) ∫_W p(W|{x_p}_P) dW = p(x|W̃)

(because ∫_W p(W|{x_p}_P) dW = 1, i.e. p(W|{x_p}_P) is normalized).

The conclusion is that for a large number of training patterns, P → ∞, the Bayesian inference solution (B.27) approaches the maximum likelihood solution p(x|W̃).
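The closed-form parameters μ̃ and σ̃ of the remark above translate directly into code; a Python/NumPy sketch (the true mean, the known σ and the prior parameters are assumptions of the demonstration):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 2.0                              # known standard deviation of the data
    x = rng.normal(1.5, sigma, size=100)     # training set {x_p}
    P, mean_x = len(x), x.mean()

    mu0, sigma0 = 0.0, 10.0                  # broad Gaussian prior p(mu), (B.31)

    mu_post = (P*sigma0**2*mean_x + sigma**2*mu0) / (P*sigma0**2 + sigma**2)
    sd_post = np.sqrt(sigma0**2*sigma**2 / (P*sigma0**2 + sigma**2))
    print(mu_post, sd_post)                  # the posterior tightens around <x> as P grows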
➧ B.4 The Robbins-Monro Algorithm

This algorithm shows a way of finding the roots of a function stochastically defined.

Let consider 2 variables x and w which are correlated, x = x(w). Let E{x|w} be the expectation of x for a given w; this expression defines a function of w:

    f(w) = E{x|w}

this type of functions being named regression functions.

The regression function f expresses the dependency between the mean of x and w. See figure B.6.

Figure B.6: The regression function f(w) approximates the x(w) dependency. The root w_0 of f is found by the Robbins-Monro algorithm.

Let w_0 be the wanted root. It is assumed that x has a finite variance:

    E{(x − f)² | w} = finite        (B.35)

and, without any loss of generality, that f(w) < 0 for w < w_0 and f(w) > 0 for w > w_0, as in figure B.6.

Theorem B.4.1. The root w_0 of the regression function f is found by successive iteration, starting from a value w_1 in the vicinity of w_0, as follows:

    w_{i+1} = w_i − α_i x(w_i)

where the α_i coefficients have to satisfy the following conditions:

    lim_{i→∞} α_i = 0        (B.36a)
    Σ_{i=1}^∞ α_i = ∞        (B.36b)
    Σ_{i=1}^∞ α_i² = finite        (B.36c)

Proof. The x(w) may be expressed as a sum between the regression function f(w) and some noise ε:

    x(w) = f(w) + ε        (B.37)

then, from the definition of the regression function f:

    E{ε|w} = E{x|w} − f(w) = 0

(f is well defined, so its expectation is f itself).

The difference between w_{i+1} and w_0 is:

    w_{i+1} − w_0 = w_i − w_0 − α_i f(w_i) − α_i ε_i

where ε_i is the noise from x(w_i). Taking the expectation of the square of the above expression:

    E{(w_{i+1} − w_0)²} − E{(w_i − w_0)²} = α_i² f²(w_i) + α_i² E{ε_i²} − 2α_i E{(w_i − w_0) f(w_i)}

(the cross terms in ε_i vanish: E{f(w_i) ε_i} = E{f(w_i)} E{ε_i}, because f and ε are statistically independent, so 2α_i² E{f(w_i) ε_i} = 2α_i² f(w_i) E{ε_i} = 0 because of the expectation of ε, and similarly for E{(w_i − w_0) ε_i}).

By repeating the above procedure over N steps and doing the sum gives:

    E{(w_{N+1} − w_0)²} − E{(w_1 − w_0)²} = Σ_{i=1}^N α_i² [ f²(w_i) + E{ε_i²} ] − 2 Σ_{i=1}^N α_i E{(w_i − w_0) f(w_i)}

It is reasonable to assume that w_1 is chosen such that f² is bounded in the chosen vicinity of the searched root; then let f²(w_i) ≤ F for all i ∈ {1, ..., N+1} and let E{ε_i²} ≤ σ² (it is assumed that σ² is bounded, see (B.35) and (B.37)). Then:

    E{(w_{N+1} − w_0)²} − E{(w_1 − w_0)²} ≤ (F + σ²) Σ_{i=1}^N α_i² − 2 Σ_{i=1}^N α_i E{(w_i − w_0) f(w_i)}        (B.38)

E{(w_{N+1} − w_0)²} ≥ 0, as being the expectation of a positive quantity. As it was already discussed, w_1 is chosen such that E{(w_1 − w_0)²} is limited. Then the left part of the equation (B.38) is bounded below.

The (F + σ²) Σ_{i=1}^N α_i² term is also finite because of the condition (B.36c); then, from (B.38):

    0 ≤ 2 Σ_{i=1}^N α_i E{(w_i − w_0) f(w_i)} ≤ (F + σ²) Σ_{i=1}^N α_i² + E{(w_1 − w_0)²} = finite
    ⇒ Σ_{i=1}^∞ α_i E{(w_i − w_0) f(w_i)} = finite

Because of the conditions put on the signs of f and w_i, then (w_i − w_0) f(w_i) > 0, ∀w_i, and its expectation is also positive. Eventually, because of the condition (B.36b) and the above equation, then:

    lim_{i→∞} E{(w_i − w_0) f(w_i)} = 0

i.e. lim_{i→∞} w_i = w_0 and, because f changes sign around w_0, it is a root of the regression function. ∎

B.4 See [Fuk90] pp. 378-380.
✍ Remarks:
➥ Condition (B.36a) ensures that the process of finding the root is convergent. Condition (B.36b) ensures that each iterative correction to the solution w_i is large enough. Condition (B.36c) ensures that the accumulated noise (the difference between x(w) and f(w)) does not break the convergence process.
➥ The Robbins-Monro algorithm allows for finding the roots of f(w) without knowing the exact form of the regression function.
➥ A possible candidate for the α_i coefficients is:

    α_i = 1/i^n,   n ∈ (1/2, 1]
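A sketch of the iteration in Python/NumPy, with α_i = 1/i (the noisy example function is an assumption; only its sign behaviour around the root matters):

    import numpy as np

    rng = np.random.default_rng(0)

    def x_of(w):                             # x(w) = f(w) + noise, with f(w) = w - 2
        return (w - 2.0) + rng.normal(0.0, 0.5)

    w = 0.0                                  # w_1, a starting point near the root
    for i in range(1, 10001):
        a = 1.0 / i                          # a_i = 1/i^n with n = 1
        w = w - a * x_of(w)
    print(w)                                 # converges towards w_0 = 2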
➧ B.5 Learning Vector Quantization

✍ Remarks:
➥ Vector quantization is used in signal processing to replace a vector x ∈ C_k by a representative m_k of its class. The collection of all the representatives {m_k} is named codebook. Then, instead of sending each x, the codebook is sent first and then just the index of the m_k (closest to x) instead of x.

The representatives are chosen by starting with some set and then updating it using a rule similar to the following: at step t a vector x from the "training set" is taken in turn and the representatives are updated/changed in discrete steps:

    m_{k(t+1)} = m_{k(t)} + α(t) (x − m_{k(t)})   for the m_k closest to x
    m_{ℓ(t+1)} = m_{ℓ(t)}   (no change) for all the other m_ℓ

where α(t) is a function of the "discrete time" t, usually decreasing in time.
The m_k vectors should be chosen as representative as possible. It is also possible to choose several m_k vectors per each class (this is particularly necessary if a class is represented by some disconnected subspaces X_k ⊂ X in the pattern space).

There are several methods of updating the code vectors. The simplest method is named LVQ1 and the updating rule is:

    m_{k(t+1)} = m_{k(t)} + α(t) (x − m_{k(t)})   for x ∈ C_k
    m_{ℓ(t+1)} = m_{ℓ(t)} − α(t) (x − m_{ℓ(t)})   for all the other m_ℓ        (B.39)

i.e. not only the m_k is pushed towards an "average" of class C_k but all the others are spread in the opposite direction. The α(t) is chosen to start with some small value, α(0) < 0.1, and decreases linearly towards zero.

Another variant, named OLVQ1, uses the same rule (B.39) but a different α_k(t) function for each m_k:

    α_k(t+1) = α_k(t) / [1 + α_k(t)]   for x ∈ C_k
    α_k(t+1) = α_k(t) / [1 − α_k(t)]   otherwise        (B.40)

i.e. α_k is decreased for the correct class and increased otherwise. The above formula may be justified as follows. Let s(t) be defined as:

    s(t) = +1 for correct classification,   s(t) = −1 otherwise

then, from (B.39), it follows that:

    m_{k(t+1)} = m_{k(t)} + s(t) α(t) [x(t) − m_{k(t)}] = [1 − s(t) α(t)] m_{k(t)} + s(t) α(t) x(t)
    ⇒ m_{k(t)} = [1 − s(t−1) α(t−1)] m_{k(t−1)} + s(t−1) α(t−1) x(t−1)
    ⇒ m_{k(t+1)} = [1 − s(t) α(t)][1 − s(t−1) α(t−1)] m_{k(t−1)}
                 + [1 − s(t) α(t)] s(t−1) α(t−1) x(t−1) + s(t) α(t) x(t)        (B.41)

i.e. the code vectors m_{k(t)} are a linear combination of the training set and of the initial code vectors m_{k(0)}.

Now let consider that x(t−1) = x(t); then consistency requires that m_{k(t+1)} = m_{k(t)} and thus s(t−1) = s(t). x(t−1) and x(t), being identical, should have the same contribution to m_{k(t+1)}, i.e. their coefficients in (B.41) should be equal:

    [1 − s(t) α(t)] s(t−1) α(t−1) = s(t) α(t)

which leads directly to (B.40).
Another procedure to build the code vectors is LVQ2.1. For each x the two closest code vectors are found. Then these two code vectors are updated if:

one is of the same class as x, let m_= be this one;

the other is of a different class, let m_≠ be that one; and

x is near the middle between the two code vectors:

    min( ‖x − m_=‖/‖x − m_≠‖ , ‖x − m_≠‖/‖x − m_=‖ ) > (1 − f)/(1 + f)

where f ≃ 0.25.

✍ Remarks:
➥ The last rule from above may be geometrically interpreted as follows: let consider that x is sufficiently close to the line connecting m_= and m_≠. See figure B.7 on the next page.

In this case it is possible to make the approximations:

    ‖x − m_=‖ ≃ d_=   and   ‖x − m_≠‖ ≃ d_≠
Figure B.7: The geometrical interpretation of the LVQ2.1 rule. The x point is projected onto the line connecting m_= and m_≠; d_= and d_≠ are the distances from the projection to the two code vectors.

and there are 2 cases, one of them being:

    min( d_=/d_≠ , d_≠/d_= ) = d_=/d_≠ > (1 − f)/(1 + f)
    ⇒ d_≠ < d_= (1 + f)/(1 − f)
    ⇒ d = d_≠ + d_= < d_= · 2/(1 − f)
    ⇒ d_=/d > (1 − f)/2

the other case giving a similar result, d_≠/d > (1 − f)/2, i.e. in either case the projection of x is at least at a fraction (1 − f)/2 of d from m_= and from m_≠.
The updating formula for LVQ2.1 is:

    m_{=(t+1)} = m_{=(t)} + α(t) [x − m_{=(t)}]
    m_{≠(t+1)} = m_{≠(t)} − α(t) [x − m_{≠(t)}]

While LVQ2.1 updates the codebook less frequently than the previous procedures, it tends to over-adapt the code vectors.

LVQ3 was developed to prevent the over-adaptation of the LVQ2.1 method. This algorithm is similar to LVQ2.1 but, if the two closest code vectors, let m′_= and m″_= be the ones, are of the same class, then they are also updated according to the formula:

    m′_{=(t+1)} = m′_{=(t)} + ε α(t) [x − m′_{=(t)}]   and   m″_{=(t+1)} = m″_{=(t)} + ε α(t) [x − m″_{=(t)}]

where ε is a tunable parameter, usually chosen in the interval [0.1, 0.5].

✍ Remarks:
➥ In practice usually OLVQ1 is run first (on the rapidly changing code vectors part) and then LVQ1 and/or LVQ3 is used to make the fine adjustments.
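A compact sketch of LVQ1-style updating in Python/NumPy. Note it moves only the winning code vector (towards x for a correct label, away otherwise), a common simplification; the book's rule (B.39) also spreads the remaining code vectors. The data and the schedule for α are illustrative assumptions.

    import numpy as np

    def lvq1_epoch(codebook, code_labels, X, y, alpha):
        """One pass over the training set; only the closest code vector is moved."""
        for x, cls in zip(X, y):
            k = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
            sign = 1.0 if code_labels[k] == cls else -1.0
            codebook[k] += sign * alpha * (x - codebook[k])
        return codebook

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
    y = np.array([0]*100 + [1]*100)
    codebook = np.array([[1.0, 1.0], [3.0, 3.0]])
    code_labels = np.array([0, 1])
    for epoch in range(20):
        codebook = lvq1_epoch(codebook, code_labels, X, y,
                              alpha=0.05 * (1 - epoch/20))     # alpha decreasing in time
    print(codebook)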
Bibliography

[BB95] James M. Bower and David Beeman. The Book of Genesis. Springer-Verlag, New York, 1995. ISBN 0-387-94019-7.

[Bis95] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995. ISBN 0-19-853864-2.

[BTW95] P.J. Braspenning, F. Thuijsman, and A.J.M.M. Weijters, Eds. Artificial Neural Networks. Springer-Verlag, Berlin, 1995. ISBN 3-540-59488-4.

[CG87] G. A. Carpenter and S. Grossberg. ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919-4930, December 1987. ISSN 0003-6935.

[FS92] J. A. Freeman and D. M. Skapura. Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, New York, 2nd edition, 1992. ISBN 0-201-51376-5.

[Fuk90] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, 2nd edition, 1990. ISBN 0-12-269851-7.

[Koh88] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, 2nd edition, 1988. ISBN 0-201-51376-5.

[McC97] Martin McCarthy. What is multi-threading? Linux Journal, (34):31-40, February 1997.

[Mos97] David Mosberger. Linux and the Alpha: how to make your application fly, part 2. Linux Journal, (43):68-75, November 1997.

[Rip96] Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, New York, 1996. ISBN 0-521-46086-7.

[Str81] Karl R. Stromberg. An Introduction to Classical Real Analysis. Wadsworth International Group, Belmont, California, 1981. ISBN 0-534-98012-0.
Index
2/3 rule, 85, 86, 94, 98, 101
certain event, 210
city-block metric, see error, city-block metric
class, 111
classification, 116, 197
clique, 307
codebook, 366
complexity criteria, 273
conditional independence, 308
confusion matrix, 120
conjugate direction, 226, 228
conjugate directions, 226
contrast enhancement, 86, 108
function, see function, contrast enhancement
counterpropagation network, see network, CPN
cross-validation, 273
curse of dimensionality, 113, 237, 241, 255
cycle, 307
activation function, see function, activation
adaline, see perceptron
adaptive backpropagation, see algorithm, backpropagation, adaptive
Adaptive Resonance Theory, see ART
algorithm
ART1, 99
ART2, 107
backpropagation, 16, 162, 320
adaptive, 21, 23, 321
batch, 163
momentum, 20, 23, 24, 321
quick, 224
standard, 13, 19
SuperSAB, 23, 321
vanilla, see standard
BAM, 49, 53, 322
branch and bound, 242
CPN, 81
EM, 181, 240, 357
expectation-maximization, see EM
gradient descent, 218
Levenberg-Marquardt, 234
line search, 225
Metropolis, 296
model trust region, 235
optimal brain damage, 266
optimal brain surgeon, 267
Robbins{Monro, 124, 181, 219
sequential search, 243
simulated annealing, 296
SOM, 40, 321
ANN, iii, 3
output, 112, 190
asymmetric divergence, 125, 211
DAG, 307, 313
decision
boundary, 116, 117, 119
region, 116
tree, 301
splitting, 301
delta rule, 12, 142, 143, 145, 146, 162, 163, 219
deviance, 127
dimensionality reduction, 237, 245
discrete time approximation, 219
distance
Euclidean, 325, 350
Hamming, 325
Kullback-Leibler, see asymmetric divergence
Mahalanobis, 241, 348
distribution, 191
Bernoulli, 136
conditional, 194
Gaussian, 191, 209, 261, 274, 345
multidimensional, 346
unidimensional, 345
Laplacian, 192
BAM, 48
Bayes
rule, 116, 119, 354
theorem, 288, 342
Bayesian learning, 361
bias, 18, 123, 129, 160, 188, 254
average, 254
bias-variance tradeo , 255
Bidirectional Associative Memory, see BAM
bit, 211
Boltzmann constant, 208
edge, 307
eigenvalue, 36, 327, 328
spectrum, 327
eigenvector, 36, 327, 347
encoding
one-of-k, 68, 73, 81, 108, 197, 203, 205,
212, 239
entropy, 208, 210, 302
cross-entropy, 202, 203
differential, 209
equation
Riccati, 33
simple, 32
trivial, 32
error, 163, 176, 186, 212, 215, 218
absolute, 203
bar, 191
city-block metric, 192
cross-entropy, 213
function, 114, 185
gradient, 12{15, 17
Minkowski, 192, 213
quadratic, 232
relative, 203
RMS, 115, 187
root-mean-square, see error, RMS
sum-of-squares, 11, 12, 15, 17, 19, 114,
137, 143, 149, 163, 167, 171, 176,
177, 182, 190, 203, 213, 253, 268,
274, 279{281
surface, 12, 20, 24, 26, 215, 217, 256
Euler-Lagrange equation, 176
evidence, 288
approximation, 287
exception, 115
exemplars, 47
expectation, 87, 90, 343
feature extraction, 114, 237
feedback, 31
auto, 54
indirect, 30, 91
lateral, 41, 91
feedforward network, see network, feedforward
Fisher criterion, 148, 152, 199
Fisher discriminant, 149, 265
flat spot elimination, 20
Fletcher-Reeves formula, 230
function
δ-Dirac, 125, 176, 197, 204, 353
activation, 3-6, 40, 57, 90, 92, 134, 198, 217
exponential-distribution, 7
hyperbolic-tangent, 6
identity, 48, 175, 199
logistic, 5, 6, 10, 15, 19, 135, 163, 202,
206, 259
pulse-coded, 7
ratio-polynomial, 7
sigmoidal, 157, 285
threshold, 5, 6, 7, 135, 154
BAM
energy, 51
contrast enhancement, 102
discriminant, 117
energy
BAM, 52, 56
error, see error
Euler, 192, 292, 331, 353
even, 330
feedback, 30, 74, 75, 91
Gaussian, 174, 175
Green, 176
Heaviside, 160
Hopfield
energy, 55, 56, 57, 59
kernel, 178, 195, 248, 352
Lagrange, 334
Liapunov, 51
likelihood, 121, 123, 185, 192, 202, 205,
261, 315
multi-quadratic, 174
radial basis, 173, 174
regression, 364
softmax, 195, 206, 263, 272, 356
stop, 43
generalization, 15, 112, 115
Gini index, 302
graph, 307
ancestral, 308
boundary, 308
chordal, see graph, triangulated
complete, 307
connected, 307
directed, 307
directed acyclic, see DAG
moral, 314
polytree, 308
tree, 307
chain, 310
join, 312
triangulated, 312
undirected, 307
Hadamard product, iii
Hamming
distance, see distance, Hamming
space, 50, 325
vector, see vector, bipolar
Hessian, 127, 166, 172, 256, 266, 281, 288
inverse, 166, 168
Hestenes-Stiefel formula, 230
histogram, 350
K-nearest-neighbors, 351, 353
kernel method, 351
hyperprior, 288
importance sampling, 296
information, 210
criterion, 128
input normalization, 238
invariance, 247
translation, 250
Jacobian, 164, 247
Jensen's inequality, 132, 335, 357
Karhunen-Loeve transformation, 245
kernel function, see function, kernel
Kohonen, see SOM
Kronecker symbol, 38, 48, 61, 118, 165, 196, 197, 347, 356
L'Hospital rule, 302
Lagrange multiplier, 334
layer
competitive, 74, 75, 88, 91
hidden, 73
input, 68
output, 76
learning, 5, 11, 112, 217
ART1
fast, 94, 95
ART2
fast, 104
batch, 219
Bayesian, 275, 276
constant, 12, 17, 22, 27, 81, 107, 142, 163,
181, 219
adaptive decreasing factor, 22
adaptive increasing factor, 22
error backpropagation threshold, 17
flat spot elimination, 20, 27
momentum, 20, 27, 221
convergence, 146
incomplete, 38, 40, 45
reinforced, 112
sequential, 219
set, 5, 112, see set, learning
speed, 221
stop, 218
supervised, 5, 11, 112, 182
unsupervised, 5, 29, 31, 39, 112, 243
Learning Vector Quantization, see LVQ
least squares technique, 182
lexicographic convention, 65
linear discriminant analysis, 349
linear separability, 129, 132, 146
loss matrix, see matrix, loss
LVQ
LVQ1, 366
LVQ2.1, 367
LVQ3, 368
OLVQ1, 366
macrostate, 207
Markov
chain, 310
properties, 308
matrix
covariance, 198
between-class, 148, 151
within-class, 148, 151
loss, 117, 201
positive definite, 329
pseudo-inverse, 140
memory, 20, 22, 24
associative, 47
interpolative, 47
autoassociative, 48, 54
crosstalk, 52
heteroassociative, 47, 67
Hopfield
continuous, 57, 60, 322
discrete, 54, 60, 322
gain parameter, 57
saturation, 52
microstate, 207
misclassification, 304
penalty, 117, 118
missing data, 97, 239, 306
mixture model, 179, 194
mixture of experts, 271
moment, 248
central, 248
regular, 248
momentum, see algorithm, backpropagation, momentum
Monte Carlo method, 295
multiplicity, 207
Nadaraya-Watson estimator, 178
nat, 211
network
ART, 85, 155
ART1, 85
ART2, 100
autoassociative, 245
backpropagation, 9
BAM, 48
cascade correlation, 264
committee, 218, 268
CPN, 67, 155
architecture, 67
feedforward, 3, 4, 9, 153, 154, 170
growing method, 264
higher order, 5, 250
Hopfield
continuous, 57
discrete, 54
Kohonen, see network, SOM
layer
hidden, 26
output, 197
performance, 113
pruning method, 264
recurrent, 3
SOM, 4, 29
neuron, 3
excitation, 68
gain control, 85, 88
hidden, 72
instar, 71, 73
neighborhood, 31, 39, 41, 42
function, 41
output, 70
outstar, 76
pruning, 268
reset, 85
saliency, 268
winner, 41
neuronal neighborhood, see neuron, neighborhood
Newton direction, 233
noise, 52, 97, 98, 186, 253
norm
L_R, 192
Euclidean, 102, 107, 189
NP-problem, 60
Occam factor, 293
operator
Nabla, 126
outlier, 113, 188, 193, 343
overadaptation, 142
overtraining, 16
Parzen window, see function, kernel
path, 307
pattern, 111
reflectance, 71
space, 112
perceptron, 136, 144, 179
Polak-Ribiere formula, 230
postprocessing, 237
prediction, 191
preprocessing, 237, 248
principal component, 243, 245
probability
class-conditional, 342
density, 343
distribution, 342, 343
doubt, 117
joint, 341
misclassification, 116
posterior, 116, 197, 342
prior, 341
pruning, 266
random walk, 296
Rayleigh quotient, 38, 328
recursive model, 314
reflectance pattern, see pattern, reflectance
regression, 112
regularization, 115, 247, 255
parameter, 176, 248, 255
weight decay, 280
reject area, 119
rejects, 113
representation
marginal, 313
potential, 314
set-chain, see representation, marginal
risk, 117
averaging, 299
Self Organizing Maps, see SOM
sequential parameter estimation, 123
set
learning, 10, 114
test, 15, 199
training, 253
validation, 218, 258
signal function, see function, activation
SOM, 40
SuperSAB, see algorithm, backpropagation, SuperSAB
theorem
noiseless coding, 211
training, see learning
variance, 191, 218, 238, 254, 343
average, 254
input dependent, 193
vector
binary, iv, 50, 96, 99, 136, 155
bipolar, iv, 50, 52, 53
code, 366
normal distribution, 349
orthogonal, 50
target, 114, 197
large, 203
threshold, 55
vertex, 307
child, 307
parent, 307
root, 307
vigilance parameter, 97
weight, 112
decay, see regularization, weight decay
effective number, 274, 289
saliency, 266
shared, 250
soft, 261
space, 31, 39
symmetry, 161, 215
well-determined, 289
whitening, 238
winner-takes-all, see layer, competitive
XOR problem, 134