Network Learning (Training)
Supervised learning

[Diagram: for each example, the input pattern is fed to the neural network;
a comparator matches the network output against the teaching input and
drives the weight adaptation.]
Unsupervised learning

[Diagram: each example is fed to the neural network, and the weight
adaptation is driven by the inputs alone, with no teaching input.]
Supervised Learning
Supervised learning: Hebb’s Law
• The earliest proposed learning model
• A correlational model that tries to explain learning in
  biological neural networks
• “If two neurons i, j are active (high outputs oi, oj) at the same
  time, the weight wij associated with the corresponding
  connection must be increased”
• Δwij = ε oi oj      (ε = learning rate; see the sketch below)
• Problems:
  - It does not always lead to correct results
  - If one keeps showing the net the same examples, the weights
    increase indefinitely (this is not biologically plausible and
    leads to saturation problems)
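As a concrete illustration, here is a minimal NumPy sketch of the Hebbian
update (the sizes and output values are assumptions made for the example,
not part of the slides):

import numpy as np

eps = 0.1                                # learning rate (epsilon)
W = np.zeros((3, 2))                     # W[i, j] = weight wij from neuron i to neuron j
o_pre = np.array([1.0, 0.0, 1.0])        # outputs oi of the presynaptic neurons
o_post = np.array([1.0, 0.5])            # outputs oj of the postsynaptic neurons

# Hebb's law: Δwij = eps * oi * oj, applied to every connection at once
W += eps * np.outer(o_pre, o_post)

# Showing the net the same example over and over makes the weights grow
# without bound (the saturation problem mentioned above)
for _ in range(1000):
    W += eps * np.outer(o_pre, o_post)
print(W.max())                           # about 100, and still growing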
Supervised learning: Hebb’s Law
[Network diagram: input units i1 … iM feed the network’s neurons;
oP,i denotes the output of neuron ni for pattern P, OP the network
output, and wi+k,j the weight of the connection from neuron ni+k
to neuron nj.]
Supervised learning: Delta Rule
(Widrow & Hoff’s rule)
Given:
• A single-layer network with linear activations ( oj(W) = Σi wij ii )
• a training set T = { (xp, tp) : p = 1, …, P }      P = n. of examples
• a squared error function computed on the p-th pattern
  Ep = Σj=1,N (tpj − opj)² / 2       N = n. of output neurons,
  opj, tpj = output / teaching input for neuron j
• a global error function
  E = Σp=1,P Ep = E(W),      W = matrix of the weights wij
  associated with the connections i→j
  (from neuron i to neuron j)
E can be minimized using gradient descent, which converges to the local
minimum of E closest to the initial conditions (i.e., to the values at
which the weights are initialized).
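A minimal NumPy sketch of the delta rule on this kind of network, with
per-pattern updates; the toy training set and the sizes P, M, N are
assumptions made for illustration:

import numpy as np

rng = np.random.default_rng(0)
P, M, N = 20, 4, 2                       # n. of examples, inputs, output neurons
X = rng.normal(size=(P, M))              # input patterns xp
T = X @ rng.normal(size=(M, N))          # teaching inputs tp (linear targets, so E can reach ~0)

W = np.zeros((M, N))                     # weight matrix, wij from input i to output neuron j
eps = 0.01                               # learning rate

for epoch in range(200):
    for p in range(P):
        o = X[p] @ W                              # linear activation: opj = Σi wij ipi
        W += eps * np.outer(X[p], T[p] - o)       # delta rule: Δpwij = eps (tpj − opj) ipi

E = 0.5 * np.sum((T - X @ W) ** 2)       # global error E = Σp Ep
print(E)                                 # close to 0 after training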
Supervised learning: gradient descent
[Figure: an error surface with two local minima, x and y, and their
respective basins of attraction.]
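A tiny numeric illustration (not from the slides) of how the basin of
attraction determines which minimum gradient descent reaches; the
double-well error function E(w) = (w² − 1)², with minima at w = ±1, is
an arbitrary choice:

# Gradient descent on E(w) = (w^2 - 1)^2, which has local minima at w = -1 and w = +1
def grad(w):
    return 4.0 * w * (w * w - 1.0)       # dE/dw

def descend(w, eps=0.05, steps=200):
    for _ in range(steps):
        w -= eps * grad(w)               # gradient descent step
    return w

print(descend(-0.3))                     # starts in the basin of w = -1, ends near -1
print(descend(+0.3))                     # starts in the basin of w = +1, ends near +1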
Supervised learning: Delta Rule
(Widrow & Hoff’s rule)
Problems
• It is not possible, in general, to compute the gradient of the
error function with respect to all the weights for any
network configuration. However, some configurations (see
the following slides) do allow gradients to be computed or
approximated.
• Even when this is possible, gradient descent will find a local
minimum, which may be very far from the global one.
Supervised learning:
Generalized Delta Rule (Backpropagation)
The delta rule can be applied only to a particular neural net
(single-layer with linear activation functions).
The delta rule can be generalized and applied to multi-layer
networks with non-linear activation functions.
This is possible for feedforward networks (aka Multi-layer
Perceptrons or MLP) where a topological ordering of neurons
can be defined, as well as a sequential order of activation.
The activation function f(net) of all neurons must be continuous,
differentiable and non-decreasing.
netpj = Σi wij opi       for a multi-layer network
(i = index of a neuron with a forward connection to j, i.e., i < j if
neurons are ordered starting from the input layer)
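A minimal sketch of the forward pass of such a feedforward network, with
neurons activated layer by layer (i.e., following the topological order);
the layer sizes and the sigmoid activation are illustrative assumptions:

import numpy as np

def sigmoid(net):
    # a continuous, differentiable, non-decreasing activation f(net)
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                         # input, two hidden layers, output
Ws = [rng.normal(scale=0.5, size=(m, n))     # Ws[l][i, j] = wij between layer l and l+1
      for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, Ws):
    # netpj = Σi wij opi, with i ranging over the neurons feeding into j;
    # visiting the layers in order respects the topological ordering of the neurons
    o = x
    for W in Ws:
        net = o @ W
        o = sigmoid(net)
    return o

print(forward(rng.normal(size=4), Ws))       # network output Op for one input pattern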
Supervised learning:
Generalized Delta Rule (Backpropagation)
We want to apply gradient descent to minimize the total
squared error (this holds for any topology):

Δpwij = −ε ∂Ep/∂wij

Notice that Ep = Ep(Op) = Ep(f(netp(W))).

Applying the differentiation rule for composite functions (the chain rule):

∂Ep/∂wij = ∂Ep/∂opj · ∂opj/∂netpj · ∂netpj/∂wij

(the product of the first two factors is ∂Ep/∂netpj)

∂netpj/∂wij = ∂/∂wij ( Σk wkj opk ) = opi
Supervised learning:
Generalized Delta Rule (Backpropagation)
If we define δpj = −∂Ep/∂netpj, we obtain:

Δpwij = ε δpj opi
(same formulation as for the Delta rule)
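A minimal sketch of this update for the output layer, assuming the squared
error Ep and sigmoid activations (all sizes and values are illustrative):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(1)
eps = 0.1                                    # learning rate
o_prev = rng.uniform(size=5)                 # outputs opi of the layer feeding the output layer
W_out = rng.normal(scale=0.5, size=(5, 2))   # weights wij into the 2 output neurons
t = np.array([1.0, 0.0])                     # teaching input tpj

net = o_prev @ W_out                         # netpj = Σi wij opi
o = sigmoid(net)                             # opj = f(netpj)

# For an output neuron: δpj = −∂Ep/∂netpj = (tpj − opj) f'(netpj),
# and f'(net) = o (1 − o) for the sigmoid
delta = (t - o) * o * (1.0 - o)

# Generalized delta rule: Δpwij = eps δpj opi
W_out += eps * np.outer(o_prev, delta)

For hidden neurons, δpj is obtained by propagating the output deltas
backwards through the weights, which is where the name backpropagation
comes from.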