Network Learning (Training)


Network learning (training)

Supervised learning

Each example has two components:

• an input pattern;
• a teaching input, which specifies the output the network is expected to produce for that pattern.

Learning algorithms adapt the weights to minimize the difference between the actual outputs produced by the network when the training patterns are presented as input and the corresponding desired outputs (teaching inputs).
Network learning (training)

Supervised learning

[Diagram: supervised learning — the input pattern of each example is fed to the neural network; a comparator compares the network output with the teaching input, driving weight adaptation.]
Network learning (training)

Unsupervised learning

Examples comprise only input patterns.


• The network self-adapts its weights so that they reproduce features and regularities of the training set.
• Because of this, this kind of learning is often associated with the terms regularity discovery or feature detection.
Artificial neural networks: training

Unsupervised learning

[Diagram: unsupervised learning — the input pattern of each example is fed to the neural network, which adapts its own weights.]
Supervised Learning
Supervised learning: Hebb’s Law
• Earliest proposed learning model.
• A correlational model which tries to justify learning in biological neural networks.
• “If two neurons i, j are active (high outputs o_i, o_j) at the same time, the weight w_ij associated with the corresponding connection must be increased.”
• Δw_ij = ε·o_i·o_j (ε = learning rate)
• Problems:
- It does not always lead to correct results.
- If one keeps showing the net the same examples, the weights increase indefinitely (this is not biologically plausible and leads to saturation problems).
Supervised learning: Hebb’s Law

Given a single-layer network with linear activation functions and real inputs, Hebb’s law works only if the input vectors form an orthogonal set.
In an N-dimensional input space, at most N correct associations can be learnt.
In any case the law is important because:
• Some more powerful learning paradigms are loosely inspired by it.
• It is a baseline law against which other learning laws can be compared.
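The Hebbian update Δw_ij = ε·o_i·o_j can be sketched in a few lines; the function and variable names below are illustrative, not part of the original formulation:

```python
import numpy as np

# Hebb's law sketch: w_ij grows whenever the pre- and post-synaptic
# activities o_i and o_j are high at the same time.
def hebb_update(w, o_pre, o_post, eps=0.1):
    # Delta w_ij = eps * o_i * o_j, applied to all pairs via an outer product
    return w + eps * np.outer(o_pre, o_post)

w = np.zeros((2, 2))
w = hebb_update(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# only the connection between the two active neurons is strengthened
```

Repeatedly applying the update to the same example keeps increasing the same weight without bound, which is exactly the saturation problem mentioned above.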
Supervised learning: Perceptrons
• First example of an artificial neural network (Rosenblatt, late 1950s).
• Initially studied to solve pattern-recognition problems from visual stimuli.
• Made up of an input layer (retina) of neurons which binarize the input (visual stimulus) through a function f and are connected to a single output neuron with a step activation function with threshold θ.
[Diagram: binarized inputs f_1, f_2, …, f_n, weighted by w_1, w_2, …, w_n, feed a summation unit Σ with threshold θ that produces the output o.]
Supervised learning: Perceptrons

Perceptron’s learning law (binary classification):

1. A training pattern is given as input and the corresponding output is computed.
2. If the pattern has been classified correctly, repeat step 1 with the next training pattern.
3. If the output is equal to 1 and the teaching input is equal to 0, decrement the weights connected to high inputs by 1 and increment the threshold by 1.
4. If the output is equal to 0 and the teaching input is equal to 1, do the opposite (increment high-input weights and decrement the threshold).
5. Repeat steps 1–4 until the weight values converge.
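The steps above can be sketched as follows on the linearly separable AND problem (the data set and the epoch limit are illustrative choices, not from the original slides):

```python
import numpy as np

# Perceptron learning law sketch with unit increments, as in steps 1-5,
# trained on logical AND (linearly separable, so convergence is guaranteed).
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets  = np.array([0, 0, 0, 1])               # teaching inputs for AND

w = np.zeros(2)
theta = 0.0
for _ in range(20):                             # repeat until convergence
    errors = 0
    for x, t in zip(patterns, targets):
        o = 1 if w @ x > theta else 0           # step activation with threshold
        if o == 1 and t == 0:                   # step 3: output too high
            w -= x                              # decrement high-input weights
            theta += 1
        elif o == 0 and t == 1:                 # step 4: output too low
            w += x
            theta -= 1
        errors += int(o != t)
    if errors == 0:                             # step 5: weights have converged
        break
```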
Supervised learning: Perceptrons
Formally:

O_p = 1 if net_p = Σ_i w_i I_pi > θ, 0 otherwise    p = pattern index, i = input index

Δ_p w_i = (T_p − O_p) I_pi                          T = teaching input

Δ_p θ = (O_p − T_p)                                 I = input, O = output


According to Rosenblatt’s theorem, the algorithm converges to the solution in a finite number of steps, if a solution exists.
Unfortunately, no solution exists when the training set is not linearly separable (e.g., XOR).
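The non-separability caveat can be illustrated by applying the formal rule to XOR: the weights never settle (data set and epoch count below are illustrative assumptions):

```python
import numpy as np

# The rule Δp w_i = (T_p − O_p) I_pi, Δp θ = (O_p − T_p), applied to XOR.
# XOR is not linearly separable, so the updates cycle forever: every
# epoch still misclassifies at least one pattern.
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets  = np.array([0, 1, 1, 0])               # XOR: no separating line exists

w, theta = np.zeros(2), 0.0
for _ in range(100):
    errors = 0
    for x, t in zip(patterns, targets):
        o = 1 if w @ x > theta else 0
        w += (t - o) * x                        # Δp w_i = (T_p − O_p) I_pi
        theta += (o - t)                        # Δp θ = (O_p − T_p)
        errors += int(o != t)
# errors is still positive after 100 epochs: no convergence
```

An epoch with zero errors would mean the weights classify every pattern correctly, i.e., a separating line exists; since XOR has none, every epoch leaves at least one error.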
Supervised learning
Training can be turned into an optimization task:
Minimize a function with a high-dimensional domain (as many
dimensions as the weights in the network) which measures the
difference between the teaching inputs in the training set and the
actual outputs of the network (error function).
An iterative procedure (gradient descent) moves the network
weights along the direction of the negative gradient of the
error function, thus guaranteeing that, at each step, the value
of the error function is decreased, until a local minimum is
reached.
Gradient descent is one of the so-called trajectory-based optimization methods: a single point moves within the input space; each step aims at moving it to a better location.
[Diagram: a feedforward network processing pattern P — inputs i_1 … i_M produce activations o_P,i; each neuron n_j receives net_P,j = Σ_i w_ij o_P,i over its incoming connections; the output neurons n_1 … n_N produce the output vector O_P.]
Supervised learning: Delta Rule
(Widrow & Hoff’s rule)
Given:
• a single-layer network with linear activations ( o_j(W) = Σ_i w_ij i_i )
• a training set T = { (x_p, t_p) : p = 1, …, P }, where P = number of examples
• a squared error function computed on the p-th pattern:
E_p = Σ_{j=1..N} (t_pj − o_pj)² / 2, where N = number of output neurons and o_pj, t_pj are the output/teaching input for neuron j
• a global error function:
E = Σ_{p=1..P} E_p = E(W), where W is the matrix of the weights w_ij associated with the connections i→j (from neuron i to neuron j)
E can be minimized using gradient descent, converging to the local minimum of E which is closest to the initial conditions (corresponding to the values at which the weights are initialized).
Supervised learning: gradient descent

[Diagram: an error surface with two minima x and y, each surrounded by its own basin of attraction.]
Supervised learning: Delta Rule
(Widrow & Hoff’s rule)

The gradient of E has components ∂E/∂w_ij:

∂E/∂w_ij = Σ_p ∂E_p/∂w_ij

From the differentiation rule for composite functions (E = E(O(W))):
∂E_p/∂w_ij = ∂E_p/∂o_pj · ∂o_pj/∂w_ij

From the definition of the error function:
−∂E_p/∂o_pj = (t_pj − o_pj) ≝ δ_pj (the error produced by neuron j when pattern p is input)

Since the neuron activations are linear (o_pj = Σ_i w_ij i_pi):
∂o_pj/∂w_ij = i_pi, therefore ∂E_p/∂w_ij = −δ_pj i_pi

thus ∂E/∂w_ij = −Σ_p δ_pj i_pi
Supervised learning: Delta Rule
(Widrow & Hoff’s rule)

If we apply gradient descent:

Δw_ij ≝ −ε ∂E/∂w_ij = −Σ_p ε ∂E_p/∂w_ij = ε Σ_p δ_pj i_pi = Σ_p Δ_p w_ij

If ε is small enough, one can modify the weights after each single pattern according to the rule:

Δ_p w_ij = ε δ_pj i_pi

NB All these quantities are easily computed.
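The per-pattern rule Δ_p w_ij = ε δ_pj i_pi can be sketched on a toy linear association task (data, learning rate and epoch count below are illustrative assumptions):

```python
import numpy as np

# Delta rule sketch: a single-layer linear network o_j = Σ_i w_ij i_i,
# trained online on noiseless data generated by a known linear map.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                # 50 patterns, 3 inputs
W_true = np.array([[1.0], [-2.0], [0.5]])   # target linear map (illustrative)
T = X @ W_true                              # teaching inputs

W = np.zeros((3, 1))
eps = 0.05
for _ in range(200):                        # passes over the training set
    for x, t in zip(X, T):
        o = x @ W                           # linear activation
        delta = t - o                       # δ_pj = t_pj − o_pj
        W += eps * np.outer(x, delta)       # Δp w_ij = ε δ_pj i_pi
```

Because the data are noiseless and the task is linear, the weights converge to the generating map; with noisy data they converge to the least-squares solution instead.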


Supervised learning: gradient descent
I. Initialize the weights
II. Repeat
For each pattern in the training set:
a. Compute the output produced by the present
configuration of the network
b. Compute the error function
c. Modify weights by moving them in a direction
that is opposite to the direction of the error
function gradient with respect to the weights
until the error function reaches a pre-set value OR
a preset maximum number of iterations is reached.
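The loop above can be sketched on a one-dimensional error function, where the gradient is trivial to write down (the function, step size, and stopping values are illustrative):

```python
# Gradient descent sketch following steps I-II above, minimizing the
# toy error function E(w) = (w − 3)^2, whose minimum is at w = 3.
def E(w):
    return (w - 3.0) ** 2

def dE_dw(w):
    return 2.0 * (w - 3.0)

w = 0.0                                     # I. initialize the weight
eps, target, max_iters = 0.1, 1e-8, 10_000
for _ in range(max_iters):                  # II. repeat ...
    if E(w) <= target:                      # ... until the error is below a pre-set value
        break
    w -= eps * dE_dw(w)                     # move against the gradient direction
```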
Supervised learning: gradient descent

Problems
• It is not possible, in general, to compute the gradient of the
error function with respect to all the weights for any
network configuration. However, some configurations (see
the following slides) do allow gradients to be computed or
approximated.
• Even when this is possible, gradient descent will find a local
minimum, which may be very far from the global one.
Supervised learning:
Generalized Delta Rule (Backpropagation)
The delta rule can be applied only to a particular neural net
(single-layer with linear activation functions).
The delta rule can be generalized and applied to multi-layer
networks with non-linear activation functions.
This is possible for feedforward networks (aka Multi-layer
Perceptrons or MLP) where a topological ordering of neurons
can be defined, as well as a sequential order of activation.
The activation function f(net) of all neurons must be continuous, differentiable and non-decreasing.
net_pj = Σ_i w_ij o_pi for a multi-layer network
(i = index of a neuron forward-connected to j, i.e., i < j if the neurons are ordered starting from the input layer)
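A common activation that meets these requirements is the logistic sigmoid, whose derivative can be expressed through the function itself (a standard choice, offered here as an example rather than the only option):

```python
import numpy as np

# Logistic sigmoid: continuous, differentiable and non-decreasing,
# so it satisfies the requirements on f(net) stated above.
def f(net):
    return 1.0 / (1.0 + np.exp(-net))

# Its derivative reuses the forward value: f'(net) = f(net) * (1 − f(net)),
# which makes the f'_j(net_pj) terms in the delta formulas cheap to compute.
def f_prime(net):
    s = f(net)
    return s * (1.0 - s)
```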
Supervised learning:
Generalized Delta Rule (Backpropagation)
We want to apply gradient descent to minimize the total squared error (this holds for any topology):
Δ_p w_ij = −ε ∂E_p/∂w_ij
Notice that E_p = E_p(O_p) = E_p(f(net_p(W)))
Applying the differentiation rule for composite functions:
∂E_p/∂w_ij = ∂E_p/∂o_pj · ∂o_pj/∂net_pj · ∂net_pj/∂w_ij = ∂E_p/∂net_pj · ∂net_pj/∂w_ij
∂net_pj/∂w_ij = ∂/∂w_ij (Σ_k w_kj o_pk) = o_pi
Supervised learning:
Generalized Delta Rule (Backpropagation)
If we define δ_pj = −∂E_p/∂net_pj, we obtain:
∂E_p/∂w_ij = −δ_pj o_pi, i.e., Δ_p w_ij = ε δ_pj o_pi
(the same formulation as for the Delta rule)

We know o_pi; we need to compute δ_pj
(defined as for the delta rule: in fact, for a linear neuron, o_pj = net_pj).
Supervised learning:
Generalized Delta Rule (Backpropagation)
δ_pj = −∂E_p/∂net_pj = −∂E_p/∂o_pj · ∂o_pj/∂net_pj
Notice that o_pj = f_j(net_pj) and f depends on a single variable, thus
∂o_pj/∂net_pj = do_pj/dnet_pj = f’_j(net_pj)

If the j-th neuron is an output neuron, then
∂E_p/∂o_pj = −(t_pj − o_pj)
Therefore, for such neurons, δ_pj = (t_pj − o_pj) f’_j(net_pj)
Supervised learning:
Generalized Delta Rule (Backpropagation)
Instead, if the j-th neuron is a hidden neuron (here is where topology becomes important):
∂E_p/∂o_pj = Σ_{k>j} ∂E_p/∂net_pk · ∂net_pk/∂o_pj =
= Σ_k ∂E_p/∂net_pk · ∂/∂o_pj (Σ_i w_ik o_pi) =
= Σ_k ∂E_p/∂net_pk · w_jk = −Σ_k δ_pk w_jk
This means that, for the hidden neurons,
δ_pj = f’_j(net_pj) · Σ_{k>j} δ_pk w_jk
In a MLP we know the values of all these terms!

Therefore, for the hidden neurons, δ_pj can be computed recursively starting from the output neurons (error backpropagation).
Supervised learning:
Backpropagation algorithm
1. The weights are initialized.
2. The p-th pattern is given as input:
• the corresponding network outputs o_pj are computed;
• the deltas are computed for the output layer:
δ_pj = (t_pj − o_pj) f’_j(net_pj) (output layer)
• the deltas for the hidden layers,
δ_pj = f’_j(net_pj) · Σ_k δ_pk w_jk,
are computed iteratively, starting from the hidden layer closest to the outputs.
3. The weights are modified as follows: Δ_p w_ij = ε δ_pj o_pi
4. Steps 2 and 3 are repeated until convergence.
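The algorithm can be sketched end-to-end on XOR with a small sigmoid MLP; the layer sizes, learning rate, epoch count, and use of bias terms are illustrative choices, not prescribed by the slides:

```python
import numpy as np

# Backpropagation sketch for a 2-4-1 MLP with sigmoid units, trained on XOR.
def f(net):
    return 1.0 / (1.0 + np.exp(-net))          # o = f(net), f' = o(1 − o)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)  # XOR teaching inputs

rng = np.random.default_rng(0)                 # small random initial weights
W1 = rng.uniform(-0.5, 0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.uniform(-0.5, 0.5, size=(4, 1)); b2 = np.zeros(1)
eps = 0.5

errors = []
for _ in range(5000):
    E = 0.0
    for x, t in zip(X, T):
        h = f(x @ W1 + b1)                     # hidden activations
        o = f(h @ W2 + b2)                     # network output
        # output layer: δ_pj = (t_pj − o_pj) f'(net_pj)
        d_out = (t - o) * o * (1 - o)
        # hidden layer: δ_pj = f'(net_pj) Σ_k δ_pk w_jk (backpropagated)
        d_hid = h * (1 - h) * (W2 @ d_out)
        # weight updates: Δp w_ij = ε δ_pj o_pi
        W2 += eps * np.outer(h, d_out); b2 += eps * d_out
        W1 += eps * np.outer(x, d_hid); b1 += eps * d_hid
        E += float(np.sum((t - o) ** 2) / 2)   # squared error on this pattern
    errors.append(E)
```

The error on the training set decreases over the epochs, which the perceptron rule alone could not achieve on XOR.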
Supervised learning: Backpropagation
Weights should be updated after processing all the training patterns (batch learning). In fact: Δw_ij = Σ_p Δ_p w_ij = ε Σ_p δ_pj o_pi
If ε is small, the same result is obtained by updating the weights after processing each pattern (online learning).
If the weights are initialized to 0, convergence problems occur: usually different small positive random values (in a range such as 0.05–0.1) are used for weight initialization.
For this procedure to be an actual gradient descent, ε should be infinitesimal, but the smaller ε, the slower the convergence.
However, if ε is too large one might ‘fly over’ a minimum. We may want this to happen, if it is a local minimum (see the next slides).
Gradient descent is not computationally efficient.
Supervised learning:
Practical problems when using the MLP
• Setting the network structure: how many neurons in how
many layers? The only strict constraint is imposed on the
number of neurons in the input and in the output layers.
Usually, the layers closer to the input are larger (they include
more neurons), to allow enough degrees of freedom to
recombine inputs appropriately; the following layers are
generally smaller, to favor generalization and limit overfitting.

Incremental algorithms exist that start from very few neurons and add them step by step until good performance is reached, or that start from a large network and then ‘prune’ it by removing the connections associated with weights that are smaller than a threshold.
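The pruning idea can be sketched as masking out weights below a magnitude threshold (the threshold value and weight matrix are illustrative):

```python
import numpy as np

# Pruning sketch: remove connections whose weights are smaller in
# magnitude than a threshold, keeping only the strong connections.
def prune(W, threshold=0.05):
    mask = np.abs(W) >= threshold          # True where the connection survives
    return W * mask                        # pruned connections become 0

W = np.array([[0.8, -0.01],
              [0.03, -0.6]])
W_pruned = prune(W)                        # the two near-zero weights are cut
```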
Supervised learning:
Practical problems when using the MLP
• Setting the learning rate: a large learning rate makes it easier for the network to escape local minima (which often have a narrow ‘basin of attraction’) and jump into others; a small learning rate generates a more literal gradient descent into the closest local minimum. Usually we first want to explore the space (large ε) and then, once we have reached a good point, refine the search (small ε).
Supervised learning:
Practical problems when using the MLP
• Setting a stopping condition: typically, a maximum number of iterations must be set. Restarting from the same weight configuration is always possible; going back is not!
It is good practice to save the network configuration whenever a new minimum is reached (NB you need a validation set here!) and to finally stop the algorithm when overfitting starts occurring.
In fact, usually, in the first (NB could be many!) iterations one can
observe a decrease in the error over both the training and the
validation set. After a certain number of iterations the network
will start overfitting the training data, i.e., the error will still be
decreasing on the training set, but will start increasing on the
validation set.
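The save-on-new-minimum policy can be sketched as follows; the error curves below are made-up numbers standing in for a real training run, and the patience value is an illustrative choice:

```python
# Early-stopping sketch: record the epoch of each new validation-error
# minimum (where the network would be saved) and stop once the
# validation error has kept rising for `patience` epochs.
train_err = [0.90, 0.70, 0.50, 0.40, 0.30, 0.25, 0.20, 0.18, 0.16, 0.15]
val_err   = [0.95, 0.80, 0.60, 0.50, 0.45, 0.44, 0.47, 0.50, 0.55, 0.60]

best_epoch, best_val, patience, worse = 0, float("inf"), 3, 0
for epoch, v in enumerate(val_err):
    if v < best_val:
        best_val, best_epoch, worse = v, epoch, 0   # new minimum: save the net here
    else:
        worse += 1
        if worse >= patience:                       # overfitting has set in
            break
```

Note how the training error keeps decreasing throughout while the validation error turns upward: the saved configuration is the one from the validation minimum, not the last one.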
Supervised learning:
Practical problems when using the MLP
• The gradient descent is a deterministic algorithm, but weight
initialization is a random process, which means that the results
obtained by running the network several times on the same data
with different random sequences are instances of a random
variable, and must be treated as such!
• Hence, it makes no sense to compare different network configurations based on a single run of each. Appropriate statistics must be collected by running the BP algorithm several times for each configuration, and appropriate statistical tests must be applied to show that the distributions of the results, and especially their mean values, are significantly different.
• Never forget to consider that randomly drawing the instances of
the training/validation/test sets from a larger data set is already
enough to make the whole training procedure stochastic.
