
Neural Computation

Single Layer Perceptron


Agenda
- Some historical notes
- Linear adaptive filters
- Rosenblatt's perceptron
- Conclusions
Some historical notes
- Rosenblatt's perceptron (1958): a single neuron with adjustable synaptic weights and bias.
- Perceptron convergence theorem (Rosenblatt, 1962): if the patterns used to train the perceptron are drawn from two linearly separable classes, then the algorithm converges and positions the decision surface in the form of a hyperplane between the two classes.
Some historical notes
- A single neuron -> an adaptive filter.
- Widrow & Hoff (1960): the least-mean-square (LMS) algorithm, also known as the delta rule.
- The problem of designing an optimum linear filter: Kolmogorov 1942, Wiener 1949, Zadeh 1953, Gabor 1954.
Linear Adaptive Filters
- Consider a dynamic system with an unknown mathematical characterization.
- A set of labeled input-output data generated by the system at discrete instants of time at some uniform rate is available.
Linear Adaptive Filters
- The problem is also known as system identification (in control theory).
- An adaptive filter involves two processes:
  1) Filtering process: the computation of two signals, an output y(i) and an error signal e(i) = d(i) - y(i).
  2) Adaptive process: the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).
Linear Adaptive Filters
- Note that y(i) = v(i) = Σk=1..m wk(i)xk(i), which is written in matrix form as
  y(i) = xT(i)w(i)
- The manner in which e(i) = d(i) - y(i) is used to control the adjustments to the synaptic weights is determined by the cost function used to derive the adaptive filtering algorithm.
Linear Adaptive Filters
- This is an unconstrained optimization problem: minimize the cost function E(w) with respect to the weight vector w.
- The necessary condition for optimality is ∇E(w*) = 0.
- The strategy is iterative steepest descent: starting with an initial guess w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function E(w) is reduced at each iteration of the algorithm, i.e. E(w(n+1)) < E(w(n)).
Linear Adaptive Filters
- 1) Method of Steepest Descent
- The direction of steepest descent is the direction opposite to the gradient vector g = ∇E(w):
  w(n+1) = w(n) - ηg(n)
- η is the step size or learning-rate parameter.
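As a minimal sketch (not part of the original slides), the steepest-descent rule above can be exercised on an assumed quadratic cost E(w) = ½wTAw - bTw; the matrix A, the vector b, the step size eta, and the iteration count are all illustrative choices.

```python
import numpy as np

# Steepest-descent sketch on an assumed quadratic cost E(w) = 1/2 w^T A w - b^T w.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative positive definite matrix
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b                     # g(n) = grad E(w(n))

w = np.zeros(2)                          # initial guess w(0)
eta = 0.1                                # learning-rate parameter
for n in range(100):
    w = w - eta * grad(w)                # w(n+1) = w(n) - eta * g(n)

print(w, np.linalg.solve(A, b))          # iterate approaches the optimum w* = A^{-1} b
```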
Linear Adaptive Filters
- The method of steepest descent converges to the optimal solution w* slowly.
- The learning-rate parameter η has an influence on the convergence behaviour:
  - When η is small, the transient response of the algorithm is overdamped (3.2a).
  - When η is large, the transient response of the algorithm is underdamped (3.2b).
  - When η exceeds a certain critical value, the algorithm becomes unstable.
Linear Adaptive Filters
- 2) Newton's Method
- The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n), using a second-order Taylor series expansion of the cost function around the point w(n):
  ΔE(w(n)) ≈ gT(n)Δw(n) + ½ ΔwT(n)H(n)Δw(n),  where Δw(n) = w(n+1) - w(n)
- g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian matrix of E(w) (second derivative), H = ∇²E(w).
Linear Adaptive Filters
- H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w.
- Differentiating ΔE(w(n)) ≈ gT(n)Δw(n) + ½ ΔwT(n)H(n)Δw(n) with respect to Δw, the change ΔE(n) is minimized when
  g(n) + H(n)Δw(n) = 0  →  Δw(n) = -H-1(n)g(n)
  w(n+1) = w(n) + Δw(n)
  w(n+1) = w(n) - H-1(n)g(n)
- where H-1(n) is the inverse of the Hessian of E(w).
Linear Adaptive Filters
- Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior of steepest descent.
- Newton's method requires the Hessian H(n) to be a positive definite matrix for all n!
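For contrast with steepest descent, here is a minimal sketch (again, not from the slides) of a Newton step on the same assumed quadratic cost, where the Hessian H = A is constant and a single step reaches the minimum; A and b are illustrative.

```python
import numpy as np

# Newton step sketch for the quadratic cost E(w) = 1/2 w^T A w - b^T w.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian H (positive definite, illustrative)
b = np.array([1.0, 1.0])

w = np.zeros(2)                          # initial guess w(0)
g = A @ w - b                            # gradient g(n)
w = w - np.linalg.solve(A, g)            # w(n+1) = w(n) - H^{-1}(n) g(n)
print(w)                                 # equals w* = A^{-1} b for a quadratic cost
```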
Linear Adaptive Filters
- 3) Gauss-Newton Method
- It is applicable to a cost function that is expressed as the sum of error squares:
  E(w) = ½ Σi=1..n e²(i)
  Note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.
- The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing
  e'(i,w) = e(i) + [∂e(i)/∂w]T|w=w(n) (w - w(n)),  i = 1, 2, ..., n
Linear Adaptive Filters
- In matrix form:
  e'(n,w) = e(n) + J(n)(w - w(n))
  where e(n) is the error vector e(n) = [e(1), e(2), ..., e(n)]T and J(n) is the n-by-m Jacobian matrix of e(n). (The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)].)
- The updated weight vector is
  w(n+1) = arg min_w {½||e'(n,w)||²}, where
  ½||e'(n,w)||² = ½||e(n)||² + eT(n)J(n)(w - w(n)) + ½(w - w(n))TJT(n)J(n)(w - w(n))
- Differentiating this expression with respect to w and setting the result equal to zero gives
Linear Adaptive Filters
  JT(n)e(n) + JT(n)J(n)(w - w(n)) = 0
  w(n+1) = w(n) - (JT(n)J(n))-1JT(n)e(n)
- The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n).
- For the Gauss-Newton iteration to be computable, the matrix product JT(n)J(n) must be nonsingular. JT(n)J(n) is always nonnegative definite, but to ensure that it is nonsingular the Jacobian J(n) must have row rank n.
- → Add the diagonal matrix δI to the matrix JT(n)J(n), where the parameter δ is a small positive constant.
Linear Adaptive Filters
- JT(n)J(n) + δI is positive definite for all n.
- → The Gauss-Newton method is implemented in the following form:
  w(n+1) = w(n) - (JT(n)J(n) + δI)-1JT(n)e(n)
- This is the solution to the modified cost function
  E(w) = ½{δ||w - w(0)||² + Σi=1..n e²(i)}
  where w(0) is the initial value of w.
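A hedged sketch of the regularized Gauss-Newton update above: the helper gauss_newton_step, the toy data X and d, and the constant delta are illustrative assumptions rather than anything defined in the slides. For linear errors e(i) = d(i) - xT(i)w, the Jacobian is simply J = -X, which the usage below exploits.

```python
import numpy as np

def gauss_newton_step(w, J, e, delta=1e-3):
    """One regularized step: w(n+1) = w(n) - (J^T J + delta*I)^{-1} J^T e."""
    m = J.shape[1]
    A = J.T @ J + delta * np.eye(m)        # positive definite by construction
    return w - np.linalg.solve(A, J.T @ e)

# toy usage with linear errors e = d - X w, hence J = -X (illustrative data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
d = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)
w = np.zeros(3)
e = d - X @ w
w = gauss_newton_step(w, -X, e)
print(w)   # close to the least-squares solution for small delta
```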
Linear Adaptive Filters
Linear Least-Squares Filter:
- The single neuron is linear.
- The cost function E(w) used to design the filter consists of the sum of error squares.
- e(n) = d(n) - X(n)w(n), where d(n) is the n-by-1 desired response vector and X(n) is the n-by-m data matrix.
Linear Adaptive Filters
- With the Jacobian of e(n) given by J(n) = -X(n), the Gauss-Newton formula yields
  w(n+1) = w(n) + (XT(n)X(n))-1XT(n)(d(n) - X(n)w(n))
         = (XT(n)X(n))-1XT(n)d(n)
  where (XT(n)X(n))-1XT(n) is recognized as the pseudoinverse of the data matrix X(n).
- The weight vector w(n+1) solves the linear least-squares problem defined over an observation interval of duration n.
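As an illustration of the pseudoinverse solution, a sketch with synthetic data; X, d, w_true, and the noise level are assumptions made only for this example.

```python
import numpy as np

# Linear least-squares filter via the pseudoinverse X^+ = (X^T X)^{-1} X^T.
rng = np.random.default_rng(1)
n, m = 200, 4
X = rng.normal(size=(n, m))                   # n-by-m data matrix X(n)
w_true = np.array([0.5, -1.0, 2.0, 0.0])      # illustrative target weights
d = X @ w_true + 0.05 * rng.normal(size=n)    # n-by-1 desired response d(n)

w = np.linalg.pinv(X) @ d                     # w(n+1) = X^+(n) d(n)
print(w)                                      # close to w_true
```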
Linear Adaptive Filters
Wiener Filter: for an ergodic environment that is also stationary, we may substitute long-term sample averages (time averages) for expectations (ensemble averages).
- Correlation matrix of the input vector x(i): Rx
- Cross-correlation vector between the input vector x(i) and the desired response d(i): rxd
Linear Adaptive Filters
  Rx = E[x(i)xT(i)] = lim n→∞ (1/n) XT(n)X(n)
  rxd = E[x(i)d(i)] = lim n→∞ (1/n) XT(n)d(n)
  wo = lim n→∞ w(n+1) = Rx-1 rxd
  where Rx-1 is the inverse of the correlation matrix Rx.
- The weight vector wo is called the Wiener solution to the linear optimum filtering problem.
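A short sketch of estimating the Wiener solution from time averages, as the slides suggest for an ergodic, stationary environment; the synthetic data and its dimensions are illustrative.

```python
import numpy as np

# Wiener solution estimated with time averages over n samples.
rng = np.random.default_rng(2)
n, m = 5000, 3
X = rng.normal(size=(n, m))                   # rows are input vectors x(i)
w_true = np.array([1.0, 0.5, -0.25])          # illustrative target weights
d = X @ w_true + 0.1 * rng.normal(size=n)     # desired response d(i)

Rx  = (X.T @ X) / n                           # Rx  ~ (1/n) X^T(n) X(n)
rxd = (X.T @ d) / n                           # rxd ~ (1/n) X^T(n) d(n)
w_o = np.linalg.solve(Rx, rxd)                # w_o = Rx^{-1} rxd
print(w_o)                                    # approaches w_true as n grows
```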
Linear Adaptive Filters
- For an ergodic process, the linear least-squares filter asymptotically approaches the Wiener filter as the number of observations approaches infinity.
- The least-mean-square (LMS) algorithm is based on the use of instantaneous values for the cost function:
  E(w) = ½ e²(n), where e(n) is the error signal measured at time n.
Linear Adaptive Filters
- e(n) = d(n) - xT(n)ŵ(n)
- An estimate of the gradient vector is ĝ(n) = -x(n)e(n)
- → ŵ(n+1) = ŵ(n) + ηx(n)e(n)
  where η is the learning-rate parameter.
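A minimal LMS sketch following the update rule above; the synthetic data, the learning rate eta, and the number of samples are illustrative assumptions.

```python
import numpy as np

# LMS sketch: update the weight estimate with the instantaneous error.
rng = np.random.default_rng(3)
n, m = 2000, 3
X = rng.normal(size=(n, m))                   # input vectors x(n), one per row
w_true = np.array([0.2, -0.7, 1.5])           # illustrative target weights
d = X @ w_true + 0.05 * rng.normal(size=n)    # desired response d(n)

w_hat = np.zeros(m)
eta = 0.01                                    # learning-rate parameter
for i in range(n):
    e = d[i] - X[i] @ w_hat                   # e(n) = d(n) - x^T(n) w_hat(n)
    w_hat = w_hat + eta * X[i] * e            # w_hat(n+1) = w_hat(n) + eta x(n) e(n)
print(w_hat)                                  # drifts toward the Wiener solution
```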
Linear Adaptive Filters
- The feedback loop around the weight vector ŵ(n) in the LMS algorithm behaves like a low-pass filter.
- The inverse of the learning-rate parameter η is a measure of the memory of the LMS algorithm.
Linear Adaptive Filters
  ŵ(n+1) = ŵ(n) + ηx(n)[d(n) - xT(n)ŵ(n)]
  ŵ(n) = z-1[ŵ(n+1)]
- The LMS algorithm is a stochastic feedback system: the learning-rate parameter η and the input vector x(n) determine the transmittance of the feedback loop.
Linear Adaptive Filters
- The convergence issue that matters is convergence in the mean square:
  E[e²(n)] → constant as n → ∞
- Assumptions:
  1) The successive input vectors are statistically independent of each other.
  2) At time step n, the input vector is statistically independent of all previous samples of the desired response.
Linear Adaptive Filters
  3) At time step n, the desired response is dependent on the input, but statistically independent of all previous values of the desired response.
  4) The input vector and the desired response are drawn from Gaussian-distributed populations.
- Convergence in the mean square is guaranteed if 0 < η < 2/(sum of the mean-square values of the sensor inputs).
- The LMS algorithm is model independent and therefore robust, which means that small model uncertainty and small disturbances can only result in small estimation errors.
Linear Adaptive Filters
- The LMS algorithm is slow in convergence and sensitive to variations in the eigenstructure of the input.
- The LMS algorithm typically requires a number of iterations equal to about ten times the dimensionality of the input space to reach a steady-state condition.
- The LMS algorithm is particularly sensitive to variations in the condition number (eigenvalue spread) of the correlation matrix Rx of the input vector x:
  χ(Rx) = λmax/λmin
Linear Adaptive Filters
- The learning curve is a plot of the mean-square value of the estimation error, Eav(n), versus the number of iterations n.
- Rate of convergence: the number of iterations n required for Eav(n) to decrease to some arbitrarily chosen value.
- Misadjustment: M = E(∞)/Emin - 1
Linear Adaptive Filters
- Difficulties encountered with the LMS algorithm may be attributed to the fact that the learning-rate parameter is maintained constant.
- Annealing schedule: η(n) = c/n, where c is a constant.
- Search-then-converge schedule: η(n) = η0/(1 + (n/τ)), where τ is a search time constant.
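The two schedules could be coded as below; the constants c, eta0, and tau are illustrative values, not prescribed by the slides.

```python
# Learning-rate schedules eta(n) for a time-varying LMS step size.
def eta_annealing(n, c=1.0):
    return c / n                        # eta(n) = c / n

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    return eta0 / (1.0 + n / tau)       # eta(n) = eta0 / (1 + n/tau)

print(eta_annealing(10), eta_search_then_converge(10))
```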
Perceptron
- The perceptron is based on a nonlinear neuron:
  v = Σi=1..m wixi + b
- In the simplest form of the perceptron there are two decision regions separated by a hyperplane.
Perceptron
  x(n) = [+1, x1(n), x2(n), ..., xm(n)]T
  w(n) = [b(n), w1(n), w2(n), ..., wm(n)]T
- v(n) = wT(n)x(n); the equation wT(n)x(n) = 0 defines the separating hyperplane.
Perceptron
1. If the nth member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, in accordance with the rule:
  - w(n+1) = w(n) if wT(n)x(n) > 0 and x(n) belongs to class L1
  - w(n+1) = w(n) if wT(n)x(n) ≤ 0 and x(n) belongs to class L2
Perceptron
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule:
  - w(n+1) = w(n) - η(n)x(n) if wT(n)x(n) > 0 and x(n) belongs to class L2
  - w(n+1) = w(n) + η(n)x(n) if wT(n)x(n) ≤ 0 and x(n) belongs to class L1
  where the learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.
Perceptron
Fixed-increment convergence theorem: Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that w(n0) = w(n0+1) = w(n0+2) = ... is a solution vector for n0 ≤ nmax.
Perceptron
- The Perceptron Convergence Algorithm (see the sketch below).
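Since the algorithm table itself is not reproduced in this transcript, the following is a hedged sketch of the perceptron convergence algorithm assembled from the update rules above; the toy data, the learning rate eta, the epoch count, and the +1/-1 label encoding (class L1 / class L2) are illustrative assumptions.

```python
import numpy as np

# Perceptron convergence algorithm sketched from the update rules above.
def train_perceptron(X, labels, eta=1.0, epochs=50):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend +1 so w[0] plays the role of the bias b
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, label in zip(X, labels):
            y = 1.0 if w @ x > 0 else -1.0          # sign of v(n) = w^T(n) x(n)
            if y != label:                          # misclassified -> correct w
                w = w + eta * label * x             # equivalent to the +/- eta*x(n) rules above
    return w

# linearly separable toy data (illustrative)
rng = np.random.default_rng(4)
X1 = rng.normal(loc=+2.0, size=(20, 2))             # class L1, label +1
X2 = rng.normal(loc=-2.0, size=(20, 2))             # class L2, label -1
X = np.vstack([X1, X2])
labels = np.array([1.0] * 20 + [-1.0] * 20)
print(train_perceptron(X, labels))
```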
Summary
- The LMS algorithm uses a linear neuron. Continuous learning takes place.
- The perceptron uses the McCulloch-Pitts formal model of a neuron. The learning process is performed for a finite number of iterations and then stops.