Neural Network Classification: Maximizing Zero-Error Density


Minimizing the error means reducing the difference between the target variable and the observed output, which is equivalent to decreasing the error entropy: the output of the system moves closer to the target.

The task is first to find the error distribution and concentrate it around zero by updating the weights in the direction that yields a better result. This suggests a new cost function for neural network classification: the error density at the origin. The method can easily be plugged into the usual back-propagation algorithm, giving a simple and efficient learning scheme. The density is estimated with a Parzen window estimator using a Gaussian kernel; the gradient of this estimate with respect to the weights is then computed and back-propagated in the usual way.

Consider a multi-layer perceptron (MLP) with one hidden layer, a single output y and a two-class target variable t (taking the value 1 or 0). Measure the error as e(n) = t(n) − y(n), n = 1, . . . , N, where N is the total number of examples. Adapting the system to minimize the error entropy is equivalent to adjusting the network weights so as to concentrate the errors, giving an error distribution with a higher peak at the origin. This reasoning leads to the adaptive criterion of maximizing the error density value at the origin, a principle known as Zero-Error Density Maximization (Z-EDM). The objective function therefore becomes

max_w f(0),

where f is the density of the errors e(n).

Let X be the set of input training cases, w the weight vector of the network, e the vector of errors, and f the error density. As the error distribution is not known, we rely on nonparametric estimation using the Parzen window estimator,

f(x) = 1/(N h) Σ_{n=1..N} K((x − e(n)) / h),

where K is the kernel function, here a Gaussian kernel,

K(x) = 1/√(2π) exp(−x²/2).

This is a useful choice because it is continuously differentiable, an essential property when deriving the gradient of the cost function. Hence, the new cost function for neural network classification becomes

f(0) = 1/(√(2π) N h) Σ_{n=1..N} exp(−e(n)² / (2h²)).
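As a minimal sketch of this cost function (function and variable names are illustrative, not from the paper), the Parzen estimate of the error density at the origin can be computed directly from the vector of errors:

```python
import numpy as np

def zero_error_density(errors, h):
    """Parzen-window estimate of the error density at the origin, f(0),
    using a Gaussian kernel with smoothing parameter h."""
    n = len(errors)
    # Gaussian kernel evaluated at -e(n)/h for every sample
    kernel_values = np.exp(-errors**2 / (2 * h**2))
    return kernel_values.sum() / (n * h * np.sqrt(2 * np.pi))
```

Errors concentrated around zero give a larger f(0) — for example, an all-zero error vector scores higher than an all-ones one — which is exactly what the criterion rewards.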

This new criterion can easily substitute for MSE in the back-propagation algorithm. If w is some network weight, the derivative is given by

∂f(0)/∂w = −1/(√(2π) N h³) Σ_{n=1..N} exp(−e(n)² / (2h²)) e(n) ∂e(n)/∂w.

Basically we get

∂f(0)/∂w = −c Σ_{n=1..N} a(n) e(n) ∂e(n)/∂w,

with

a(n) = exp(−e(n)² / (2h²)) and c = 1/(√(2π) N h³).

For the case of MSE, a(n) = 1 for all n. The computation of ∂e(n)/∂w is as usual for the back-propagation algorithm. The procedure is easily extended to multiple-output networks. Taking a target encoding for class Ck as [−1, . . . , 1, . . . , −1], where the 1 appears at the k-th component, and using the multivariate Gaussian kernel with identity covariance, the gradient is straightforward to compute:

∂f(0)/∂w = −1/((2π)^{M/2} N h^{M+2}) Σ_{n=1..N} exp(−‖e(n)‖² / (2h²)) Σ_{j=1..M} e_j(n) ∂e_j(n)/∂w,

where M is the number of output units and e(n) = (e1(n), . . . , eM(n)).
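The weighting a(n) is the practical difference from MSE and is easy to visualize in code. In the sketch below (names are illustrative assumptions; E holds one error vector per row, so it covers both the single-output case with M = 1 and the multiple-output case):

```python
import numpy as np

def zedm_sample_weights(E, h):
    """Per-sample weights a(n) = exp(-||e(n)||^2 / (2 h^2)).
    E has shape (N, M): N examples, M output units.
    Under MSE every sample would get weight 1 instead."""
    squared_norms = (E**2).sum(axis=1)
    return np.exp(-squared_norms / (2 * h**2))
```

Samples with small errors keep a weight near 1, while large-error samples (outliers, for instance) are exponentially down-weighted, so they contribute little to the gradient.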


Having determined ∂f(0)/∂w for all network weights, the weight update is given, for the m-th iteration, by the gradient-ascent rule

w(m+1) = w(m) + η ∂f(0)/∂w.
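As a sketch of the full update, consider a linear single-output model y = X @ w standing in for the MLP output, so that ∂e(n)/∂w = −x(n). All names here are illustrative assumptions, not from the paper:

```python
import numpy as np

def zedm_gradient(w, X, t, h):
    """Gradient of f(0) with respect to w for a linear model y = X @ w.
    Since e(n) = t(n) - y(n), de(n)/dw = -x(n); this minus sign cancels
    the minus sign in the density gradient, leaving
    df(0)/dw = c * sum_n a(n) e(n) x(n), with a(n) = exp(-e(n)^2 / (2 h^2))."""
    N = len(t)
    e = t - X @ w
    a = np.exp(-e**2 / (2 * h**2))
    c = 1.0 / (np.sqrt(2 * np.pi) * N * h**3)
    return c * (a * e) @ X

def ascent_step(w, X, t, h, eta):
    """One gradient-ascent update: move w in the direction
    that increases the zero-error density f(0)."""
    return w + eta * zedm_gradient(w, X, t, h)
```

With targets generated by w = 1, a step from w = 0 moves the weight in the positive direction, increasing f(0).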
The algorithm has two parameters that should be optimally set: the smoothing parameter, h, of the kernel density estimator and the learning rate, η. We can benefit from an adaptive learning-rate procedure: by monitoring the value of the cost function, f(0), it ensures fast convergence and stable training. The rule is as follows. If f(0) increases from one epoch to the next, the algorithm is moving in the right direction, so η is increased by a factor u in order to speed up convergence. However, if η is large enough to decrease f(0), the algorithm makes a restart step and decreases η by a factor d, ensuring that f(0) is being maximized. The restart step is simply a return to the weights of the previous epoch. Although an exhaustive study of the behaviour of the performance surface has not yet been made (this is a topic for future work), we believe that the smoothing parameter h has particular importance in convergence success. If h is increased to infinity, the local optima of the cost function disappear, leaving a unique but biased global maximum to be found.
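The adaptive rule can be sketched as follows; the specific values of u and d are illustrative assumptions (the text only implies u > 1 to speed up and 0 < d < 1 to back off):

```python
def adapt_learning_rate(f0_new, f0_old, eta, w_new, w_old, u=1.2, d=0.5):
    """Adaptive learning-rate rule for maximizing f(0).
    If f(0) increased, keep the new weights and grow eta by u;
    otherwise restart (revert to the previous epoch's weights)
    and shrink eta by d."""
    if f0_new > f0_old:
        return w_new, u * eta
    return w_old, d * eta
```

Each epoch, the training loop would call this with the cost before and after the weight update, keeping or reverting the weights accordingly.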
