Neural Network Classification: Maximizing Zero-Error Density
Zero-Error Density
Minimizing the error is equivalent to decreasing the difference between the target variable and the observed output variable, which in turn is equivalent to decreasing the error entropy. Thus, the output of the system gets closer to the target.
The task involves first finding the error distribution and centering it around zero by updating the weights in the direction that yields a better result. This gives rise to a new cost function for neural network classification: the error density at the origin. The method can be easily plugged into the usual back-propagation algorithm, giving a simple and efficient learning scheme. First, the error density is estimated with the Parzen window estimator using a Gaussian kernel; the gradient of this estimate is then computed and back-propagated through the network, and the weights are updated by gradient ascent on the criterion (equivalently, gradient descent on its negative).
Let X be the set of input training cases, w the initial weight vector of the network, e the error vector and f the error density. As the error distribution is not known, we rely on nonparametric estimation using the Parzen window estimator given by

\hat{f}(e) = \frac{1}{N} \sum_{i=1}^{N} K_h(e - e_i),

with the function K_h being the kernel function; here it is a Gaussian kernel given by

K_h(e) = \frac{1}{(2\pi)^{d/2} h^{d}} \exp\!\left(-\frac{\|e\|^{2}}{2h^{2}}\right),

where N is the number of training cases, d is the dimension of the error vector (the number of network outputs) and h is the smoothing parameter.
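As a concrete illustration, the sketch below is a minimal NumPy implementation (not the authors' code; the function name and array shapes are our own assumptions) of the Parzen window estimate of the error density at the origin, \hat{f}(0), for a batch of error vectors.

```python
import numpy as np

def zero_error_density(errors, h):
    """Parzen window estimate of the error density at the origin, f_hat(0).

    errors : (N, d) array of error vectors e_i = t_i - y_i
    h      : smoothing parameter of the Gaussian kernel
    """
    n, d = errors.shape
    # Gaussian kernel normalisation constant, (2*pi)^(-d/2) * h^(-d)
    norm_const = 1.0 / ((2.0 * np.pi) ** (d / 2) * h ** d)
    # K_h(0 - e_i) = K_h(e_i) because the Gaussian kernel is symmetric
    sq_norms = np.sum(errors ** 2, axis=1)
    return norm_const * np.mean(np.exp(-sq_norms / (2.0 * h ** 2)))
```

A batch of errors concentrated near zero yields a larger value of \hat{f}(0) than a widely spread one, which is exactly the behaviour the criterion rewards.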
This new criterion can easily substitute the MSE in the back-propagation algorithm. The quantity to maximize is the estimated density at the origin,

\hat{f}(0) = \frac{1}{N} \sum_{i=1}^{N} K_h(e_i),

using the symmetry of the Gaussian kernel, K_h(0 - e_i) = K_h(e_i). If w is some network weight, then the derivative is given by

\frac{\partial \hat{f}(0)}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial K_h(e_i)}{\partial w}.

Basically we get

\frac{\partial \hat{f}(0)}{\partial w} = -\frac{1}{N h^{2}} \sum_{i=1}^{N} K_h(e_i)\, e_i^{T}\, \frac{\partial e_i}{\partial w},

with e_i = t_i - y_i, so that \partial e_i / \partial w = -\partial y_i / \partial w is obtained by the usual back-propagation of the network outputs.
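The chain-rule structure above maps directly onto standard back-propagation: only the derivative of \hat{f}(0) with respect to each error vector is needed, and it plays the role that the MSE error signal normally plays. A hedged NumPy sketch (names and shapes are our assumptions, not the authors' code):

```python
import numpy as np

def zed_gradient_wrt_errors(errors, h):
    """Gradient of f_hat(0) with respect to each error vector e_i.

    Returns an (N, d) array whose i-th row is d f_hat(0) / d e_i; feeding this
    error signal into back-propagation yields d f_hat(0) / d w, just as the
    MSE error signal would.
    """
    n, d = errors.shape
    norm_const = 1.0 / ((2.0 * np.pi) ** (d / 2) * h ** d)
    sq_norms = np.sum(errors ** 2, axis=1, keepdims=True)            # (N, 1)
    kernel_vals = norm_const * np.exp(-sq_norms / (2.0 * h ** 2))    # (N, 1)
    # d f_hat(0) / d e_i = -(1 / (N * h^2)) * K_h(e_i) * e_i
    return -(kernel_vals * errors) / (n * h ** 2)
```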
The algorithm has two parameters that should be optimally set: the smoothing parameter, h, of the kernel density estimator and the learning rate, \eta. We can benefit from an adaptive learning-rate procedure. By monitoring the value of the cost function, \hat{f}(0), the adaptive procedure ensures fast convergence and stable training. The rule is given by

\eta^{(k+1)} = u\,\eta^{(k)}, with u > 1, if \hat{f}(0) increased at epoch k;
\eta^{(k+1)} = d\,\eta^{(k)}, with 0 < d < 1 and a restart step, if \hat{f}(0) decreased at epoch k.

If \hat{f}(0) increases from one epoch to the next, the algorithm is moving in the right direction, so \eta is increased by a factor u in order to speed up convergence. However, if \eta is large enough to make \hat{f}(0) decrease, then the algorithm performs a restart step and decreases \eta by a factor d to ensure that \hat{f}(0) is being maximized. The restart step is simply a return to the weights of the previous epoch.
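The following sketch shows one way the adaptive rule and restart step could be wired into a training loop, reusing the zero_error_density and zed_gradient_wrt_errors helpers sketched earlier. The network interface (forward, backprop, get_weights, set_weights, apply_update) is hypothetical and not a particular library's API; it is a minimal sketch under those assumptions, not the authors' implementation.

```python
import copy

def train_zed(net, data, targets, h, lr=0.01, u=1.2, d=0.5, epochs=100):
    """Gradient-ascent training on f_hat(0) with the adaptive learning rate
    and restart rule described above.

    `net` is assumed (hypothetically) to expose:
      forward(X) -> outputs, backprop(dJ_de) -> weight gradients,
      get_weights() / set_weights(w), and apply_update(grads, lr),
    where apply_update performs the ascent step w <- w + lr * grad.
    """
    prev_f0 = -float("inf")
    prev_weights = copy.deepcopy(net.get_weights())
    for epoch in range(epochs):
        errors = targets - net.forward(data)            # e_i = t_i - y_i
        f0 = zero_error_density(errors, h)              # current cost value
        if f0 >= prev_f0:
            # Cost improved: keep these weights and speed up (eta <- u * eta)
            prev_f0, prev_weights = f0, copy.deepcopy(net.get_weights())
            lr *= u
        else:
            # Cost decreased: restart from the previous epoch's weights
            # and slow down (eta <- d * eta)
            net.set_weights(prev_weights)
            lr *= d
            errors = targets - net.forward(data)        # recompute after restart
        dJ_de = zed_gradient_wrt_errors(errors, h)      # d f_hat(0) / d e_i
        grads = net.backprop(dJ_de)                     # chain rule through the net
        net.apply_update(grads, lr)                     # ascent step on f_hat(0)
    return net
```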
Although an exhaustive study of the behaviour of the performance surface has not yet been made (this is a topic for future work), we believe that the smoothing parameter h has a particular importance for the success of convergence. As h is increased towards infinity, the local optima of the cost function disappear, leaving a unique but biased global maximum to be found.
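A loose numerical illustration of this smoothing effect, reusing the zero_error_density helper sketched earlier (this probes the criterion on two fixed error samples rather than the actual weight-space surface):

```python
import numpy as np

rng = np.random.default_rng(0)
e_concentrated = 0.05 * rng.normal(size=(200, 2))   # errors tightly packed around zero
e_spread = 1.0 * rng.normal(size=(200, 2))          # widely spread errors

for h in (0.1, 1.0, 100.0):
    ratio = zero_error_density(e_concentrated, h) / zero_error_density(e_spread, h)
    print(f"h = {h:6.1f}: f_hat(0) ratio (concentrated / spread) = {ratio:.4f}")

# For small h the criterion sharply prefers the concentrated errors; as h grows,
# the ratio tends to 1, i.e. the criterion flattens and can no longer
# distinguish between very different error distributions.
```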