
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 8, NO. 3, MAY 1997

On Convergence Properties of Pocket Algorithm


Marco Muselli

Abstract: The problem of finding optimal weights for a single threshold neuron starting from a general training set is
considered. Among the variety of possible learning techniques,
the pocket algorithm has a proper convergence theorem which
asserts its optimality. Unfortunately, the original proof ensures
the asymptotic achievement of an optimal weight vector only if the
inputs in the training set are integer or rational. This limitation
is overcome in this paper by introducing a different approach
that leads to the general result. Furthermore, a modified version
of the learning method considered, called pocket algorithm with
ratchet, is shown to obtain an optimal configuration within a finite
number of iterations independently of the given training set.
Index Terms: Convergence theorems, neural networks, optimal learning, perceptron algorithm, pocket algorithm, threshold
neuron.

I. INTRODUCTION

In recent years, the interest of the scientific community
in neural networks has increased considerably: the main
reason for this fact certainly lies in the successful results
obtained in several application fields. Speech recognition [1],
handwritten character reading [2], [3], and chaotic signal
prediction [4] are only three among the variety of problems
where neural networks have been shown to be an efficient
means of treating real-world data.
The robustness and reliability shown in these particular
applications are also supported by many theoretical results
which characterize neural networks as a promising tool for
automatic statistical analysis. In particular, the representation
ability of connectionist models has been well established:
every (Borel) measurable function can be approximated to
within an arbitrary precision by a feedforward neural network
with a single hidden layer containing a sufficient number of
units [5]-[7].
Furthermore, the application of risk minimization theory,
developed in the 1970s [8], [9], has allowed the univocal
definition of the concepts of learning and induction for neural
networks. The basic hypothesis of this analysis is the existence
of a noisy unknown function from which the input-output pairs
of the training set have been extracted. The learning process
attempts to estimate this underlying function starting from the
available samples.
Several induction principles can be introduced to control the
learning process: three possible choices have been proposed
by Vapnik [9]-[11]: the empirical risk minimization, the
structural risk minimization, and the local risk minimization.
Manuscript received August 9, 1995; revised August 12, 1996 and November 12, 1996.
The author is with the Istituto per i Circuiti Elettronici, Consiglio Nazionale
delle Ricerche, 16149 Genova, Italy.
Publisher Item Identifier S 1045-9227(97)02767-7.

Although these three principles are essentially different, their
computational kernel is based on the search for the weights
of a neural network (with fixed architecture) that minimize
the total error on the training set. In fact, it has been found
that performing this task allows the probability of giving a
good response to a new input pattern to be maximized, if
the Vapnik-Chervonenkis (VC) dimension of the resulting
configuration is controlled simultaneously.
However, such an optimization is very difficult to achieve
because of the complexity of the given cost function: the task
of deciding if a given function can be realized by a given
neural network is NP-complete [12], [13]. Nevertheless, the
use of local or global optimization algorithms, such as the
steepest descent method or the conjugate gradient method, has
allowed optimal or near-optimal results to be obtained in
many interesting cases [14], [15].
If we limit ourselves to considering the case of a single
neuron, in the literature we find some optimal algorithms
which minimize the cumulative error on the training set. When
we are facing a regression problem, in which the unknown
function is real-valued, the Widrow-Hoff rule or least mean
square (LMS) algorithm [15] is able to find the optimal
solution if the loss function for the computation of the error is
the usual mean square error (MSE). Since the convergence is
reached in an asymptotic way, at least in the worst case, the
LMS algorithm will be called (asymptotically) optimal.
The achievement of the desired solution is guaranteed by
the presence of a single local-global minimum in the cost
function, which turns out to be quadratic. So, a local optimization
method, like the steepest descent used by the LMS algorithm,
allows optimal weights for the neuron to be obtained. Different
techniques, such as the conjugate gradient methods [16], [17],
can reach the minimum point (at least theoretically) in a finite
number of steps: therefore we shall call them finite-optimal
algorithms.
In the complementary case of classification problems, in
which the output is binary-valued, the perceptron algorithm is
finite-optimal [18], [19] only if the training set is linearly separable; in this case there exists a weight vector which satisfies
all the input-output pairs of the training set. In the opposite
case, the perceptron algorithm cycles indefinitely in the set
of feasible weight vectors. Some alternatives or modifications
[20]-[22] have been proposed to stop the learning when the
training set is not linearly separable, but none of these is able
to provide the desired optimal solution with probability one.
An important possibility is offered by the pocket algorithm
[23], [24]: it runs perceptron learning iteratively and holds in
memory the weight vector which has remained unchanged for
the greatest number of iterations. A proper convergence
theorem ensures that the pocket algorithm is optimal if the


components of the input patterns in the training set are integer
or rational. In this case the probability that the saved weight
vector, called pocket vector, is optimal approaches one when
the number of iterations increases. This important theoretical
property has been widely employed to show the good qualities
of some constructive methods which use the pocket algorithm
as a basic component [25]-[27].
In classification problems a reasonable measure of error
is given by the number of input patterns of the training set
misclassified by the given neural network. Unfortunately, the
pocket algorithm does not ensure that the error associated
with the current pocket vector is smaller than that of previously
saved configurations. Thus, its convergence properties can be
improved if a proper check is executed before saving a new
weight vector; this version of the method is called pocket
algorithm with ratchet [24].
The aim of this paper is to extend the theoretical result
provided by the convergence theorem of Gallant [23]. In
particular a new approach allows the establishment of the
asymptotic optimality of the pocket algorithm for general
values of the input patterns in the training set. Furthermore,
a new theorem shows that the version with ratchet is finite-optimal, i.e., it reaches the desired configuration within a
finite number of iterations. An effective stopping criterion is
however not available, since the minimum error that can be
obtained by a threshold neuron on a given training set is not
known a priori.
The statement and proof of the two convergence theorems, contained in Section III, are preceded by a
formal definition of optimality that will be employed for a
theoretical analysis of the pocket algorithm (Section II).
II. OPTIMALITY DEFINITIONS
Let $\mathcal{N}$ denote a neural network with a fixed architecture, whose input-output transformation depends uniquely on a set of weights contained in a vector $\mathbf{w}$. Furthermore, let $E(\mathbf{w})$ be the cumulative error scored by $\mathcal{N}$ on a given training set $S$ containing a finite number $s$ of samples. The criteria followed for the computation of such an error do not depend on the particular training set $S$, but only on the kind of problem we are dealing with (classification, regression, etc.).

Suppose without loss of generality that $E(\mathbf{w}) \ge 0$ for every weight vector $\mathbf{w}$ and denote with $E^*$ the infimum of $E(\mathbf{w})$.
A learning algorithm for the neural network $\mathcal{N}$ provides at every iteration $t$ a weight vector $\mathbf{w}_t$ which depends on the particular rule chosen to minimize the cumulative error $E$. Thus we can introduce the following definitions, where the probability $P$ is defined over the possible vector sequences generated by the given procedure.

Definition 1: A learning algorithm for the neural network $\mathcal{N}$ is called (asymptotically) optimal if, for every $\varepsilon > 0$,

$$\lim_{t \to \infty} P\{E(\mathbf{w}_t) < E^* + \varepsilon\} = 1 \qquad (1)$$

independently of the given training set $S$.

Definition 2: A learning algorithm for the neural network $\mathcal{N}$ is called finite-optimal if there exists $t_0$ such that

$$P\{E(\mathbf{w}_t) = E^*\} = 1 \quad \text{for every } t \ge t_0 \qquad (2)$$

independently of the given training set $S$.


In the case of the pocket algorithm the neural network $\mathcal{N}$ is formed by a single threshold neuron whose input weights and bias are contained in the vector $\mathbf{w}$. The activation function for this neuron is bipolar and is given by

$$f(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases} \qquad (3)$$

having denoted with $z = \mathbf{w} \cdot \mathbf{x}$ the weighted sum of the neuron inputs $\mathbf{x}$. For the sake of simplicity we have also included the bias in the weight vector by adding a fixed component $x_0 = 1$ to every input pattern.
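Concretely, denoting by $d$ the number of neuron inputs (a symbol we introduce here only for illustration), the augmented vectors read

$$\mathbf{w} = (w_0, w_1, \ldots, w_d), \qquad \mathbf{x} = (1, x_1, \ldots, x_d), \qquad \mathbf{w} \cdot \mathbf{x} = w_0 + \sum_{j=1}^{d} w_j x_j$$

so that the bias $w_0$ is treated exactly like any other weight.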
To reach an optimal configuration the pocket algorithm repeatedly executes perceptron learning, whose basic iteration is formed by the following two steps.

1) A sample $(\mathbf{x}, y)$ is randomly chosen in the training set $S$ with uniform probability.
2) If the current threshold neuron does not satisfy this sample, its weight vector $\mathbf{w}$ is modified according to the following rule:

$$w_j \leftarrow w_j + y x_j \quad \text{for } j = 0, 1, \ldots, d. \qquad (4)$$

It can be shown that the substitution rule (4) changes the current neuron so as to satisfy (possibly after several weight changes) the input-output pair $(\mathbf{x}, y)$ chosen at Step 1).
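The interplay between this iteration and the pocket saving rule discussed in the sequel can be summarized in code. The following is a minimal sketch in Python with NumPy; the function name, the fixed iteration budget, and the initialization at the null vector are our own choices, not prescriptions of the paper.

import numpy as np

def pocket_algorithm(X, y, n_iters=10000, seed=0):
    # X: (s, d) array of input patterns; y: (s,) array of labels in {-1, +1}.
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])  # fixed component x_0 = 1 absorbs the bias
    w = np.zeros(Xb.shape[1])                  # current perceptron vector
    pocket_w = w.copy()                        # pocket vector held in memory
    run = best_run = 0                         # current and record run lengths
    for _ in range(n_iters):
        i = rng.integers(len(y))               # 1) uniform random choice of a sample
        out = 1.0 if Xb[i] @ w >= 0 else -1.0  # bipolar activation (3)
        if out == y[i]:
            run += 1                           # a permanence of w
            if run > best_run:                 # longest run observed so far:
                best_run = run
                pocket_w = w.copy()            # w becomes the new pocket vector
        else:
            w = w + y[i] * Xb[i]               # 2) perceptron rule (4)
            run = 1                            # a visit of the new configuration
    return pocket_w

On a linearly separable training set the updates eventually stop and the pocket vector satisfies all samples; on a nonseparable one, the returned vector is the configuration that survived the longest sequence of random consistency checks.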
It is known that the perceptron learning algorithm is finite-optimal when the training set is linearly separable. In this case there exists a threshold neuron that correctly satisfies all the samples contained in the training set $S$. In the opposite case the perceptron algorithm cycles indefinitely through feasible weight vectors without providing any (deterministic or probabilistic) information about their optimality. However, we can observe that a neuron which satisfies a high number of samples of the training set has a small probability of being modified at Step 2).

This consideration motivates the procedure followed by the pocket algorithm: it holds in memory the weight vector $\mathbf{w}^p$ which has remained unchanged for the greatest number of iterations during perceptron learning. Such a vector is called the pocket vector, and the corresponding neuron forms the output of the training.
To analyze the convergence properties of the pocket algorithm in a theoretical way we have to define a proper cumulative error $E(\mathbf{w})$. Since the output of the neuron is binary (3), every training set can be associated with a classification problem. Thus, a good measure of the error made by a given configuration is the number of misclassified samples of the training set $S$.

In this way, if the vector $\mathbf{w}$ correctly satisfies $k$ input-output pairs of the training set $S$, we have $E(\mathbf{w}) = s - k$. In particular, if $s^*$ is the number of samples successfully classified by an optimal vector, we obtain $E^* = s - s^*$. Consequently, the cumulative error $E(\mathbf{w})$ can assume at most $s^* + 1$ different values.
Now, let us denote with $\mathbf{w}_t$ the configuration generated by the perceptron algorithm at the $t$th iteration and with $W$ the set of the feasible weight vectors for the threshold neuron. Furthermore, let $S_{\mathbf{w}}$ be the subset of the training set that contains the samples correctly classified by $\mathbf{w}$. The following definitions will be useful.

Definition 3: If $\mathbf{w}_t \neq \mathbf{w}_{t-1}$ we shall say that a visit of the configuration $\mathbf{w}_t$ has occurred at the $t$th iteration of the perceptron algorithm. If $\mathbf{w}_t = \mathbf{w}_{t-1}$ we shall say that a permanence of the configuration $\mathbf{w}_t$ has occurred at iteration $t$.

A run of length $l$ for the weight vector $\mathbf{w}$ is then given by a visit of $\mathbf{w}$ followed by $l - 1$ permanencies. It can easily be seen that the probability of having a permanence for $\mathbf{w}$ is given by $|S_{\mathbf{w}}|/s$, where $|S_{\mathbf{w}}|$ is the cardinality of the set $S_{\mathbf{w}}$. On the contrary, the probability of having a visit for $\mathbf{w}$ is not known a priori.
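As a concrete illustration (the numbers are ours, not from the paper): if $s = 100$ and a vector $\mathbf{w}$ correctly classifies $|S_{\mathbf{w}}| = 90$ samples, each iteration yields a permanence for $\mathbf{w}$ with probability $0.9$, so a run of length $20$ requires $19$ consecutive permanencies after the visit and has probability $0.9^{19} \approx 0.135$; a better vector with $|S_{\mathbf{w}'}| = 95$ sustains the same run with probability $0.95^{19} \approx 0.377$. Long runs therefore increasingly favor configurations with small cumulative error.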
III. CONVERGENCE THEOREMS FOR THE POCKET ALGORITHM
The existence of a convergence theorem for the pocket
algorithm [23] which ensures the achievement of an optimal
configuration when the number of iterations increases indefinitely has been an important motivation for the employment
of this learning procedure. However, this theorem can be
applied only when the inputs in the training set are integer
or rational. A generalization to the real case is possible by
introducing a different approach which can be extended to
show the optimality of other similar learning methods.
Consider the $2^s$ subsets $S_j$ of the given training set $S$, with $j = 1, \ldots, 2^s$; each of these can be associated with a subset $W_j$ of feasible configurations such that for every $\mathbf{w} \in W_j$ we have $S_{\mathbf{w}} = S_j$. Naturally, if $S_j$ is not linearly separable the corresponding subset $W_j$ will be empty. In the same way, if $s^*$ is the number of samples in $S$ correctly classified by an optimal neuron, we obtain that $W_j$ is empty for every $j$ such that $|S_j| > s^*$.
By the perceptron convergence theorem [18], if the set $S_j$ is linearly separable, a number $\lambda_j$ of iterations exists such that, if $\lambda_j$ randomly selected samples in $S_j$ are subsequently presented to the perceptron algorithm, then a configuration $\mathbf{w} \in W_j$ is generated. Since the number of sets $S_j$ is finite, we can denote with $\Lambda$ the maximum value of $\lambda_j$ over the linearly separable $S_j$.
Furthermore, let us introduce the sets $W^i$, with $i = 0, 1, \ldots, s$, containing all the weight vectors $\mathbf{w}$ which correctly classify $i$ samples of the training set $S$:

$$W^i = \{\mathbf{w} \in W : |S_{\mathbf{w}}| = i\}.$$

It can easily be seen that $W^i$ is empty for every $i$ such that $s^* < i \le s$.
Consider the Markov chain whose states are given by the sets $W^i$ to which the configurations $\mathbf{w}_t$ generated by the perceptron algorithm at each iteration belong. As one can note, this chain has stationary transition probabilities, the values of which depend on the given training set $S$. Only the probability of having a permanence for a vector of $W^i$ can be determined a priori and is given by $i/s$.
At any iteration of the perceptron algorithm, a sample in $S$ is randomly chosen with uniform probability. If we call success the choice of an input-output pair in $S_j$ and failure the opposite event, we obtain a sequence of Bernoulli trials with probability $p = |S_j|/s$ of success.

Let $Q_n(r)$ denote the probability that the length of the longest run of successes in $n$ trials is less than $r$. The complexity of the exact expression for $Q_n(r)$ [28] prevents us from using it in the mathematical passages; however, useful bounds for $Q_n(r)$ can be obtained by some simple remarks.
Let us subdivide the sequence of $n$ trials of the given experiment into nonoverlapping subsets of length $r$; the maximum number of them is $\lfloor n/r \rfloor$, having denoted with $\lfloor x \rfloor$ the greatest integer not greater than $x$. If the original sequence does not contain a run of $r$ successes, none of these subsets can be formed by $r$ successes. Since the probability of this event is $1 - p^r$ and the subsets are stochastically independent, we obtain $Q_n(r) \le (1 - p^r)^{\lfloor n/r \rfloor}$.

To find a lower bound for $Q_n(r)$, let us denote with $u_i$, $i = 1, \ldots, n$, the $i$th outcome of the given experiment and consider the $n - r + 1$ (overlapping) subsequences of $r$ consecutive trials starting at $u_i$, $i = 1, \ldots, n - r + 1$. It can easily be seen that, if none of these $n - r + 1$ subsequences contains a run of $r$ successes, neither does the complete sequence. Thus, we could conclude that $Q_n(r) = (1 - p^r)^{n - r + 1}$ if the subsequences were stochastically independent (which is generally not true). However, this lower bound is always valid, as the following lemma asserts.

Lemma 1: If $n \ge r \ge 1$ and $0 \le p \le 1$, the following inequalities hold:

$$(1 - p^r)^{n - r + 1} \le Q_n(r) \le (1 - p^r)^{\lfloor n/r \rfloor}. \qquad (5)$$
Proof: From a direct analysis of the given binary experiment we obtain the particular expressions

$$Q_r(r) = 1 - p^r, \qquad Q_n(1) = (1 - p)^n \qquad (6)$$

consequently the inequalities (5) are verified for $n = r$ and for $r = 1$.

In the general case $n > r$ and $r > 1$, let us denote with $u_i$ the $i$th outcome of the given experiment and consider the events

$$A_i = \{u_i \text{ is a success}\}, \qquad B_{ij} = \{\text{a run of } r \text{ successes exists in the outcomes } u_i, \ldots, u_j\}$$

with $j - i + 1 \ge r$. By definition of the probability $Q_n(r)$ we obtain

$$Q_n(r) = 1 - P(B_{1n}). \qquad (7)$$

With these notations it can be directly seen that $\overline{B_{1n}} \subseteq \overline{B_{1,n-r}} \cap \overline{A_{n-r+1} \cap \cdots \cap A_n}$, having denoted with $\overline{B}$ the complement of the set $B$. But $\overline{B_{1,n-r}}$ and $\overline{A_{n-r+1} \cap \cdots \cap A_n}$ are two stochastically independent events, since they concern disjoint groups of trials; so we have from (7)

$$Q_n(r) \le (1 - p^r)\, Q_{n-r}(r) \qquad (8)$$

where we have used (6). By repeatedly applying (8) we obtain the upper bound for $Q_n(r)$ in (5), since $Q_m(r) \le Q_r(r) = 1 - p^r$ for every $m \ge r$.

The lower bound in (5) can be found by noting that $\overline{B_{1n}} = \overline{B_{2n}} \setminus (A_1 \cap \cdots \cap A_r \cap \overline{B_{2n}})$, since a run of $r$ successes which is present in $u_1, \ldots, u_n$ but not in $u_2, \ldots, u_n$ must begin at the first outcome; the application of (6) and (7) therefore gives

$$Q_n(r) = Q_{n-1}(r) - p^r + P(A_1 \cap \cdots \cap A_r \cap B_{2n}). \qquad (9)$$

But the set $A_1 \cap \cdots \cap A_r \cap B_{2n}$ contains all the sequences of trials of the given experiment which begin with $r$ successes and have at least a run of $r$ successes in the last $n - 1$ outcomes. Since it can easily be seen that setting the first $r$ outcomes of any sequence in $B_{2n}$ to successes produces a sequence which still belongs to $B_{2n}$, the following lower bound holds:

$$P(A_1 \cap \cdots \cap A_r \cap B_{2n}) \ge p^r P(B_{2n}). \qquad (10)$$

By substituting (10) in (9) we obtain

$$Q_n(r) \ge Q_{n-1}(r) - p^r + p^r \left(1 - Q_{n-1}(r)\right) = (1 - p^r)\, Q_{n-1}(r)$$

and after $n - r$ applications the lower bound in (5) follows from the first expression in (6).
This lemma gives a mathematical basis for the proof of the asymptotic optimality of the pocket algorithm.

Theorem 1 (Pocket Convergence Theorem): If the training set $S$ is finite, the pocket algorithm is (asymptotically) optimal, i.e., for every $\varepsilon > 0$

$$\lim_{t \to \infty} P\{E(\mathbf{w}^p_t) < E^* + \varepsilon\} = 1$$

where $\mathbf{w}^p_t$ is the pocket vector held in memory at iteration $t$.
Proof: Denote with $L_t$ the length of the longest run scored in the first $t$ iterations of the perceptron algorithm by the configurations belonging to the set $W^{s^*}$, which contains all the optimal vectors for the training set. We can find a subset $S_j$ of $S$ having cardinality $s^*$ for which the associated subset $W_j \subseteq W^{s^*}$ is not empty; since this $S_j$ is linearly separable, whenever the perceptron algorithm consecutively chooses $\Lambda$ samples in $S_j$, a run for a configuration $\mathbf{w} \in W_j$ is obtained. Because the cumulative error assumes a finite number of values, it is then sufficient to show that

$$\lim_{t \to \infty} P\{\mathbf{w}^p_t \in W^{s^*}\} = 1. \qquad (11)$$

To this end, let us denote with $q$ and $\bar{q}$ the probabilities of having a permanence for the weight vectors of $W_j$ and of the sets $W^i$ with $i < s^*$, respectively. Thus we have $q = s^*/s$ and $\bar{q} = i/s$ with $\bar{q} < q$.

Now, if $s^* = s$, the training set is linearly separable and the perceptron convergence theorem ensures the achievement of an optimal configuration within a finite number of iterations. The subsequent random choices of samples in the training set leave this optimal vector unchanged, since its permanence probability is unitary. Thus, (11) is verified, since at a given iteration an optimal configuration becomes the current pocket vector $\mathbf{w}^p_t$ and its number of permanencies will increase indefinitely.

In the case $s^* < s$, the probability that the pocket vector at iteration $t$ is not optimal can be bounded from above by decomposing this event according to the iteration at which the first run of length $\Lambda$ for a vector of $W_j$ begins (12). Let us employ again the term success to denote the random choice of an input-output pair in $S_j$; then we can define the following events:

$U_\tau$: at the $\tau$th iteration a success occurs;
$V_{\tau t}$: between the iterations $\tau$ and $t$ (included) there is not a run of $\Lambda$ successes;
$Z_\tau$: the first run with length $\Lambda$ for a vector $\mathbf{w} \in W_j$ begins at the $\tau$th iteration

with $1 \le \tau \le t - \Lambda + 1$. By using the previously given definition of the probability $Q_n(r)$ we obtain

$$P(V_{\tau t}) = Q_{t - \tau + 1}(\Lambda). \qquad (13)$$

With these notations an upper bound (14) can be written for the probability appearing in (12), since the event that no vector of $W_j$ scores a run of length $\Lambda$ in the first $t$ iterations is contained in the union of the events $Z_\tau$ and of the event $V_{1t}$.

Now, let us consider the event $Z_\tau$ and suppose a success occurs at a given iteration $i < \tau$. By denoting with $\bar{S}_j$ the complement of the set $S_j$ and exploiting the independence of the choices made at different iterations by the perceptron algorithm, an upper bound (15) for the conditional probability of the event $V_{\tau t}$ is obtained by supposing that a failure occurs at iteration $i$: as a matter of fact, if a group of iterations does not contain a run of $\Lambda$ successes, neither does a subset of them.

Consider then the situation (event $Z_\tau$) in which the first run with length $\Lambda$ for a weight vector $\mathbf{w} \in W_j$ begins at the $\tau$th iteration; this can be true only if samples in $S_j$ have been chosen at the iterations $\tau, \tau + 1, \ldots, \tau + \Lambda - 1$ and, at the iteration $\tau - 1$, an input-output pair not belonging to $S_j$ has been selected. Since no general conditions are given on the intersection between the sets $S_j$ and $S_{\mathbf{w}}$ for the other configurations $\mathbf{w}$, some samples chosen in $\bar{S}_j$ can also belong to $S_{\mathbf{w}}$. However, by the inequality (15), the probability of interest is increased if we suppose that at the iterations preceding $\tau$ we have obtained failures.

In a similar way, the visit of the vector $\mathbf{w}$ at iteration $\tau$ can influence the choices made by perceptron learning at the iterations $\tau + 1, \ldots, \tau + \Lambda - 1$; thus we obtain a further increase in the probability if we suppose that failures take place at these iterations. On the other hand, the first $\tau - 1$ random choices are independent of the visit of the configuration $\mathbf{w}$, since $\Lambda$ subsequent choices in $S_j$ are sufficient to reach $\mathbf{w}$ starting from any weight vector. Consequently, from (13) and the bounds in (5), together with the inequality $\bar{q} < q$, an upper bound (16) is obtained for the probability that the run beginning at iteration $\tau$ is later exceeded by the run of a nonoptimal configuration; in fact, this probability is surely not greater than the probability of not having $\Lambda$ consecutive choices in $S_j$ in the remaining iterations. By substituting (16) in (12) an upper bound (17) for the probability $P\{\mathbf{w}^p_t \notin W^{s^*}\}$ is obtained.

Now, let us verify that the right-hand side of (17) vanishes when the number of iterations $t$ increases indefinitely. To this end, let us break the summation above into two contributions, corresponding to the intervals (18) of values of the integer $\tau$; by the upper bound in (5) each contribution tends to zero as $t$ increases, so (11) is verified and the theorem follows.
Theorem 1 assures good convergence properties for the configuration generated by the pocket algorithm: the probability that the pocket vector is optimal increases with the execution time. Unfortunately, no useful theoretical evaluations of the number of iterations required to achieve a given value of this convergence probability exist.
We must also emphasize that the cumulative error $E(\mathbf{w}^p_t)$ associated with the pocket vector does not decrease monotonically with the number of iterations. In fact, in practical applications we observed that a new pocket vector often satisfies a smaller number of samples of the training set than the previously saved configuration.
To eliminate this undesired effect, a modified version of the learning method, called pocket algorithm with ratchet, has been proposed. It computes the corresponding error $E(\mathbf{w})$ at every possible saving of a new pocket vector and maintains the previous configuration when the number of misclassified samples does not decrease. In this way the stability of the algorithm is improved, as the following theorem asserts.
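In code, the ratchet amounts to one additional check at the saving step of the sketch given in Section II (again Python with NumPy; the helper name n_errors is ours):

import numpy as np

def n_errors(w, Xb, y):
    # Cumulative error E(w): number of misclassified samples.
    out = np.where(Xb @ w >= 0, 1.0, -1.0)
    return int(np.sum(out != y))

# Inside the training loop, the plain saving rule is replaced by the ratchet:
#     if run > best_run:
#         best_run = run
#         if n_errors(w, Xb, y) < n_errors(pocket_w, Xb, y):
#             pocket_w = w.copy()  # save only if the error strictly decreases

The extra comparison costs one pass over the training set at each candidate saving, which is the price paid for a monotonically decreasing pocket error.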
Theorem 2: If the training set $S$ is finite, the pocket algorithm with ratchet is finite-optimal, i.e., there exists an integer $t_0$ such that

$$P\{E(\mathbf{w}^p_t) = E^*\} = 1 \quad \text{for every } t \ge t_0.$$

Proof: From the procedure followed by the pocket algorithm with ratchet, we obtain that the cumulative error $E(\mathbf{w}^p_t)$ associated with the pocket vector decreases monotonically toward a minimum $\bar{E}$ corresponding to the last saved configuration $\bar{\mathbf{w}}$. Thus, if $s^*$ is the number of samples in the training set correctly classified by an optimal neuron, we can write

$$E(\mathbf{w}^p_t) \ge \bar{E} \ge E^* = s - s^* \quad \text{for every } t \ge \bar{t}$$

where $\bar{t}$ is the iteration at which $\bar{\mathbf{w}}$ is saved and $\bar{\mathbf{w}}$ is associated with a subset $S_{\bar{\mathbf{w}}}$ of the training set having cardinality $s - \bar{E}$.

Suppose now that $\bar{E} > E^*$, and let $l$ be the length of the run scored by $\bar{\mathbf{w}}$. The probability that $\mathbf{w}^p_t = \bar{\mathbf{w}}$ for every $t > \bar{t}$ is surely not greater than the probability that a sequence of $\Lambda + l$ consecutive choices of samples in $S_j$ does not take place after iteration $\bar{t}$: such a sequence would generate a run of length greater than $l$ for an optimal configuration $\mathbf{w} \in W_j$, whose error is smaller than $\bar{E}$, and would therefore force a new saving. Thus, by applying the definition of $Q_n(r)$ with $n = t - \bar{t}$ and $r = \Lambda + l$, we obtain

$$P\{\mathbf{w}^p_{t'} = \bar{\mathbf{w}} \text{ for every } \bar{t} < t' \le t\} \le Q_{t - \bar{t}}(\Lambda + l) \le (1 - p^{\Lambda + l})^{\lfloor (t - \bar{t})/(\Lambda + l) \rfloor}$$

having used the upper bound in (5) with success probability $p = s^*/s$. Since the right-hand side vanishes when $t$ increases indefinitely, with probability one the last saved configuration is optimal and the assertion follows.


IV. CONCLUSION
The interest in the pocket algorithm as a learning method
for neurons with hard threshold activation function is mainly
based on the presence of a proper convergence theorem, which
ensures the achievement of an optimal configuration when the
number of iterations increases indefinitely.
Unfortunately, the original proof of this theorem only holds if the inputs in the training set are integer or rational, and it does not allow us to conclude the optimality of the pocket algorithm in all possible situations. This work introduces an alternative approach which allows us to overcome the limitations of the original treatment. In particular, the convergence in probability to the optimal weight vector is theoretically shown independently of the given training set.
Furthermore, a second theorem asserts that the pocket algorithm with ratchet finds an optimal weight vector within a finite number of iterations with probability one. Consequently, this is surely the learning method to prefer when the size of the training set does not lead to a prohibitive execution time. On this point, it should be observed that the proofs contained in the previous section do not provide any information about the number of iterations required to reach an optimal or near-optimal weight vector. This fact prevents us from establishing a direct stopping criterion for the training with the pocket algorithm in practical situations.
ACKNOWLEDGMENT
The author wishes to thank Dr. S. Gallant for the fruitful
discussions and his valuable comments.
REFERENCES

[1] R. P. Lippmann, "Review of neural networks for speech recognition," Neural Computa., vol. 1, pp. 1-38, 1989.
[2] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computa., vol. 1, pp. 541-551, 1989.
[3] I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla, "Structural risk minimization for character recognition," in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 471-479.
[4] A. Lapedes and R. Farber, "How neural nets work," in Neural Information Processing Systems, D. Z. Anderson, Ed. New York: Amer. Inst. Physics, 1987, pp. 442-456.
[5] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr., Signals, Syst., vol. 2, pp. 303-314, 1989.
[6] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.
[7] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[8] V. N. Vapnik and A. Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory Probability Applicat., vol. 16, pp. 264-280, 1971.
[9] V. Vapnik, Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag, 1982.
[10] V. Vapnik, "Principles of risk minimization for learning theory," in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 831-838.
[11] V. Vapnik and L. Bottou, "Local algorithms for pattern recognition and dependencies estimation," Neural Computa., vol. 5, pp. 893-909, 1993.
[12] J. S. Judd, "Learning in networks is hard," in Proc. 1st Int. Conf. Neural Networks, San Diego, CA, 1987, pp. 685-692.
[13] A. Blum and R. L. Rivest, "Training a 3-node neural network is NP-complete," Neural Networks, vol. 5, pp. 117-127, 1992.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318-362.
[15] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Reading, MA: Addison-Wesley, 1991.
[16] D. G. Luenberger, Introduction to Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1984.
[17] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C. Cambridge, U.K.: Cambridge Univ. Press, 1988.
[18] F. Rosenblatt, Principles of Neurodynamics. Washington, D.C.: Spartan, 1961.
[19] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press, 1969.
[20] Y.-C. Ho and R. L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Trans. Electron. Computers, vol. EC-14, pp. 683-688, 1965.
[21] L. G. Khachiyan, "A polynomial algorithm in linear programming," Sov. Math. Doklady, vol. 20, pp. 191-194, 1979.
[22] A. J. Mansfield, "Comparison of perceptron training by linear programming and by the perceptron convergence procedure," in Proc. Int. Joint Conf. Neural Networks, Seattle, WA, 1991, pp. II-25-II-30.
[23] S. I. Gallant, "Perceptron-based learning algorithms," IEEE Trans. Neural Networks, vol. 1, pp. 179-191, 1990.
[24] S. I. Gallant, Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press, 1993.
[25] M. Mezard and J.-P. Nadal, "Learning in feedforward layered networks: The tiling algorithm," J. Phys. A, vol. 22, pp. 2191-2203, 1989.
[26] M. Frean, "The upstart algorithm: A method for constructing and training feedforward neural networks," Neural Computa., vol. 2, pp. 198-209, 1990.
[27] M. Muselli, "On sequential construction of binary neural networks," IEEE Trans. Neural Networks, vol. 6, pp. 678-690, 1995.
[28] M. Muselli, "Simple expressions for success run distributions in Bernoulli trials," to appear in Statist. Probability Lett., 1996.

Marco Muselli was born in 1962. He received the M.S. degree in electronic engineering from the University of Genoa, Italy, in 1985.
He is currently a Researcher at the Istituto per i
Circuiti Elettronici of CNR in Genoa. His research
interests include neural networks, global optimization, genetic algorithms, and characterization of
nonlinear systems.
