On Convergence Properties of Pocket Algorithm
Marco Muselli
Abstract: The problem of finding optimal weights for a single threshold neuron starting from a general training set is
considered. Among the variety of possible learning techniques,
the pocket algorithm has a proper convergence theorem which
asserts its optimality. Unfortunately, the original proof ensures
the asymptotic achievement of an optimal weight vector only if the
inputs in the training set are integer or rational. This limitation
is overcome in this paper by introducing a different approach
that leads to the general result. Furthermore, a modified version
of the learning method considered, called pocket algorithm with
ratchet, is shown to obtain an optimal configuration within a finite
number of iterations independently of the given training set.
Index Terms: Convergence theorems, neural networks, optimal learning, perceptron algorithm, pocket algorithm, threshold neuron.
I. INTRODUCTION
The cumulative error of a configuration, i.e., the number of training samples it misclassifies, can assume at most $s + 1$ different values, $s$ being the number of samples in the training set $S$.

Now, let us denote with $w_t$ the configuration generated by the perceptron algorithm at the $t$-th iteration and with $W$ the set of the feasible weight vectors for the threshold neuron. Furthermore, let $S_w$ be the subset of the training set that contains the samples correctly classified by the configuration $w$.
The following definitions will be useful.
Definition 3: If $w_{t-1} \neq w_t = w$, we shall say that a visit of the configuration $w$ has occurred at the $t$-th iteration of the perceptron algorithm. If $w_{t-1} = w_t = w$, we shall say that a permanence of the configuration $w$ has occurred at iteration $t$.

A run of length $r$ for the weight vector $w$ is then given by a visit of $w$ followed by $r$ permanencies. It can easily be seen that the probability of having a permanence for $w$ is given by $|S_w|/s$, where $|S_w|$ is the cardinality of the set $S_w$. On the contrary, the probability of having a visit for $w$ is not known a priori.
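In terms of these definitions, the pocket algorithm stores the configuration that has produced the longest run observed so far; in its ratchet variant a new pocket vector is accepted only if it also misclassifies fewer training samples than the stored one. The following is a minimal Python sketch of this procedure, assuming bipolar (+1/-1) targets and a uniform random choice of samples at each iteration; the function and variable names are illustrative choices, not taken from the paper.

import numpy as np

def pocket_with_ratchet(X, y, n_iter=10000, seed=None):
    """Sketch of the pocket algorithm with ratchet for a single threshold
    neuron; X has shape (s, d), y contains +1/-1 targets."""
    rng = np.random.default_rng(seed)
    s, d = X.shape
    Xb = np.hstack([X, np.ones((s, 1))])          # absorb the bias term
    w = np.zeros(d + 1)                           # perceptron vector w_t
    pocket_w = w.copy()                           # stored (pocket) vector
    run, best_run = 0, 0                          # current and longest run
    best_correct = np.sum(np.sign(Xb @ pocket_w) == y)
    for _ in range(n_iter):
        i = rng.integers(s)                       # uniform random sample
        if np.sign(Xb[i] @ w) == y[i]:            # permanence: sample lies in S_w
            run += 1
            if run > best_run:
                correct = np.sum(np.sign(Xb @ w) == y)
                if correct > best_correct:        # ratchet: accept only if strictly better
                    pocket_w, best_run, best_correct = w.copy(), run, correct
        else:                                     # failure: perceptron update, visit of a new configuration
            w = w + y[i] * Xb[i]
            run = 0
    return pocket_w

In this sketch a permanence corresponds to the branch in which the randomly chosen sample belongs to $S_w$, while a visit corresponds to the branch performing the perceptron update; removing the check on the number of correctly classified samples would give the plain pocket algorithm.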
III. CONVERGENCE THEOREMS FOR THE POCKET ALGORITHM
The existence of a convergence theorem for the pocket
algorithm [23] which ensures the achievement of an optimal
configuration when the number of iterations increases indefinitely has been an important motivation for the employment
of this learning procedure. However, this theorem can be
applied only when the inputs in the training set are integer
or rational. A generalization to the real case is possible by
introducing a different approach which can be extended to
show the optimality of other similar learning methods.
Consider the $2^s$ subsets $S_j$ of the given training set $S$, with $j = 1, \ldots, 2^s$; each of these can be associated with a subset $W_j$ of feasible configurations such that for every $w \in W_j$ we have $S_w = S_j$. Naturally, if $S_j$ is not linearly separable, the corresponding subset $W_j$ will be empty. In the same way, if $m$ is the number of samples in $S$ correctly classified by an optimal neuron, we obtain that $W_j$ is empty for every $j$ such that $|S_j| > m$.
For the perceptron convergence theorem [18], if the set $S_j$ is linearly separable, a number $t_j$ of iterations exists such that, if $t_j$ randomly selected samples in $S_j$ are subsequently presented to the perceptron algorithm, then a configuration $w \in W_j$ is generated. Since the number of sets $S_j$ is finite, we can denote with $t^*$ the maximum value of $t_j$ for $j = 1, \ldots, 2^s$.

Furthermore, let us introduce the sets $V_k$, with $k = 0, 1, \ldots, s$, containing all the weight vectors $w$ which correctly classify exactly $k$ samples of the training set $S$. For every $w \in V_k$ the probability of having a permanence can be determined a priori and is given by $k/s$.
At any iteration of the perceptron algorithm, a sample in $S$ is randomly chosen with uniform probability. If we call success the choice of an input-output pair in $S_w$ and failure the opposite event, we obtain a sequence of Bernoulli trials with probability $p = k/s$ of success.
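As a simple numerical illustration (the figures are chosen here only for concreteness), if a configuration $w \in V_k$ correctly classifies $k = 7$ of $s = 10$ samples, each iteration is a Bernoulli trial with success probability

$p = k/s = 0.7, \qquad p^5 = 0.7^5 \approx 0.168$

so the probability that five prescribed consecutive iterations all produce permanencies of $w$ is about $0.17$.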
Let $Q_n(r)$ denote the probability that the length of the longest run of successes in $n$ trials is less than $r$. The complexity of the exact expression for $Q_n(r)$ [28] prevents us from using it in the mathematical passages; however, useful bounds for $Q_n(r)$ can be obtained by some simple remarks.
Let us subdivide the sequence of $n$ trials of the given experiment into nonoverlapping subsets of length $r$; the maximum number of them is $\lfloor n/r \rfloor$, having denoted with $\lfloor x \rfloor$ the greatest integer not greater than $x$. If the original sequence does not contain a run of $r$ successes, none of these subsets can be formed by $r$ successes. Since the probability of this event is $p^r$ and the subsets are stochastically independent, we obtain $Q_n(r) \le (1 - p^r)^{\lfloor n/r \rfloor}$.
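For instance, with the values used above ($p = 0.7$, $r = 5$) and $n = 100$ trials, the bound involves $\lfloor 100/5 \rfloor = 20$ nonoverlapping blocks:

$Q_{100}(5) \le (1 - 0.7^5)^{20} = (0.832)^{20} \approx 0.025$

that is, with success probability $0.7$ a run of five consecutive successes fails to appear in $100$ trials with probability at most about $2.5\%$.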
To find a lower bound for $Q_n(r)$, let us denote with $x_i$, $i = 1, \ldots, n$, the $i$-th outcome of the given experiment and consider the $n - r + 1$ (overlapping) subsequences of $r$ consecutive trials starting at $x_i$, $i = 1, \ldots, n - r + 1$. It can easily be seen that, if none of these $n - r + 1$ subsequences contains a run of $r$ successes, neither does the complete sequence. Thus, we could conclude that $Q_n(r) \ge (1 - p^r)^{n - r + 1}$ if the $n - r + 1$ subsequences were stochastically independent (which is generally not true). However, this lower bound is always valid, as the following lemma asserts.
Lemma 1: If $0 \le p \le 1$ and $1 \le r \le n$, the following inequalities hold:

$(1 - p^r)^{n - r + 1} \le Q_n(r) \le (1 - p^r)^{\lfloor n/r \rfloor}$   (5)
Proof: From a direct analysis of the given binary experiment we obtain the particular expressions

$Q_n(1) = (1 - p)^n$ and $Q_n(n) = 1 - p^n$   (6)

consequently, the inequalities (5) are verified for $r = 1$ and $r = n$.
In the general case $1 < r < n$, let us denote with $x_t$, $t = 1, \ldots, n$, the $t$-th outcome of the given experiment and consider the events $E_t = \{x_t \text{ is a success}\}$ and $A_{i,j} = \{\text{a run of } r \text{ successes exists in the outcomes } x_i, \ldots, x_j\}$, with probability $P(E_t) = p$. By definition of $Q_n(r)$ we obtain

$Q_n(r) = 1 - P(A_{1,n})$   (7)

With these notations it can be directly seen that $\bar{A}_{1,n} \subseteq \bar{A}_{1,r} \cap \bar{A}_{r+1,n}$, having denoted with $\bar{A}$ the complement of the set $A$. But $\bar{A}_{1,r}$ and $\bar{A}_{r+1,n}$ are two stochastically independent events; hence we have

$Q_n(r) \le (1 - p^r)\, Q_{n-r}(r)$   (8)

where we have used (6). By repeatedly applying (8) we obtain the upper bound for $Q_n(r)$ in (5), since $Q_m(r) = 1$ for $m < r$.
The lower bound in (5) can be found by noting that the application of (6) and (7) gives

(9)

But the set appearing in (9) contains all the sequences of trials of the given experiment which begin with $r$ successes and have at least a run of $r$ successes in the remaining outcomes. Thus, the following lower bound for its probability holds:

(10)

since it can easily be seen that the corresponding event is contained in it. By substituting (10) in (9), and after repeated applications, the lower bound in (5) is obtained.
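The bounds in (5) lend themselves to a direct numerical check. The following minimal Python sketch (the function name and the choice of $n$, $r$, and $p$ are illustrative) computes $Q_n(r)$ exactly by tracking the length of the trailing run of successes and compares it with the two bounds of Lemma 1.

def longest_run_lt(n, r, p):
    """Probability Q_n(r) that the longest success run in n Bernoulli(p)
    trials is shorter than r, computed by dynamic programming."""
    # dist[j] = P(trailing run of successes has length j and no run of r has occurred yet)
    dist = [0.0] * r
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * r
        for j, q in enumerate(dist):
            new[0] += q * (1.0 - p)      # a failure resets the trailing run
            if j + 1 < r:
                new[j + 1] += q * p      # a success extends it; reaching r is absorbed (a run has occurred)
        dist = new
    return sum(dist)

if __name__ == "__main__":
    n, r, p = 100, 5, 0.7
    q = longest_run_lt(n, r, p)
    lower = (1.0 - p**r) ** (n - r + 1)  # lower bound in (5)
    upper = (1.0 - p**r) ** (n // r)     # upper bound in (5)
    print(lower, q, upper)               # prints a nondecreasing triple, as Lemma 1 asserts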
(15)
Consider the situation in which the first run of length $r$ for a weight vector $w$ begins at the $t$-th iteration; this can be true only if samples in $S_w$ have been chosen at the iterations $t, t+1, \ldots, t+r-1$ and at the iteration $t-1$ an input-output pair not belonging to $S_w$ has been selected. Since no general conditions are given on the intersection between the sets $S_w$ associated with different configurations, some samples chosen for one of them can also belong to another. However, in the inequality (15) the probability of this event is increased by supposing that at the iterations preceding the run we have obtained failures.

In a similar way, the visit of the vector $w$ at iteration $t$ can influence the choices made by perceptron learning at the iterations $t+1, t+2, \ldots$. Thus we can obtain a further increase in the probability if we suppose that failures take place at these iterations. On the other hand, the first $t - t^*$ random choices are independent of the visit of the configuration $w$, since $t^*$ subsequent choices in the corresponding set $S_j$ are sufficient to reach $w$ starting from any weight vector.
Consequently, we obtain from (13) and (5)

(16)

By substituting (16) in (12), an upper bound for the probability that the pocket vector is not optimal is obtained, for which we have

(17)

Now, let us verify that the right-hand side of (17) vanishes when the number of iterations increases indefinitely. To this end, let us break the summation above into two contributions
Theorem 1 assures good convergence properties for the configuration generated by the pocket algorithm: the probability
that the pocket vector is optimal increases with the execution
time. Unfortunately, no useful theoretical evaluations on the