INFORMATION SCIENCES
An International Journal
ELSEVIER
Information Sciences 109 (1998) 165-184
Simulated annealing based pattern classification

Sanghamitra Bandyopadhyay a, Sankar K. Pal a,*, C.A. Murthy b,1

a Machine Intelligence Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Calcutta 700 035, India
b Department of Statistics, 326 Thomas Building, Pennsylvania State University, University Park, PA 16802-111, USA

Received 11 February 1997; received in revised form 20 September 1997; accepted 18 November 1997
Abstract
A method is described for finding decision boundaries, approximated by piecewise linear segments, for classifying patterns in $\mathbb{R}^N$, $N \geq 2$, using Simulated Annealing (SA). It involves the generation and placement of a set of hyperplanes (represented by strings) in the feature space that yields minimum misclassification. Theoretical analysis shows that as the size of the training data set approaches infinity, the boundary provided by the SA based classifier approaches the Bayes boundary. The effectiveness of the classification methodology, along with the generalization ability of the decision boundary, is demonstrated for both artificial data and real life data sets having non-linear/overlapping class boundaries. Results are compared extensively with those of the Bayes classifier, the k-NN rule, the multilayer perceptron, and Genetic Algorithms, another popular evolutionary technique. Empirical verification of the theoretical claim is also provided. © 1998 Elsevier Science Inc. All rights reserved.
Keywords: Pattern recognition; Evolutionary computation; Genetic algorithm; Bayes
error probability
*Corresponding author. Tel.: +91 33 577 8085 3100; fax: +91 33 556 6680/6925; e-mail:
[email protected].
On leave from Indian Statistical Institute.
1. Introduction
Simulated Annealing (SA) [1-4] belongs to the class of local search algorithms. It utilizes the principles of statistical mechanics, regarding the behaviour of a large number of atoms at low temperature, for finding minimal cost solutions to large optimization problems by minimizing the associated energy. Let E(q, T) be the energy at temperature T when the system is in state q. Let a new state s be generated. Then state s is accepted in favour of state q with probability

$$p_{qs} = \frac{1}{1 + \exp\left(-\frac{E(q,T) - E(s,T)}{T}\right)}.$$
In statistical mechanics investigating the ground states or low energy states
of matter is of fundamental importance. These states are achieved at very low
temperature. However, it is not sufficient to lower the temperature alone since
this results in unstable states. In the annealing process, the temperature is first
raised, then decreased gradually to a very low value (Tmin), while ensuring that
one spends sufficient time at each temperature value. This process yields stable
low energy states.
Pattern classification can be viewed as a problem of search and placement of
a number, H, of hyperplanes (fixed a priori) which can model the decision
boundary of the given data set appropriately. The criterion to be minimized
is the number of samples of the given training data that are misclassified for
a particular arrangement of the H hyperplanes. The arrangement of hyperplanes that minimizes the number of misclassified data points is considered
to provide the decision boundary of the given training data set.
The present article describes a methodology demonstrating the searching
ability of SA for finding an appropriate arrangement of H hyperplanes that
minimizes the number of misclassified points. The effectiveness of the classifier
has been adequately established for several artificial and real life data sets with
both overlapping and non-overlapping class boundaries. The results are also
compared with a similar approach [5] based on genetic algorithms (GA)
[4,6], Bayes maximum likelihood classifier, k-NN rule [7] and multilayered perceptron (MLP) [8].
Besides, a theoretical analysis along with an empirical verification is presented which shows that, as the size of the training data set goes to infinity, the SA based classifier (or SA classifier) provides an error rate on the training data which is less than or equal to the Bayes error probability. (In this regard it may be mentioned that the Bayes maximum likelihood classifier [7] is one of the most widely used statistical pattern classifiers, providing optimal performance from the standpoint of error probability in a statistical framework. It is known to be the best classifier when the class distributions and the a priori probabilities are known. Consequently, a desirable property of any classifier is that it should approximate or approach the Bayes classifier under limiting conditions.)
A brief discussion of the principles of SA is first presented in Section 2. This is followed by a detailed description of the SA classifier in Section 3. The theoretical analysis is provided in Section 4, followed by the implementation results in Section 5. Finally, the discussion and conclusions are presented in the last section.
2. Simulated annealing: Basic principles
In the recent past, application of techniques having physical or natural correspondence for solving difficult optimization problems has received widespread attention. It has been found that these techniques consistently
outperform classical methods like gradient descent search when the search
space is large, complex and multimodal. SA is one such paradigm having its
foundation in statistical mechanics, which studies the behaviour of a very large
system of interacting components in thermal equilibrium.
In statistical mechanics, if the system is in thermal equilibrium, the probability $\pi_T(s)$ that the system is in state s, $s \in S$, S being the state space, at temperature T is given by

$$\pi_T(s) = \frac{e^{-E(s)/kT}}{\sum_{w \in S} e^{-E(w)/kT}}, \qquad (1)$$

where k is Boltzmann's constant and E(s) is the energy of the system in state s.
Metropolis et al. [9] developed a technique to simulate the behaviour of a system in thermal equilibrium at temperature T as follows. Let the system be in state q at time t. Then the probability p that it will be in state s at time t + 1 is given by

$$p = \exp\left(-\frac{E(s) - E(q)}{kT}\right). \qquad (2)$$

If the energy of the system in state s is less than that in state q, then p > 1 and the state s is automatically accepted. Otherwise it is accepted with probability p. Thus it is also possible to attain states with higher energy values. It can be shown that as $t \to \infty$, the probability that the system is in state s is given by $\pi_T(s)$, irrespective of the starting configuration [10].
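As an illustration (our sketch, not part of the original paper), the Metropolis acceptance test of Eq. (2) can be written as a small Python routine; the Boltzmann constant is set to 1 here, as is customary when the energy scale is arbitrary.

    import math
    import random

    def metropolis_accept(e_q, e_s, t, k_b=1.0):
        """Accept a move from state q (energy e_q) to state s (energy e_s)
        at temperature t using the Metropolis criterion of Eq. (2)."""
        if e_s < e_q:
            return True                      # downhill moves are always accepted
        p = math.exp(-(e_s - e_q) / (k_b * t))
        return random.random() < p           # uphill moves accepted with probability p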
When dealing with a system of particles, it is important to investigate very low energy states, which predominate at extremely low temperatures. To achieve such states, it is not sufficient merely to lower the temperature. An annealing schedule is used, where the temperature is first increased and then decreased gradually, spending enough time at each temperature in order to reach thermal equilibrium.
In this article we have used the annealing process of the Boltzmann machine, which is a variant of the Metropolis algorithm. Here, at a given temperature T, the new state s is chosen over the current state q with probability

$$p_{qs} = \frac{1}{1 + \exp\left(-\frac{E(q,T) - E(s,T)}{T}\right)}.$$
The parameters of the search space are encoded in the form of a bit string of fixed length. The objective value associated with a string is computed and mapped to its energy. The string with the minimum energy value provides the solution to the problem. The initial string (say q) of 0s and 1s is generated randomly and its energy value is computed. Keeping the initial temperature high (say T = Tmax), a neighbour of the string (say s) is generated by randomly flipping one bit. The energy of the new string is computed, and it is accepted in favour of q with the probability p_qs mentioned earlier. This process is repeated a number of times (say k) keeping the temperature constant. Then the temperature is decreased using the equation T = rT, where 0 < r < 1, and the k loops, as earlier, are executed. This process is continued till a minimum temperature (say Tmin) is attained. The simulated annealing steps are shown in Fig. 1.
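To make this procedure concrete, the following minimal Python sketch of the loop shown in Fig. 1 below (written for this presentation, not taken from the authors' implementation) leaves the energy function as a caller-supplied argument; the parameters t_max, t_min, r and k mirror Tmax, Tmin, r and k of the text.

    import math
    import random

    def simulated_annealing(energy, length, t_max=100.0, t_min=0.01, r=0.9, k=100):
        """Generic SA loop: binary string state, single-bit mutation, and the
        Boltzmann-machine acceptance probability 1 / (1 + exp(-(E(q) - E(s)) / T))."""
        q = [random.randint(0, 1) for _ in range(length)]
        e_q = energy(q)
        t = t_max
        while t >= t_min:
            for _ in range(k):                       # k moves at each temperature
                s = q[:]
                s[random.randrange(length)] ^= 1     # flip one randomly chosen bit
                e_s = energy(s)
                d = (e_q - e_s) / t
                p = 1.0 / (1.0 + math.exp(-d)) if d > -700.0 else 0.0
                if random.random() < p:              # accept s in favour of q
                    q, e_q = s, e_s
            t *= r                                   # cool down: T = rT, 0 < r < 1
        return q, e_q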
Begin
    Generate the initial string q randomly
    T = Tmax
    Let E(q, T) be the associated energy
    while (T >= Tmin)
        for i = 1 to k
            Mutate (flip) a random position in q to yield s
            Let E(s, T) be the associated energy
            Set q <- s with probability 1 / (1 + e^{-(E(q,T) - E(s,T))/T})
        end for
        T = rT
    end while
    Decode the string q to provide the solution of the problem.
End

Fig. 1. Steps of simulated annealing.

3. Description of the SA classifier

The correspondence between the physical picture underlying SA and an optimization problem is as follows: the parameters of the search space (in this case the H hyperplanes) are encoded in strings (usually binary), and these represent the different states; low energy states correspond to near-optimal solutions (i.e., arrangements of the hyperplanes that provide minimum misclassification); the energy corresponds to the objective function (the number of misclassified samples); and the temperature is a controlling parameter of the system. The important tasks here are to establish a way of representing and generating different configurations (or states) of the problem, and an annealing schedule. These are now discussed in detail.
3.1. State/hyperplane representation
In this article, a binary string of length l is used to encode the parameters of the H hyperplanes. From elementary geometry, the equation of a hyperplane in N-dimensional space $(X_1, X_2, \ldots, X_N)$ is given by

$$x_N \cos\alpha_{N-1} + \beta_{N-1} \sin\alpha_{N-1} = d, \qquad (3)$$

where
$$\beta_{N-1} = x_{N-1} \cos\alpha_{N-2} + \beta_{N-2} \sin\alpha_{N-2},$$
$$\beta_{N-2} = x_{N-2} \cos\alpha_{N-3} + \beta_{N-3} \sin\alpha_{N-3},$$
$$\vdots$$
$$\beta_1 = x_1 \cos\alpha_0 + \beta_0 \sin\alpha_0.$$

The various parameters are as follows: $X_i$ is the ith feature of the training points; $(x_1, x_2, \ldots, x_N)$ is a point on the hyperplane; $\alpha_{N-1}$ is the angle that the unit normal to the hyperplane makes with the $X_N$ axis; $\alpha_{N-2}$ is the angle that the projection of the normal in the $(X_1, \ldots, X_{N-1})$ space makes with the $X_{N-1}$ axis; and so on; $\alpha_1$ is the angle that the projection of the normal in the $(X_1, X_2)$ plane makes with the $X_2$ axis; $\alpha_0$ is the angle that the projection of the normal in the $(X_1)$ space makes with the $X_1$ axis, which is 0, so that $\beta_0 \sin\alpha_0 = 0$; and d is the perpendicular distance of the hyperplane from the origin. Thus the N-tuple $(\alpha_1, \alpha_2, \ldots, \alpha_{N-1}, d)$ specifies a hyperplane in N-dimensional space.
Each angle $\alpha_j$, $j = 1, 2, \ldots, N-1$, is allowed to vary in the range $[0, 2\pi]$. If $b_1$ bits are used to represent an angle, then the possible values of $\alpha_j$ are
$$0,\ \delta\,2\pi,\ 2\delta\,2\pi,\ 3\delta\,2\pi,\ \ldots,\ (2^{b_1} - 1)\,\delta\,2\pi,$$
where $\delta = 1/2^{b_1}$. Consequently, if the $b_1$ bits contain a binary string having the decimal value $v_1$, then the angle is given by $v_1\,\delta\,2\pi$.
Once the angles are fixed, the orientation of the hyperplane becomes fixed, and only d must be specified in order to specify the hyperplane. For this purpose the hyperrectangle enclosing the training points is considered. Let $(x_i^{\min}, x_i^{\max})$ be the minimum and maximum values of feature $X_i$ as obtained from the training points. Then the vertices of the enclosing hyperrectangle are given by
$$(x_1^{ch_1}, x_2^{ch_2}, \ldots, x_N^{ch_N}),$$
where each $ch_i$, $i = 1, 2, \ldots, N$, can be either max or min. (Note that there will be $2^N$ vertices.) Let diag be the length of the diagonal of this hyperrectangle, given by
$$diag = \sqrt{(x_1^{\max} - x_1^{\min})^2 + (x_2^{\max} - x_2^{\min})^2 + \cdots + (x_N^{\max} - x_N^{\min})^2}.$$
A hyperplane is designated as the base hyperplane with respect to a given orientation (i.e., for some $\alpha_1, \alpha_2, \ldots, \alpha_{N-1}$) if:
(i) it has the same orientation,
(ii) it passes through one of the vertices of the enclosing hyperrectangle,
(iii) its perpendicular distance from the origin is minimum among the hyperplanes passing through the other vertices. Let this distance be $d_{\min}$.
If $b_2$ bits are used to represent d, then a value of $v_2$ in these bits represents a hyperplane with the given orientation and for which d is given by $d_{\min} + (diag/2^{b_2})\,v_2$. Thus a string is of a fixed length $l = H((N-1)b_1 + b_2)$, where H is the number of hyperplanes. The initial string is generated randomly. Note that we have used this recursive form of representation over the classical one, viz. $l_1 x_1 + l_2 x_2 + \cdots + l_N x_N = d$, where $l_1, \ldots, l_N$ are known as the direction cosines. The latter representation involves a constraint equation, $l_1^2 + l_2^2 + \cdots + l_N^2 = 1$. This, in turn, leads to the complicated issue of invalid or unacceptable solutions when the constraint equation is violated. The representation that we have chosen avoids this problem by being unconstrained in nature.
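As an illustration of this encoding (a sketch based on our own reading of the bit layout, not the authors' code), the following Python function decodes one hyperplane's $(N-1)b_1 + b_2$ bits into its unit normal and offset d, given the per-feature bounds of the training data. The treatment of the base-hyperplane offset via signed projections is a simplification made for this sketch.

    import itertools
    import math

    def decode_hyperplane(bits, n_dim, b1, b2, x_min, x_max):
        """Decode one hyperplane's (n_dim - 1)*b1 + b2 bits into (unit normal, d)."""
        def field(lo, width):
            return int("".join(str(b) for b in bits[lo:lo + width]), 2)

        # Angles alpha_1 .. alpha_{N-1}, each quantized in steps of 2*pi / 2**b1.
        alphas = [field(j * b1, b1) * 2.0 * math.pi / (2 ** b1)
                  for j in range(n_dim - 1)]

        # Unit normal implied by the recursive form of Eq. (3):
        #   l_i = cos(alpha_{i-1}) * prod_{m=i}^{N-1} sin(alpha_m), with alpha_0 = 0.
        normal = []
        for i in range(1, n_dim + 1):
            comp = 1.0 if i == 1 else math.cos(alphas[i - 2])
            for m in range(i, n_dim):
                comp *= math.sin(alphas[m - 1])
            normal.append(comp)

        # Offset d: d_min plus v2 steps of size diag / 2**b2.
        diag = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(x_min, x_max)))
        corners = itertools.product(*zip(x_min, x_max))      # the 2**N vertices
        d_min = min(sum(n * c for n, c in zip(normal, corner)) for corner in corners)
        v2 = field((n_dim - 1) * b1, b2)
        return normal, d_min + (diag / (2 ** b2)) * v2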
3.2. Energy/objective value computation

A string encodes the parameters of H hyperplanes as described earlier. Using these parameters, the region in which each training pattern point lies is determined from Eq. (3). A region is said to provide the demarcation for class i if the maximum number of points lying in that region belong to class i. The other points lying in the region are considered to be misclassified. The misclassifications associated with all the regions (for these H hyperplanes) are summed to provide the total misclassification, miss, for the string, which represents its energy.
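The following Python sketch (again our illustration) counts miss for a decoded string: the sign pattern of a point with respect to the H hyperplanes identifies its region, each region is labelled with its majority class, and the remaining points in the region are counted as misclassified.

    from collections import Counter, defaultdict

    def misclassification(hyperplanes, X, y):
        """Energy of a decoded string.

        hyperplanes: list of (normal, d) pairs; X: feature vectors; y: class labels.
        """
        regions = defaultdict(list)
        for features, label in zip(X, y):
            signs = tuple(
                sum(n_i * x_i for n_i, x_i in zip(normal, features)) >= d
                for normal, d in hyperplanes
            )
            regions[signs].append(label)
        miss = 0
        for labels in regions.values():
            majority = Counter(labels).most_common(1)[0][1]   # size of majority class
            miss += len(labels) - majority                    # everything else is misclassified
        return miss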
3.3. New state generation process and annealing schedule
For generating a new configuration, one (or more) random position(s) in the bit string is chosen and flipped. This provides a new string, whose energy is computed in the manner described above.

As already mentioned, the crucial task here is the attainment of low energy states, which are obtained at very low temperatures. If the temperature is decreased too quickly, the low energy states tend to be unstable. In order to reach stable states, the temperature must initially be increased and then decreased gradually, allowing sufficient time at each temperature. This process is known as annealing. To simulate it, the temperature is initially kept high (= Tmax). A parameter k is used to control the time spent at each temperature value. The temperature is decreased according to the formula T = rT, where 0 < r < 1; a higher value of r indicates a more gradual annealing schedule.

The different steps of the SA classifier are shown in Fig. 2. The process continues until either a string with no misclassified points is obtained (miss = 0) or a user specified minimum temperature value (= Tmin) is attained. The final string q at termination provides the solution to the problem.

Fig. 2. Steps of the SA classifier: starting from a randomly generated string q, a random position is mutated to yield s, and q is replaced by s with probability $1/(1 + e^{-(miss(q,T) - miss(s,T))/T})$; after k such moves the temperature is reduced (T = rT), and at termination the parameters encoded in the final string q are decoded.
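Putting the pieces together, a hypothetical driver for the SA classifier, built from the simulated_annealing, decode_hyperplane and misclassification sketches above, could look as follows. The per-parameter bit widths b1 and b2 are our own choice (the paper does not state them); the SA parameter values are those reported in Section 5.1.

    def sa_classifier(X, y, n_dim, H, b1=8, b2=8):
        """Hypothetical driver: search for H hyperplanes minimizing miss via SA."""
        x_min = [min(p[i] for p in X) for i in range(n_dim)]
        x_max = [max(p[i] for p in X) for i in range(n_dim)]
        per_plane = (n_dim - 1) * b1 + b2

        def energy(bits):
            planes = [decode_hyperplane(bits[h * per_plane:(h + 1) * per_plane],
                                        n_dim, b1, b2, x_min, x_max)
                      for h in range(H)]
            return misclassification(planes, X, y)

        best, miss = simulated_annealing(energy, H * per_plane,
                                         t_max=100.0, t_min=0.01, r=0.9, k=100)
        return best, miss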
4. Relationship with Bayes error probability
In this section we study the theoretical relationship between the SA classifier and the Bayes classifier in terms of their error probabilities. The mathematical notation and preliminary definitions are described first. This is followed by the claim that for $n \to \infty$ the performance of the SA classifier will in no way be worse than that of the Bayes classifier. Finally, some critical comments about the proof are made.
Let there be k classes $C_1, C_2, \ldots, C_k$ with a priori probabilities $P_1, P_2, \ldots, P_k$ and class conditional densities $p_1(x), p_2(x), \ldots, p_k(x)$. Let the mixture density be

$$p(x) = \sum_{i=1}^{k} P_i\, p_i(x). \qquad (4)$$

Let $X_1, X_2, \ldots, X_n, \ldots$ be independent and identically distributed (i.i.d.) N-dimensional random vectors with density p(x). This indicates that there is a probability space $(\Omega, \mathcal{F}, Q)$, where $\mathcal{F}$ is a $\sigma$-field of subsets of $\Omega$ and Q is a probability measure on $\mathcal{F}$, and

$$X_i: (\Omega, \mathcal{F}, Q) \to (\mathbb{R}^N, B(\mathbb{R}^N), P) \qquad \forall i = 1, 2, \ldots$$

such that

$$P(A) = Q(X_i^{-1}(A)) = \int_A p(x)\,dx$$

$\forall A \in B(\mathbb{R}^N)$ and $\forall i = 1, 2, \ldots$ Here $B(\mathbb{R}^N)$ is the Borel $\sigma$-field of $\mathbb{R}^N$.
Let

$$\mathscr{E} = \Big\{E: E = (S_1, S_2, \ldots, S_k),\ S_i \subseteq \mathbb{R}^N,\ S_i \neq \emptyset\ \forall i = 1, \ldots, k,\ \bigcup_{i=1}^{k} S_i = \mathbb{R}^N,\ S_i \cap S_j = \emptyset\ \forall i \neq j\Big\}.$$

$\mathscr{E}$ provides the set of all partitions of $\mathbb{R}^N$ into k sets as well as their permutations, i.e., if

$$E_1 = (S_1, S_2, S_3, \ldots, S_k) \in \mathscr{E}, \qquad E_2 = (S_2, S_1, S_3, \ldots, S_k) \in \mathscr{E},$$

then $E_1 \neq E_2$. Note that $E = (S_{i_1}, S_{i_2}, \ldots, S_{i_k})$ implies that each $S_{i_j}$, $1 \leq j \leq k$, is the region corresponding to class $C_j$.

Let $E_0 = (S_{01}, S_{02}, \ldots, S_{0k}) \in \mathscr{E}$ be such that each $S_{0i}$ is the region corresponding to class $C_i$ in $\mathbb{R}^N$, these regions being obtained by using the Bayes decision rule. Then

$$a = \sum_{i=1}^{k} P_i \int_{S_{0i}^c} p_i(x)\,dx \ \leq\ \sum_{i=1}^{k} P_i \int_{S_{1i}^c} p_i(x)\,dx \qquad (5)$$

$\forall E_1 = (S_{11}, S_{12}, \ldots, S_{1k}) \in \mathscr{E}$. Here a is the error probability obtained using the Bayes decision rule.
It is known from the literature that such an $E_0$ exists and belongs to $\mathscr{E}$, because the Bayes decision rule provides an optimal partition of $\mathbb{R}^N$, and for every such $E_1 = (S_{11}, S_{12}, \ldots, S_{1k}) \in \mathscr{E}$, $\sum_{i=1}^{k} P_i \int_{S_{1i}^c} p_i(x)\,dx$ provides the error probability for $E_1 \in \mathscr{E}$. Note that $E_0$ need not be unique.
Assumptions 1. Let $H_0$ be a positive integer and let there exist $H_0$ hyperplanes in $\mathbb{R}^N$ which can provide the regions $S_{01}, S_{02}, \ldots, S_{0k}$. Let $H_0$ be known a priori. Let the algorithm for generation of class boundaries using $H_0$ hyperplanes be allowed to execute for a sufficiently large number of iterations at each temperature value and for sufficiently low temperatures. Let the number of strings be t, with misclassification values $miss_1, miss_2, \ldots, miss_t$, where $0 \leq miss_1 \leq miss_2 \leq \cdots \leq miss_t$. Let $p_{ij}^{(n_1)}(T)$ denote the probability of going from string i to string j in $n_1$ steps at temperature T. It is known in the literature that for the adopted SA algorithm

$$\lim_{n_1 \to \infty} p_{ij}^{(n_1)}(T) = p_{ij}(T),$$

where $p_{ij}(T) = e^{-miss_j/T} / \sum_{l=1}^{t} e^{-miss_l/T}$. It follows that

$$\lim_{T \to 0^+} p_{ij}(T) = 1 \quad \text{for } j = 1, \qquad = 0 \quad \text{for } j \neq 1.$$

Thus it is known that, using the SA technique of (i) making $n_1 \to \infty$ and (ii) making $T \to 0^+$, one can obtain the optimal string and its value.
Let $\mathscr{A} = \{A: A \text{ is a set consisting of } H_0 \text{ hyperplanes in } \mathbb{R}^N\}$. Let $A_0 \in \mathscr{A}$ be such that it provides the regions $S_{01}, S_{02}, \ldots, S_{0k}$ in $\mathbb{R}^N$, i.e., $A_0$ provides the regions which are also obtained using the Bayes decision rule. Note that each $A \in \mathscr{A}$ generates several elements of $\mathscr{E}$. Let $\mathscr{E}_A \subseteq \mathscr{E}$ denote all possible $E = (S_1, S_2, \ldots, S_k) \in \mathscr{E}$ that can be generated from A.

Let $G = \bigcup_{A \in \mathscr{A}} \mathscr{E}_A$. Let

$$Z_{iE}(\omega) = \begin{cases} 1 & \text{if } X_i(\omega) \text{ is misclassified when } E \text{ is used as a decision rule, } E \in G,\ \forall \omega \in \Omega, \\ 0 & \text{otherwise.} \end{cases}$$

Let $f_{nE}(\omega) = (1/n) \sum_{i=1}^{n} Z_{iE}(\omega)$ when $E \in G$ is used as a decision rule. Let $f_n(\omega) = \inf\{f_{nE}(\omega): E \in G\}$.

It is to be noted that the pattern classification algorithm described earlier uses $n \times f_{nE}(\omega)$, the total number of misclassified samples, as the objective function which it attempts to minimize. This is equivalent to searching for a suitable $E \in G$ such that $f_{nE}(\omega)$ is minimized, i.e., for which $f_{nE}(\omega) = f_n(\omega)$. As already mentioned, under the conditions of Assumptions 1, the SA algorithm will certainly be able to obtain such an E.
Theorem 2. For sufficiently large n, $f_n(\omega) \leq a$ (i.e., for sufficiently large n, $f_n(\omega)$ cannot be greater than a) almost everywhere.
Proof. Let

$$Y_i(\omega) = \begin{cases} 1 & \text{if } X_i(\omega) \text{ is misclassified according to the Bayes rule, } \forall \omega \in \Omega, \\ 0 & \text{otherwise.} \end{cases}$$

Note that $Y_1, Y_2, \ldots, Y_n, \ldots$ are i.i.d. random variables. Now

$$\text{Prob}(Y_i = 1) = \sum_{j=1}^{k} \text{Prob}(Y_i = 1 \mid X_i \in C_j)\, P(X_i \in C_j) = \sum_{j=1}^{k} P_j\, \text{Prob}(\omega: X_i(\omega) \in S_{0j}^c \mid \omega \in C_j) = \sum_{j=1}^{k} P_j \int_{S_{0j}^c} p_j(x)\,dx = a.$$

Hence the expectation of $Y_i$ is given by $E(Y_i) = a$, $\forall i$. Then, by the Strong Law of Large Numbers [11], $(1/n) \sum_{i=1}^{n} Y_i \to a$ almost everywhere, i.e.,

$$Q\left(\omega: \frac{1}{n} \sum_{i=1}^{n} Y_i(\omega) \not\to a\right) = 0.$$

Let

$$B = \left\{\omega: \frac{1}{n} \sum_{i=1}^{n} Y_i(\omega) \to a\right\} \subseteq \Omega.$$

Then Q(B) = 1. Note that $f_n(\omega) \leq (1/n) \sum_{i=1}^{n} Y_i(\omega)$, $\forall n$ and $\forall \omega$, since the set of regions $(S_{01}, S_{02}, \ldots, S_{0k})$ obtained by the Bayes decision rule is also provided by some $A \in \mathscr{A}$ and consequently is included in G. Note also that $0 \leq f_n(\omega) \leq 1$, $\forall n$ and $\forall \omega$. Let $\omega \in B$. For every $\omega \in B$, $U(\omega) = \{f_n(\omega): n = 1, 2, \ldots\}$ is a bounded, infinite set. Then, by the Bolzano-Weierstrass theorem [12], there exists an accumulation point of $U(\omega)$. Let $y = \sup\{y_0: y_0 \text{ is an accumulation point of } U(\omega)\}$. From elementary mathematical analysis we can conclude that $y \leq a$, since $(1/n) \sum_{i=1}^{n} Y_i(\omega) \to a$ almost everywhere and $f_n(\omega) \leq (1/n) \sum_{i=1}^{n} Y_i(\omega)$. Thus it is proved that, for sufficiently large n, $f_n(\omega)$ cannot be greater than a for $\omega \in B$. □
The theorem proved above indicates that as the size of the training data set is increased, the performance of the SA classifier approaches that of the Bayes classifier. The fact that $f_n(\omega) < a$ can hold only for a finite number of sample points, since many distributions can generate these points; as the size of the data set goes to infinity, only one distribution can possibly generate all the points [13]. Also, since the Bayes classifier is the optimal one in a statistical framework and there can be no better classifier, the above claim (that $f_n(\omega) \leq a$) can only mean that $f_n(\omega) = a$; in other words, the performance of the SA classifier tends to that of the Bayes classifier in the limiting case. This, in turn, indicates that under limiting conditions the boundary provided by the SA classifier approaches the Bayes boundary. This is experimentally demonstrated in Section 5.4.

Note: The term 'sufficiently large' is borrowed from the statistics literature and indicates the mathematical condition $n \to \infty$.
5. Implementation and results
The data sets used for demonstrating the effectiveness of the SA classifier are the following.

ADS 1: This two-dimensional artificial data set (Fig. 3) consists of 557 data points belonging to two classes. The classes are separable but have a non-linear class boundary.

Vowel data: This real life speech data set consists of 871 Indian Telugu vowel sounds in six classes, represented by {∂, a, i, e, o, u} [14]. It has three features corresponding to the first, second and third formant frequencies. Fig. 4 shows the data set in the first and second formant frequency plane.

Iris data: This four-dimensional data set for a specific category of irises has 150 points in three classes [15]. The features correspond to the sepal width and length and the petal width and length in centimeters.
Data Set 1: This two-dimensional data set, used for verifying the theoretical result in Section 4, is generated using triangular distributions for the two classes, 1 and 2. The range for class 1 is $[0, 2] \times [0, 2]$ and that for class 2 is $[1, 3] \times [0, 2]$, with the corresponding peaks at (1, 1) and (2, 1). If $P_1$ is the a priori probability of class 1, then, using elementary mathematics, we can show that the Bayes classifier will assign a point to class 1 if its X coordinate is less than $1 + P_1$. This indicates that the Bayes decision boundary is given by

$$x = 1 + P_1. \qquad (6)$$
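A short derivation of Eq. (6) (our reconstruction, assuming, consistently with the stated ranges and peaks, that each class density factorizes into independent triangular marginals with identical Y marginals):

    % In the overlap region 1 <= x <= 2 the triangular X-marginals are
    %   p_1(x) = 2 - x  (class 1: support [0,2], mode 1),
    %   p_2(x) = x - 1  (class 2: support [1,3], mode 2).
    % The identical Y-marginals cancel, so the Bayes rule assigns class 1 when
    % P_1 p_1(x) > P_2 p_2(x), with P_2 = 1 - P_1:
    \begin{align*}
      P_1 (2 - x) &> (1 - P_1)(x - 1) \\
      2 P_1       &> x - 1 + P_1      \\
      x           &< 1 + P_1 ,
    \end{align*}
    % which gives the decision boundary x = 1 + P_1 of Eq. (6).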
5.1. Performance of the SA classifier

The parameters of SA are as follows: Tmax = 100, Tmin = 0.01, r = 0.9, k = 100. Accordingly, the maximum number of iterations will be 8800 (reducing T from 100 to below 0.01 by the rule T = 0.9T takes 88 temperature levels, with k = 100 moves at each). In order to generate a new string, one randomly chosen bit is flipped. The results shown are the average values of five different runs of the algorithm.
Fig. 3. Artificial data set ADS 1 (557 points in two classes, plotted as '1' and '2' in the X-Y feature plane).
Table 1 shows the overall training performance of the SA classifier for the data sets ADS 1, Vowel and Iris, using five values of H, when 10% of the data set is used for training. As expected, the training score generally improves to a maximum of 100% as the number of hyperplanes is increased, since more hyperplanes can more readily fit the training data set and thereby reduce the number of misclassified points. Note that, because of the considerable amount of overlap in the Vowel data, even H = 8 could not provide zero misclassification.

Tables 2 and 3 show the test results of the SA classifier for these three data sets, for five values of H, when 10% and 30% of the data set are used for training while the remaining 90% and 70% of the data are used for testing, respectively.
Unlike the training performance, the test recognition score improves initially as H is increased up to a specific value, beyond which the score decreases. For example, consider H = 6 and H = 8 of Table 2 for ADS 1, where the score decreases during testing although it remained constant (at 100%) during training (Table 1). This indicates that H = 8 leads to overfitting of the classes during training, thereby reducing the generalization capability of the classifier during testing. Similar is the case for H = 6 and 8 for ADS 1 in Table 3. As expected, the overall recognition capability of the classifier increases when the size of the training data set is increased from 10% in Table 2 to 30% in Table 3.
Fig. 4. Real life speech data (Vowel data) in the first and second formant frequency (F1-F2, in Hz) plane; marker size indicates frequency of occurrence (1-2, 3-5, 6-9, 10-14, 15 and above).
Table 1
Performance during training of the SA classifier for different values of H using 10% training data

Data set    H = 3    H = 4    H = 5    H = 6    H = 8
ADS 1       94.54    98.18    100.0    100.0    100.0
Vowel       52.94    74.71    95.29    96.65    95.29
Iris        100.0    100.0    100.0    100.0    100.0
Table 2
Performance during testing of the SA classifier for different values of H, for 10% training and 90% test data

Data set    H = 3    H = 4    H = 5    H = 6    H = 8
ADS 1       91.63    92.23    93.02    93.02    88.64
Vowel       63.35    65.60    76.84    74.55    70.73
Iris        89.63    93.33    93.33    93.33    77.78
Table 3
Performance during testing of the SA classifier for different values of H, for 30% training and 70% test data

Data set    H = 3    H = 4    H = 5    H = 6    H = 8
ADS 1       91.28    96.92    98.72    96.41    96.20
Vowel       65.60    67.48    75.98    75.00    79.90
Iris        93.33    95.23    94.28    91.42    94.28
5.2. Replacing simulated annealing with genetic algorithm
Genetic Algorithms (GAs) [6] constitute another evolutionary search paradigm, based on the principles of natural genetic systems and survival of the fittest. Like SA, GAs generally work with a binary string encoding of the parameters of the search problem. Instead of dealing with a single string or chromosome, however, a GA operates on a collection of strings termed a population. A fitness value, which is to be maximized, is associated with each string and represents its degree of goodness. Several biologically inspired operators like selection, crossover and mutation are applied iteratively over a number of generations to generate potentially better solutions. Termination occurs when either a maximum number of iterations has been executed or a user specified criterion is satisfied. Details of the method can be found in [4,6].
The fitness computation method is the same as the process of calculating
the energy associated with a string (see Section 3.2). A roulette wheel selection strategy, single point crossover with probability 0.8, and bit-wise mutation with a variable mutation probability in the range [0.015, 0.333] [5], with a population size of 20, are chosen for the GA. The maximum number
of iterations is fixed at 1500. The comparative performance (in terms of both
the test score and number of iterations required for attaining zero misclassification during training) of SA and GA for the classification problem is presented in Table 4, when 10% data are considered for training and the
remaining 90% for testing. An entry '-' in iter. field indicates that zero misclassification could not be achieved even after the maximum number of iterations was executed.
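For completeness, a hedged Python sketch of such a GA comparator follows (our illustration of the operators named above, not the implementation of [5]; fitness is taken as 1/(1 + miss), and a fixed mutation rate replaces the variable schedule of [5] purely for brevity).

    import random

    def ga_search(energy, length, pop_size=20, max_iter=1500, p_cross=0.8, p_mut=0.05):
        """Roulette-wheel selection, single-point crossover and bit-wise mutation
        over binary strings; lower energy (miss) means higher fitness."""
        pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        best, best_e = None, float("inf")
        for _ in range(max_iter):
            energies = [energy(ind) for ind in pop]
            for ind, e in zip(pop, energies):
                if e < best_e:
                    best, best_e = ind[:], e
            if best_e == 0:                       # zero misclassification reached
                break
            fits = [1.0 / (1.0 + e) for e in energies]
            total = sum(fits)

            def roulette():
                r, acc = random.uniform(0.0, total), 0.0
                for ind, f in zip(pop, fits):
                    acc += f
                    if acc >= r:
                        return ind
                return pop[-1]

            new_pop = []
            while len(new_pop) < pop_size:
                a, b = roulette()[:], roulette()[:]
                if random.random() < p_cross:     # single-point crossover
                    cut = random.randrange(1, length)
                    a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
                for child in (a, b):
                    for i in range(length):       # bit-wise mutation
                        if random.random() < p_mut:
                            child[i] ^= 1
                    new_pop.append(child)
            pop = new_pop[:pop_size]
        return best, best_e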
As is evident from Table 4, the test recognition scores of the GA and SA based classifiers are comparable. Although the number of iterations required to attain zero misclassification is smaller for GA than for SA, the number of string evaluations is much larger, since one iteration of GA corresponds to a maximum of 20 strings (the population size), whereas exactly one new string is evaluated in each iteration of SA. On this count, GA requires at most 10 240 and 440 string evaluations for ADS 1 and Iris respectively, which is significantly more than that required by SA. However, one must note that of the 10 240 (or 440) strings evaluated by GA for ADS 1 (or Iris), there will be many replications; in fact, only a relatively small fraction of the strings will be unique.
Table 4
Comparative performance of SA and GA for classification, H = 6

            GA                    SA
Data set    iter.    score        iter.    score
ADS 1       512      93.22        5815     93.02
Vowel       -        71.99        -        74.55
Iris        22       93.33        97       93.33
5.3. Comparison with other classifiers
The performance of the SA classifier is compared with those of the Bayes maximum likelihood classifier, the multilayered perceptron (MLP) and the k-NN rule. Both the MLP (with hard delimiters) and the k-NN rule are known to provide piecewise linear boundaries, which is also the underlying philosophy of the SA classifier. The k-NN algorithm is executed taking k equal to √n, where n is the number of training data points; it can be proved that for such a form of k, the error probability of the k-NN rule approaches the Bayes error probability. For the Bayes maximum likelihood classifier, unequal dispersion matrices and unequal a priori probabilities ( = n_i/n for n_i patterns from class i) are considered. In each case, a multivariate normal distribution of the samples is assumed.
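For instance, the k = √n baseline can be reproduced along the following lines (a sketch assuming scikit-learn is available; the paper does not specify how k-NN was implemented):

    from math import sqrt

    from sklearn.neighbors import KNeighborsClassifier

    def knn_sqrt_n_score(X_train, y_train, X_test, y_test):
        """k-NN recognition score (percent) with k = sqrt(n), n the training size."""
        k = max(1, round(sqrt(len(X_train))))
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        return 100.0 * clf.score(X_test, y_test)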
For the MLP, the learning rate and momentum factor are 0.9 and 0.1, respectively. Online connection weight updating, i.e., updating after the presentation of each training data point, is performed. A maximum of 10 000 iterations is allowed. The network architectures for the ADS 1, Vowel and Iris data sets are 2-5-2, 3-8-6 and 4-5-3 respectively, where the first and last numbers represent the numbers of nodes in the input and output layers, and the intermediate number(s) represent the number of nodes in the hidden layer(s).
The results in Table 5 show that the SA classifier provides performance superior to all the other classifiers for both ADS 1 (where k-NN is known to perform well) and Iris. For the Vowel data, the result of the Bayes classifier is the best; in fact, the Bayes classifier is known to perform well for this data set [14]. In this case also, the recognition score of the SA classifier is closer to the Bayes score than those of the MLP and k-NN.
5.4. Empirical verification of the theoretical result

As a consequence of the theorem in Section 4, the boundary provided by the SA classifier approaches the Bayes boundary under limiting conditions. Fig. 5(a)-(c) demonstrates that this is indeed the case for Data Set 1. Here the Bayes boundary is the straight line x = 1.4, and the SA line is marked with an arrow; Fig. 5(a)-(c) shows the SA lines obtained for n = 100, 1000 and 4000, respectively. Only 100 data points are plotted in each figure for clarity.
Table 5
Comparative test performance with 10% training data

Data set    SA classifier (H = 6)    Bayes max. like. class.    MLP      k-NN (k = √n)
ADS 1       93.02                    85.65                      82.47    90.23
Vowel       74.55                    77.73                      60.30    70.35
Iris        93.33                    83.22                      74.81    90.37
Fig. 5. (a) Data Set 1 for n = 100 and the boundary provided by the SA classifier (marked with an arrow), along with the Bayes decision boundary. (b) Data Set 1 for n = 1000 and the boundary provided by the SA classifier, along with the Bayes decision boundary. (c) Data Set 1 for n = 4000 and the boundary provided by the SA classifier, along with the Bayes decision boundary. (Class 1 is represented by '+' and class 2 by 'o'.)
It is evident from the figures that as n increases from 100 to 4000, the SA line approaches the Bayes line, so much so that for n = 4000 the two lie very close to each other.
6. Discussion and conclusions
A pattern classification methodology in $\mathbb{R}^N$, using simulated annealing for the search and placement of a number of hyperplanes in order to approximate the class boundaries of a given training data set, has been described. An extensive comparison of the methodology with other classifiers, namely the Bayes classifier (which is well known for discriminating overlapping classes), the k-NN classifier and the MLP (which are well known for discriminating non-overlapping, non-linear regions by generating piecewise linear boundaries), is also presented. The results of the proposed algorithm are seen to be comparable to, and sometimes better than, theirs in discriminating both overlapping and non-overlapping, non-convex regions.
A distinguishing feature of this approach is that the boundaries (approximated by piecewise linear segments) are generated explicitly and then used for making decisions. This is unlike conventional methods or MLP based approaches, where the boundaries arise only as a consequence of the respective decision making processes.
A theoretical analysis of the aforesaid classifier establishes that under limiting conditions of infinitely large training data sets, the error rate of the SA classifier during training is less than or equal to that of the Bayes classifier. This, in
turn, indicates that when the size of the training data set goes to infinity, the
boundary provided by the SA classifier approaches the Bayes boundary. This
finding is also experimentally verified for a data set generated using a triangular distribution, for which the Bayes boundary is known exactly.
A comparison of SA with GA for this classification problem shows that
both perform comparably in terms of the test recognition scores. This is expected, since both are stochastic optimization techniques, working on the same
principle of approximating the class boundaries using a number of hyperplanes. In terms of string evaluations required to obtain the optimal performance, SA appears to score over GA. However, one must note that it is
very difficult to obtain the actual number of distinct string evaluations in
GA, since strings are often replicated. The actual number of distinct evaluations will, in fact, be a small fraction of the quantity (population size x number
of iterations).
Although SA is found to perform comparably to GA, there appear to be
several factors contributing to the predominance of GAs in the literature. In
SA, two main control parameters are to be selected appropriately in order to
obtain good performance. These are the values of r (which controls the
sequence of T) and k (the number of iterations executed at each temperature).
On the other hand, in GA, only the maximum number of iterations (or stopping time) must be appropriately selected. Beyond this, both methods need proper tuning of several other parameters, e.g., Tmax and Tmin in SA, and the crossover and mutation probabilities in GA. Additionally, in the advanced stages of
the SA algorithm, the temperature values should be smaller than the smallest
difference of the energy values in order to provide good performance. Since
for the pattern classification problem, this value is 1 (minimum non-zero difference of number of misclassified points) and Tmin = 0.01, this requirement is
met. GA, with roulette wheel selection, is, on the other hand, immune to this
difference. Finally, since SA is inherently sequential in nature, not much
improvement can be derived in parallel computing platforms, while there is
scope for such improvement in GA. One must note that very basic versions
of both SA and GA are used here. Use of enhanced models and improved
operators for both SA and GA may provide better performance. For example,
in the case of SA, other cooling schedules [16,17] may be used. Similarly, modified versions of GA, such as the genetic algorithm with chromosome differentiation (GACD), which has been found to improve classification performance [18], may be applied.
Acknowledgements
This work was carried out when Ms. Sanghamitra Bandyopadhyay held a
fellowship awarded by the Department of Atomic Energy, Govt. of India.
References
[1] P.J.M. van Laarhoven, E.H.L. Aarts, Simulated Annealing: Theory and Applications, Reidel,
Dordrecht, 1987.
[2] S. Kirkpatrick, C.D. Gelatt Jr., M.P. Vecchi, Optimization by simulated annealing, Science
220 (1983) 671-680.
[3] L. Davis, ed., Genetic Algorithms and Simulated Annealing, Pitman Publishing, London,
1987.
[4] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer,
Berlin, 1992.
[5] S. Bandyopadhyay, C.A. Murthy, S.K. Pal, Pattern classification using genetic algorithms,
Patt. Recog. Lett. 16 (8) (1995) 801-808.
[6] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[7] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, MA,
1974.
[8] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, 1991.
[9] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller, Equation of state
calculations by fast computing machines, J. Chem. Phys. 21 (1953) 1087-1091.
[10] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Patt. Anal. Machine Intell. PAMI-6 (1984) 721-741.
[11] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[12] T.M. Apostol, Mathematical Analysis, Narosa, New Delhi, 1985.
[13] R.B. Ash, Real Analysis and Probability, Academic Press, New York, 1972.
[14] S.K. Pal, S. Mitra, Multilayer perceptron, fuzzy sets and classification, IEEE Trans. Neural
Networks 3 (5) (1992) 683-697.
[15] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7
(1936) 179-188.
[16] S. Shinomoto, Y. Kabashima, Finite time scaling of energy in simulated annealing, J. Phys. A
24 (1991) L141-L144.
[17] E.H.L. Aarts, P.J.M. van Laarhoven, A pedestrian review on the theory and applications of
simulated annealing algorithm, in: J. van Hemmen, I. Morgenstern (Eds.), Heidelberg
Colloquium on Glassy Dynamics, Springer, Berlin, 1987, pp. 288-307.
[18] S. Bandyopadhyay, S.K. Pal, Pattern classification with genetic algorithms: Incorporation of
chromosome differentiation, Pattern Recog. Lett. 18 (1997) 119-131.