Information Sciences 109 (1998) 165-184

Simulated annealing based pattern classification

Sanghamitra Bandyopadhyay (a), Sankar K. Pal (a,*), C.A. Murthy (b,1)

(a) Machine Intelligence Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Calcutta 700 035, India
(b) Department of Statistics, 326 Thomas Building, Pennsylvania State University, University Park, PA 16802, USA

Received 11 February 1997; received in revised form 20 September 1997; accepted 18 November 1997

Abstract

A method is described for finding decision boundaries, approximated by piecewise linear segments, for classifying patterns in $\mathbb{R}^N$, $N \geq 2$, using Simulated Annealing (SA). It involves generation and placement of a set of hyperplanes (represented by strings) in the feature space that yields minimum misclassification. Theoretical analysis shows that as the size of the training data set approaches infinity, the boundary provided by the SA based classifier will approach the Bayes boundary. The effectiveness of the classification methodology, along with the generalization ability of the decision boundary, is demonstrated for both artificial data and real life data sets having non-linear/overlapping class boundaries. Results are compared extensively with those of the Bayes classifier, the k-NN rule, the multilayer perceptron, and Genetic Algorithms, another popular evolutionary technique. Empirical verification of the theoretical claim is also provided.

Keywords: Pattern recognition; Evolutionary computation; Genetic algorithm; Bayes error probability

* Corresponding author. 1 On leave from the Indian Statistical Institute.

1. Introduction

Simulated Annealing (SA) [1-4] belongs to the class of local search algorithms. It utilizes the principles of statistical mechanics, regarding the behaviour of a large number of atoms at low temperature, for finding minimal cost solutions to large optimization problems by minimizing the associated energy. Let E(q, T) be the energy at temperature T when the system is in state q. Let a new state s be generated. Then state s is accepted in favour of state q with probability

$$p_{qs} = \frac{1}{1 + \exp\left(-\frac{E(q,T) - E(s,T)}{T}\right)}.$$

In statistical mechanics, investigating the ground states or low energy states of matter is of fundamental importance. These states are achieved at very low temperature. However, it is not sufficient to lower the temperature alone, since this results in unstable states. In the annealing process, the temperature is first raised, then decreased gradually to a very low value (Tmin), while ensuring that one spends sufficient time at each temperature value. This process yields stable low energy states.

Pattern classification can be viewed as a problem of search and placement of a number, H, of hyperplanes (fixed a priori) which can model the decision boundary of the given data set appropriately. The criterion to be minimized is the number of samples of the given training data that are misclassified for a particular arrangement of the H hyperplanes. The arrangement of hyperplanes that minimizes the number of misclassified data points is considered to provide the decision boundary of the given training data set.
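For illustration (a sketch of ours, not part of the paper), the acceptance rule above can be written directly as a small Python function; the function name and the example energies are arbitrary:

    import math
    import random

    def acceptance_probability(e_q: float, e_s: float, temperature: float) -> float:
        """Probability of accepting candidate state s over current state q, following
        p_qs = 1 / (1 + exp(-(E(q,T) - E(s,T)) / T))."""
        return 1.0 / (1.0 + math.exp(-(e_q - e_s) / temperature))

    # A lower-energy candidate is accepted with probability greater than 0.5;
    # a higher-energy one may still be accepted, especially at high temperature.
    print(acceptance_probability(e_q=10.0, e_s=7.0, temperature=100.0))   # ~0.51
    print(acceptance_probability(e_q=10.0, e_s=7.0, temperature=0.1))     # ~1.0
    print(acceptance_probability(e_q=10.0, e_s=13.0, temperature=0.1))    # ~0.0

    accepted = random.random() < acceptance_probability(10.0, 13.0, temperature=50.0)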
The present article describes a methodology demonstrating the searching ability of SA for finding an appropriate arrangement of H hyperplanes that minimizes the number of misclassified points. The effectiveness of the classifier has been adequately established for several artificial and real life data sets with both overlapping and non-overlapping class boundaries. The results are also compared with a similar approach [5] based on genetic algorithms (GA) [4,6], the Bayes maximum likelihood classifier, the k-NN rule [7] and the multilayered perceptron (MLP) [8]. In addition, a theoretical analysis along with an empirical verification is presented, which shows that as the size of the training data set goes to infinity, the SA based classifier (or SA classifier) will provide an error rate on the training data which is less than or equal to the Bayes error probability. (In this regard it may be mentioned that the Bayes maximum likelihood classifier [7] is one of the most widely used statistical pattern classifiers, providing optimal performance from the standpoint of error probability in a statistical framework. It is known to be the best classifier when the class distributions and the a priori probabilities are known. Consequently, the desirable property of any classifier is that it should approximate or approach the Bayes classifier under limiting conditions.)

A brief discussion on the principles of SA is first presented in Section 2. This is followed by a detailed description of the SA classifier in Section 3. The theoretical analysis is provided in Section 4, followed by the implementation results in Section 5. Finally, the discussion and conclusions are presented in the last section.

2. Simulated annealing: Basic principles

In the recent past, the application of techniques having physical or natural correspondence for solving difficult optimization problems has received widespread attention. It has been found that these techniques consistently outperform classical methods like gradient descent search when the search space is large, complex and multimodal. SA is one such paradigm, having its foundation in statistical mechanics, which studies the behaviour of a very large system of interacting components in thermal equilibrium. In statistical mechanics, if the system is in thermal equilibrium at temperature T, the probability $\pi_T(s)$ that the system is in state s, $s \in S$, S being the state space, is given by

$$\pi_T(s) = \frac{e^{-E(s)/kT}}{\sum_{w \in S} e^{-E(w)/kT}}, \qquad (1)$$

where k is Boltzmann's constant and E(s) is the energy of the system in state s. Metropolis et al. [9] developed a technique to simulate the behaviour of the system in thermal equilibrium at temperature T as follows. Let the system be in state q at time t. Then the probability p that it will be in state s at time t + 1 is given by

$$p = \frac{\pi_T(s)}{\pi_T(q)} = e^{-(E(s) - E(q))/kT}. \qquad (2)$$

If the energy of the system in state s is less than that in state q, then p > 1 and the state s is automatically accepted. Otherwise it is accepted with probability p. Thus it is also possible to attain states with higher energy values. It can be shown that for $t \to \infty$, the probability that the system is in state s is given by $\pi_T(s)$, irrespective of the starting configuration [10].

When dealing with a system of particles, it is important to investigate very low energy states, which predominate at extremely low temperatures. To achieve such states, it is not sufficient to lower the temperature.
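As a small numerical illustration of Eq. (1) (ours, not from the paper), the following Python snippet computes the equilibrium distribution over a handful of hypothetical states and shows how lowering T concentrates the probability mass on the minimum-energy state:

    import math

    def boltzmann_distribution(energies, temperature, k=1.0):
        """Stationary probabilities pi_T(s) = exp(-E(s)/kT) / sum_w exp(-E(w)/kT), Eq. (1)."""
        weights = [math.exp(-e / (k * temperature)) for e in energies]
        total = sum(weights)
        return [w / total for w in weights]

    energies = [0.0, 1.0, 2.0, 5.0]            # energies of four hypothetical states
    for T in (10.0, 1.0, 0.1):
        probs = boltzmann_distribution(energies, T)
        print(T, [round(p, 3) for p in probs])
    # As T decreases, the probability mass concentrates on the minimum-energy state,
    # which is why reaching thermal equilibrium at low temperature yields ground states.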
An annealing schedule is therefore used, where the temperature is first increased and then decreased gradually, spending enough time at each temperature in order to reach thermal equilibrium.

In this article we have used the annealing process of the Boltzmann machine, which is a variant of the Metropolis algorithm. Here, at a given temperature T, the new state s is chosen over the current state q with probability

$$p_{qs} = \frac{1}{1 + \exp\left(-\frac{E(q,T) - E(s,T)}{T}\right)}.$$

The parameters of the search space are encoded in the form of a bit string of fixed length. The objective value associated with the string is computed and mapped to its energy. The string with the minimum energy value provides the solution to the problem. The initial string (say q) of 0s and 1s is generated randomly and its energy value is computed. Keeping the initial temperature high (say T = Tmax), a neighbour of the string (say s) is generated by randomly flipping one bit. The energy of the new string is computed, and it is accepted in favour of q with the probability $p_{qs}$ mentioned above. This process is repeated a number of times (say k) keeping the temperature constant. Then the temperature is decreased using the equation T = rT, where 0 < r < 1, and the k loops are executed again as before. This process is continued till a minimum temperature (say Tmin) is attained. The simulated annealing steps are shown in Fig. 1.

    Begin
        generate the initial string randomly -> q
        T = Tmax
        Let E(q, T) be the associated energy
        while (T >= Tmin)
            for i = 1 to k
                Mutate (flip) a random position in q to yield s
                Let E(s, T) be the associated energy
                Set q <- s with probability 1 / (1 + exp(-(E(q,T) - E(s,T)) / T))
            end for
            T = rT
        end while
        Decode the string q to provide the solution of the problem.
    End

Fig. 1. Steps of simulated annealing.
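A minimal Python rendering of the loop in Fig. 1 is given below as a sketch (ours, not the authors' code); the default parameter values follow those used later in Section 5.1, and the energy function is supplied by the caller:

    import math
    import random

    def simulated_annealing(energy, length, t_max=100.0, t_min=0.01, r=0.9, k=100, seed=None):
        """Minimal bit-string SA loop following Fig. 1; `energy(bits)` maps a list of
        0/1 values to the objective to be minimized (e.g. the misclassification count)."""
        rng = random.Random(seed)
        q = [rng.randint(0, 1) for _ in range(length)]     # initial random string
        e_q = energy(q)
        t = t_max
        while t >= t_min:
            for _ in range(k):
                s = q[:]                                   # neighbour: flip one random bit
                pos = rng.randrange(length)
                s[pos] = 1 - s[pos]
                e_s = energy(s)
                delta = (e_q - e_s) / t
                # acceptance probability 1 / (1 + exp(-(E(q,T) - E(s,T)) / T)),
                # guarded against overflow for very large |delta|
                if delta > 700:
                    p = 1.0
                elif delta < -700:
                    p = 0.0
                else:
                    p = 1.0 / (1.0 + math.exp(-delta))
                if rng.random() < p:                       # set q <- s with probability p
                    q, e_q = s, e_s
            t = r * t                                      # cooling schedule T = rT
        return q, e_q

    # Toy usage: minimize the number of 1s in a 20-bit string (optimum: the all-zero string).
    best, best_energy = simulated_annealing(energy=sum, length=20, seed=0)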
3. Description of the SA classifier

The correspondence between the physical aspects of SA and an optimization problem is as follows: the parameters of the search space (in this case the H hyperplanes) are encoded in strings (usually binary), and these represent the different states; low energy states correspond to near optimal solutions (or an arrangement of the hyperplanes that provides minimum misclassification); the energy corresponds to the objective function (or the number of misclassified samples); and the temperature is a controlling parameter of the system. The important tasks here are to establish a way of representing and generating different configurations (or states) of the problem, and an annealing schedule. These are now discussed in detail.

3.1. State/hyperplane representation

In this article, a binary string of length l is used to encode the parameters of the H hyperplanes. From elementary geometry, the equation of a hyperplane in N-dimensional space $(X_1, X_2, \ldots, X_N)$ is given by

$$x_N \cos\alpha_{N-1} + \beta_{N-1} \sin\alpha_{N-1} = d, \qquad (3)$$

where

$$\beta_{N-1} = x_{N-1} \cos\alpha_{N-2} + \beta_{N-2} \sin\alpha_{N-2},$$
$$\beta_{N-2} = x_{N-2} \cos\alpha_{N-3} + \beta_{N-3} \sin\alpha_{N-3},$$
$$\vdots$$
$$\beta_1 = x_1 \cos\alpha_0 + \beta_0 \sin\alpha_0.$$

The various parameters are as follows: $x_i$ is the ith feature value and $(x_1, x_2, \ldots, x_N)$ a point on the hyperplane; $\alpha_{N-1}$ is the angle that the unit normal to the hyperplane makes with the $X_N$ axis; $\alpha_{N-2}$ the angle that the projection of the normal in the $(X_1, X_2, \ldots, X_{N-1})$ space makes with the $X_{N-1}$ axis; and so on; $\alpha_1$ the angle that the projection of the normal in the $(X_1, X_2)$ plane makes with the $X_2$ axis; $\alpha_0$ the angle that the projection of the normal in the $(X_1)$ space makes with the $X_1$ axis, which is 0, so that $\beta_0 \sin\alpha_0 = 0$; and d the perpendicular distance of the hyperplane from the origin. Thus the N-tuple $(\alpha_1, \alpha_2, \ldots, \alpha_{N-1}, d)$ specifies a hyperplane in N-dimensional space.

Each angle $\alpha_j$, $j = 1, 2, \ldots, N-1$, is allowed to vary in the range $[0, 2\pi]$. If $b_1$ bits are used to represent an angle, then the possible values of $\alpha_j$ are $0, \delta 2\pi, 2\delta 2\pi, 3\delta 2\pi, \ldots, (2^{b_1}-1)\delta 2\pi$, where $\delta = 1/2^{b_1}$. Consequently, if the $b_1$ bits contain a binary string having the decimal value $v_1$, then the angle is given by $v_1 \delta 2\pi$. Once the angles are fixed, the orientation of the hyperplane becomes fixed, and only d must then be specified in order to specify the hyperplane. For this purpose the hyperrectangle enclosing the training points is considered. Let $(x_i^{min}, x_i^{max})$ be the minimum and maximum values of feature $X_i$ as obtained from the training points. Then the vertices of the enclosing hyperrectangle are given by $(x_1^{ch_1}, x_2^{ch_2}, \ldots, x_N^{ch_N})$, where each $ch_i$, $i = 1, 2, \ldots, N$, can be either max or min. (Note that there will be $2^N$ vertices.) Let diag be the length of the diagonal of this hyperrectangle, given by

$$diag = \sqrt{(x_1^{max} - x_1^{min})^2 + (x_2^{max} - x_2^{min})^2 + \cdots + (x_N^{max} - x_N^{min})^2}.$$

A hyperplane is designated as the base hyperplane with respect to a given orientation (i.e., for some $\alpha_1, \alpha_2, \ldots, \alpha_{N-1}$) if: (i) it has the same orientation, (ii) it passes through one of the vertices of the enclosing hyperrectangle, and (iii) its perpendicular distance from the origin is minimum among the hyperplanes passing through the other vertices. Let this distance be $d_{min}$. If $b_2$ bits are used to represent d, then a value of $v_2$ in these bits represents a hyperplane with the given orientation and for which d is given by $d_{min} + (diag/2^{b_2})\, v_2$. Thus a string has a fixed length of $l = H((N-1)b_1 + b_2)$, where H is the number of hyperplanes. The initial string is generated randomly.

Note that we have used this recursive form of representation over the classical one, viz. $l_1 x_1 + l_2 x_2 + \cdots + l_N x_N = d$, where $l_1, \ldots, l_N$ are known as the direction cosines. The latter representation involves a constraint equation, $l_1^2 + l_2^2 + \cdots + l_N^2 = 1$. This, in turn, leads to the complicated issue of obtaining invalid or unacceptable solutions when the constraint equation is violated. The representation that we have chosen avoids this problem by being unconstrained in nature.

3.2. Energy/objective value computation

A string encodes the parameters of H hyperplanes as described earlier. Using these parameters, the region in which each training pattern point lies is determined from Eq. (3). A region is said to provide the demarcation for class i if the maximum number of points lying in that region belong to class i. The other points that lie in the region are considered to be misclassified. The misclassifications associated with all the regions (for these H hyperplanes) are summed up to provide the total misclassification, miss, for the string, which represents its energy.
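To make Sections 3.1 and 3.2 concrete, the following Python sketch decodes a bit string into H hyperplanes and computes the misclassification count used as the energy. It is our illustration, not the authors' code; in particular the function names, and the choice of the base-hyperplane offset $d_{min}$ as the smallest value of the left-hand side of Eq. (3) over the vertices of the enclosing hyperrectangle, are our reading of the text:

    import itertools
    import math
    from typing import List, Sequence, Tuple

    def eval_lhs(point: Sequence[float], angles: Sequence[float]) -> float:
        """Left-hand side of Eq. (3): beta_1 = x_1 (since alpha_0 = 0), then
        beta_k = x_k cos(alpha_{k-1}) + beta_{k-1} sin(alpha_{k-1})."""
        beta = point[0]
        for j in range(1, len(point)):
            beta = point[j] * math.cos(angles[j - 1]) + beta * math.sin(angles[j - 1])
        return beta

    def decode_string(bits: List[int], H: int, N: int, b1: int, b2: int,
                      mins: Sequence[float], maxs: Sequence[float]) -> List[Tuple[List[float], float]]:
        """Decode a bit string of length H*((N-1)*b1 + b2) into H hyperplanes (angles, d)."""
        diag = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
        vertices = list(itertools.product(*zip(mins, maxs)))    # the 2^N corners of the box
        planes, pos = [], 0
        for _ in range(H):
            angles = []
            for _ in range(N - 1):
                v1 = int("".join(map(str, bits[pos:pos + b1])), 2)
                angles.append(v1 * 2.0 * math.pi / (2 ** b1))
                pos += b1
            v2 = int("".join(map(str, bits[pos:pos + b2])), 2)
            pos += b2
            d_min = min(eval_lhs(v, angles) for v in vertices)  # base-hyperplane offset (our reading)
            d = d_min + (diag / (2 ** b2)) * v2
            planes.append((angles, d))
        return planes

    def misclassifications(points, labels, planes) -> int:
        """Energy of a string: assign each region its majority class and count the rest."""
        regions = {}
        for x, y in zip(points, labels):
            key = tuple(eval_lhs(x, a) >= d for a, d in planes)  # side pattern w.r.t. the H planes
            regions.setdefault(key, []).append(y)
        return sum(len(ys) - max(ys.count(c) for c in set(ys)) for ys in regions.values())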
3.3. New state generation process and annealing schedule

For generating a new configuration, one (or more) random position(s) in the bit string is chosen and flipped. This provides a new string, whose energy is computed in the manner described above. As already mentioned, the crucial task here is the attainment of low energy states, which are obtained at very low temperatures. If the temperature is decreased quickly, the low energy states tend to be unstable. In order to reach stable states, the temperature must be initially increased and then decreased gradually, allowing sufficient time at each temperature. This process is known as annealing. In order to simulate this method, the temperature is initially kept high (= Tmax). A parameter k is used to control the time spent at each temperature value. The temperature is decreased according to the formula T = rT, where 0 < r < 1; a higher value of r indicates a more gradual annealing schedule. The different steps of the SA classifier are shown in Fig. 2. The process continues until either a string with no misclassified points is obtained (miss = 0) or a user specified minimum temperature value (= Tmin) is attained. The final string q at termination provides the solution to the problem.

Fig. 2. Steps of the SA classifier (flowchart: q <- initial randomly generated string; compute miss(q, T); repeatedly mutate a random position of q to yield s, compute miss(s, T), and set q <- s with probability $1/(1 + e^{-(miss(q,T) - miss(s,T))/T})$; after every k such steps set T = rT; at termination, decode the parameters encoded in the final string q).

4. Relationship with Bayes error probability

In this section we study the theoretical relationship between the SA classifier and the Bayes classifier in terms of their error probabilities. The mathematical notation and preliminary definitions are described first. This is followed by the claim that for $n \to \infty$ the performance of the SA classifier will in no way be worse than that of the Bayes classifier. Finally, some critical comments about the proof are mentioned.

Let there be k classes $C_1, C_2, \ldots, C_k$ with a priori probabilities $P_1, P_2, \ldots, P_k$ and class conditional densities $p_1(x), p_2(x), \ldots, p_k(x)$. Let the mixture density be

$$p(x) = \sum_{i=1}^{k} P_i\, p_i(x). \qquad (4)$$

Let $X_1, X_2, \ldots, X_n, \ldots$ be independent and identically distributed (i.i.d.) N-dimensional random vectors with density p(x). This indicates that there is a probability space $(\Omega, \mathcal{F}, Q)$, where $\mathcal{F}$ is a $\sigma$-field of subsets of $\Omega$ and Q is a probability measure on $\mathcal{F}$, and $X_i: (\Omega, \mathcal{F}, Q) \to (\mathbb{R}^N, B(\mathbb{R}^N), P)$ for all $i = 1, 2, \ldots$, such that

$$P(A) = Q(X_i^{-1}(A)) = \int_A p(x)\, dx \quad \forall A \in B(\mathbb{R}^N) \text{ and } \forall i = 1, 2, \ldots$$

Here $B(\mathbb{R}^N)$ is the Borel $\sigma$-field of $\mathbb{R}^N$. Let

$$\mathcal{E} = \Big\{E: E = (S_1, S_2, \ldots, S_k),\ S_i \subseteq \mathbb{R}^N,\ S_i \neq \emptyset\ \forall i = 1, \ldots, k,\ \bigcup_{i=1}^{k} S_i = \mathbb{R}^N,\ S_i \cap S_j = \emptyset\ \forall i \neq j\Big\}.$$

$\mathcal{E}$ provides the set of all partitions of $\mathbb{R}^N$ into k sets as well as their permutations; i.e., if $E_1 = (S_1, S_2, S_3, \ldots, S_k) \in \mathcal{E}$ and $E_2 = (S_2, S_1, S_3, \ldots, S_k) \in \mathcal{E}$, then $E_1 \neq E_2$. Note that $E = (S_{i_1}, S_{i_2}, \ldots, S_{i_k})$ implies that each $S_{i_j}$, $1 \leq j \leq k$, is the region corresponding to class $C_j$.

Let $E_0 = (S_{01}, S_{02}, \ldots, S_{0k}) \in \mathcal{E}$ be such that each $S_{0i}$ is the region corresponding to class $C_i$ in $\mathbb{R}^N$, these regions being obtained by using the Bayes decision rule. Then

$$a = \sum_{i=1}^{k} P_i \int_{S_{0i}^c} p_i(x)\, dx \ \leq\ \sum_{i=1}^{k} P_i \int_{S_{1i}^c} p_i(x)\, dx \quad \forall E_1 = (S_{11}, S_{12}, \ldots, S_{1k}) \in \mathcal{E}. \qquad (5)$$

Here a is the error probability obtained using the Bayes decision rule. It is known from the literature that such an $E_0$ exists and that it belongs to $\mathcal{E}$, because the Bayes decision rule provides an optimal partition of $\mathbb{R}^N$ and, for every such $E_1 = (S_{11}, S_{12}, \ldots, S_{1k}) \in \mathcal{E}$, $\sum_{i=1}^{k} P_i \int_{S_{1i}^c} p_i(x)\, dx$ provides the error probability for $E_1 \in \mathcal{E}$. Note that $E_0$ need not be unique.
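To make the definition of the Bayes error probability a concrete, consider a small worked example (ours, not from the paper) with k = 2 classes on the real line, equal priors $P_1 = P_2 = 1/2$, and uniform class conditional densities $p_1(x) = \frac{1}{2}\mathbf{1}_{[0,2]}(x)$ and $p_2(x) = \frac{1}{2}\mathbf{1}_{[1,3]}(x)$. The Bayes rule assigns x to class 1 wherever $P_1 p_1(x) > P_2 p_2(x)$; on the overlap [1, 2] the two quantities are equal, so any split point $t \in [1, 2]$ yields an optimal partition $E_0 = (S_{01}, S_{02})$ with $S_{01} = (-\infty, t)$, $S_{02} = [t, \infty)$, and

$$a = P_1 \int_{S_{01}^c} p_1(x)\, dx + P_2 \int_{S_{02}^c} p_2(x)\, dx = \frac{1}{2}\cdot\frac{2-t}{2} + \frac{1}{2}\cdot\frac{t-1}{2} = \frac{1}{4} \quad \text{for every } t \in [1, 2],$$

which also illustrates that the optimal partition $E_0$ need not be unique.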
. . , Slk) E g, ~ = l P i fs c pi(x)dx provides the error proba-li • bility for El E o~. Note that E0 need not be umque. Assumptions 1. Let Ho be a positive integer and let there exist Ho hyperplanes in NN which can provide the regions Sol, S02,..., Sok. Let Ho be known a priori. Let the algorithm for generation of class boundaries using Ho hyperplanes be allowed to be executed for a sufficiently large number of iterations in each temperature value and for sufficiently low temperatures. Let the number of strings be t with misclassification values missl,miss2,...,misst where 0 ~<missl ~<miss2 ~< • • • ~<misst. Let ~'i.j ~1) denote the probability of going from string i to string j in nl steps with the temperature value T. It is known in the literature that for the adopted SA algorithm :__ _(~]) hm/~i4 (T) = p , j ( r ) , nl~oO where pi/(T) = e-mis~jr/~tk= l e -mi~/r. It follows that lim T~0 + p~j(T) = 1 f o r j = 1, =0 forj¢ 1. Thus it is known that using SA technique of(i) making n] ~ c¢ and (ii) making T --+ 0 ÷, one can get the optimal string and its value. Let d = {sO: d is a set consisting of H0 hyperplanes in ~N}. Let Ao E s¢ be such that it provides the regions Sol,S02,... ,S0k in EN i.e., Ao provides the regions which are also obtained using the Bayer decision rule. Note that each A E d generates several elements of g. Let o~ _c g denote all possible E = (SI, $ 2 , . . . , Sk) E d~ that can be generated from A. Let G = ~JAE,~'gA. Let 1 Z,E(to) = if Xi(to) is misclassified when E is used as a decision rule where E E G, Vto E f2. 0 otherwise. n Let f,e(to) = ( l / n ) ~i=1Zi~(to), when E ~ G is used as a decision rule. Let f,(to) = Inf{f,E(co): E E G}. It is to be noted that the pattern classification algorithm mentioned in Section 2 uses n x fnE(to), the total number of misclassified samples, as the objective function which it attempts to minimize. This is equivalent to searching for a suitable E E G such that the term fne(co) is minimized, i.e., for which fnE (to) = f , (to). As already mentioned, it is known that for infinitely many iterations the Elitist model of GAs will certainly be able to obtain such an E. Theorem 2. For sufficiently large n, fn(to)~ a, (i.e., for sufficiently large n, fn(to) cannot be greater than a) almost everywhere. 174 S. Bandyopadhyay et al. / Journal of lnformation Sciences 109 (1998) 165-184 ProoL Let ~(co) = 1 if X,(co) is misclassified according to Bayes rule Vco E ~2. = 0 otherwise. Note that Y~, Y2,. ., Y~,... are i.i.d random variables. Now k Prob(Y~ = 1) = ~--]Prob(Y/, = l/X,. is in CflP(Xi is in C/) j=l k = ff-2P/Prob(co: )(/(co) E S~/given that co E C/) j=l k Hence the expectation of I6,.,E(Y,) is given by E(Y,) = a Vi. Then by using Strong Law of Large Numbers [1 1], (1/n) ~i~l Y~~ a almost everywhere, i.e., P co: - ~(co) -+ a = O. n i~l Let B--- co: Y~.(o))~a C_f2. Then Q(B) = 1. Note that f,(co) ~< (l/n) Z i n l Yi((D), Vn and Vco, since the set of regions (S01, S0e. . . . , SoD obtained by the Bayes decision rule is also provided by some A E d and consequently it will be included in G. Note that 0 ,,<f,(co) ~< 1, Vn and Vco. Let coEB. For every o) EB, U(o))= {fn(co);n---1,2,...} is a bounded, infinite set. Then by Bolzano-Weierstrass theorem [12], there exists an accumulation point of U(co). Let y = Sup{y0:y0 is an accumulation point of U(co)}. From elementary mathematical analysis we can conclude that y ~<a, since (l/n) ~i~l Y/(co) ~ a almost everywhere and fn(co) ~<(l/n) ~,~t Y,.(co). 
It is to be mentioned that the theorem proved above indicates that, as the size of the training data set is increased, the performance of the SA classifier approaches that of the Bayes classifier. The fact that $f_n(\omega) < a$ can hold only for a finite number of sample points, since many distributions can generate these points. However, as the size of the data set goes to infinity, only one distribution can possibly generate all the points [13]. Also, since we know that the Bayes classifier is the optimal one in a statistical framework, and there can be no better classifier, the above mentioned claim (that $f_n(\omega) \leq a$) can only indicate that $f_n(\omega) = a$; or, in other words, the performance of the SA classifier will tend to that of the Bayes classifier in the limiting case. This, in turn, indicates that under limiting conditions the boundary provided by the SA classifier will approach the Bayes boundary. This is experimentally demonstrated in Section 5.4.

Note: The term 'sufficiently large' is borrowed from statistics books and indicates the mathematical condition '$n \to \infty$'.

5. Implementation and results

The data sets used for demonstrating the effectiveness of the SA classifier are the following.

ADS 1: This two-dimensional artificial data set (Fig. 3) consists of 557 data points belonging to two classes. It is evident that the classes, which are separable, have a non-linear class boundary.

Vowel data: This real life speech data consists of 871 Indian Telugu vowel sounds in six classes [14]. It has three features corresponding to the first, second and third formant frequencies. Fig. 4 shows the data set in the first and second formant frequency plane.

Iris data: This four-dimensional data set for a specific category of irises has 150 points in three classes [15]. The features correspond to the sepal width and length and the petal width and length in centimeters.

Data Set 1: This two-dimensional data set, used for verifying the theoretical result in Section 4, is generated using a triangular distribution for the two classes, 1 and 2. The range for class 1 is [0, 2] x [0, 2] and that for class 2 is [1, 3] x [0, 2], with the corresponding peaks at (1, 1) and (2, 1). If $P_1$ is the a priori probability of class 1, then using elementary mathematics we can show that the Bayes classifier will classify a point to class 1 if its X coordinate is less than $1 + P_1$. This indicates that the Bayes decision boundary is given by

$$x = 1 + P_1. \qquad (6)$$
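For completeness, here is a sketch (ours, not from the paper) of the 'elementary mathematics' behind Eq. (6). Assume the two coordinates of Data Set 1 are independent and the second coordinate has the same distribution in both classes, so that it cancels from the likelihood ratio, and let the first-coordinate marginals be the unit-area triangular densities

$$p_1(x) = \begin{cases} x, & 0 \leq x \leq 1,\\ 2 - x, & 1 \leq x \leq 2,\end{cases} \qquad p_2(x) = \begin{cases} x - 1, & 1 \leq x \leq 2,\\ 3 - x, & 2 \leq x \leq 3.\end{cases}$$

In the overlap region $1 \leq x \leq 2$, the Bayes rule assigns class 1 whenever $P_1 p_1(x) \geq P_2 p_2(x)$, i.e., $P_1(2 - x) \geq (1 - P_1)(x - 1)$, which simplifies to $x \leq 1 + P_1$; outside the overlap the choice is forced. This gives the decision boundary $x = 1 + P_1$ of Eq. (6).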
5.1. Performance of the SA classifier

The parameters of SA are as follows: Tmax = 100, Tmin = 0.01, r = 0.9, k = 100. Accordingly, the maximum number of iterations is 8800 (88 temperature values, each with k = 100 iterations). In order to generate a new string, one randomly chosen bit is flipped. The results shown are the average values of five different runs of the algorithm.

Fig. 3. Artificial data set ADS 1 (the two classes, labelled 1 and 2, plotted in the X-Y feature plane; X ranges roughly from 800 to 2750 and Y from 300 to 825).

Fig. 4. Real life speech data, Vowel data, in the first and second formant frequency plane (F2 in Hz along the abscissa; symbol size indicates the frequency of occurrences).

Table 1 shows the overall training performance of the SA classifier for the data sets ADS 1, Vowel and Iris, using five values of H, when 10% of the data set is used for training. As expected, the training score generally improves to a maximum of 100% as the number of hyperplanes is increased, since more hyperplanes can readily fit the training data set to reduce the number of misclassified points. Note that, because of the considerable amount of overlap in the Vowel data, even H = 8 could not provide zero misclassification.

Table 1
Performance during training of SA classifier for different values of H using 10% training data

                Recognition score (%)
Data set    H=3      H=4      H=5      H=6      H=8
ADS 1       94.54    98.18    100.0    100.0    100.0
Vowel       52.94    74.71    95.29    96.65    95.29
Iris        100.0    100.0    100.0    100.0    100.0

Tables 2 and 3 show the test results of the SA classifier for these three data sets, for five values of H, when 10% and 30% of the data set are used for training while the remaining 90% and 70% of the data are used for testing, respectively. Unlike the training performance, the test recognition score improves initially as H is increased up to a specific value, beyond which the score decreases. For example, consider H = 6 and H = 8 of Table 2 for ADS 1, where the score decreases during testing although it remained constant (at 100%) during training (Table 1). This indicates that H = 8 leads to overfitting of the classes during training, thereby reducing the generalization capability of the classifier during testing. Similar is the case for H = 6 and 8 for ADS 1 in Table 3. As expected, the overall recognition capability of the classifier increases when the size of the training data set is increased from 10% in Table 2 to 30% in Table 3.

Table 2
Performance during testing of SA classifier for different values of H for 10% training and 90% test data

                Recognition score (%)
Data set    H=3      H=4      H=5      H=6      H=8
ADS 1       91.63    92.23    93.02    93.02    88.64
Vowel       63.35    65.60    76.84    74.55    70.73
Iris        89.63    93.33    93.33    93.33    77.78

Table 3
Performance during testing of SA classifier for different values of H for 30% training and 70% test data

                Recognition score (%)
Data set    H=3      H=4      H=5      H=6      H=8
ADS 1       91.28    96.92    98.72    96.41    96.20
Vowel       65.60    67.48    75.98    75.00    79.90
Iris        93.33    95.23    94.28    91.42    94.28
5.2. Replacing simulated annealing with a genetic algorithm

The Genetic Algorithm (GA) [6] is another evolutionary search paradigm, based on the principles of natural genetic systems and survival of the fittest. Like SA, GAs also generally work with a binary string encoding of the parameters of the search problem. Instead of dealing with a single string or chromosome, a GA operates on a number of strings, termed the population. A fitness value, which is maximized, is associated with each string and represents its degree of goodness. Several biologically inspired operators like selection, crossover and mutation are applied iteratively over a number of generations to generate potentially better solutions. Termination is achieved when either a maximum number of iterations has been executed or a user specified criterion is satisfied. Details of the method can be found in [4,6].

The fitness computation method is the same as the process of calculating the energy associated with a string (see Section 3.2). A roulette wheel selection strategy, single point crossover with probability 0.8 and bitwise mutation with a variable mutation probability in the range [0.015, 0.333] [5], for a population size of 20, are chosen for the GA. The maximum number of iterations is fixed at 1500.

The comparative performance (in terms of both the test score and the number of iterations required for attaining zero misclassification during training) of SA and GA for the classification problem is presented in Table 4, when 10% of the data are considered for training and the remaining 90% for testing. An entry '-' in the iter. field indicates that zero misclassification could not be achieved even after the maximum number of iterations was executed.

Table 4
Comparative performance of SA and GA for classification for H = 6

                GA                       SA
Data set    iter.     score          iter.     score
ADS 1       512       93.22          5815      93.02
Vowel       -         71.99          -         74.55
Iris        22        93.33          97        93.33

As is evident from Table 4, the test recognition scores of the GA and SA based classifiers are comparable. Although the number of iterations required to attain zero misclassification for GA is less than that for SA, the number of string evaluations is much larger, since one iteration of GA corresponds to a maximum of 20 strings (the size of the population). On the other hand, exactly one new string is evaluated in each iteration of SA. On this count, GA requires at most 10 240 and 440 string evaluations for ADS 1 and Iris respectively, which is significantly more than that required in SA. However, one must note that of the 10 240 (or 440) strings evaluated by GA for ADS 1 (or Iris) there will be many replications. In fact, only a relatively small fraction of the strings will be unique.
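For readers who want a concrete picture of the GA variant used for comparison, the following is a generic generational GA sketch with roulette wheel selection, single point crossover with probability 0.8 and bitwise mutation. It is our illustration, not the authors' implementation; in particular the paper uses a variable mutation probability in [0.015, 0.333], which is simplified here to a constant, and the fitness is taken as 1/(1 + miss) so that larger values mean fewer misclassifications:

    import random

    def genetic_algorithm(miss, length, pop_size=20, p_cross=0.8, p_mut=0.05,
                          generations=1500, seed=None):
        """Generic generational GA on bit strings: roulette wheel selection, single point
        crossover, bitwise mutation.  `miss(bits)` is the misclassification count; the
        fitness 1/(1 + miss) is positive and larger for fewer errors."""
        rng = random.Random(seed)
        pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

        def fitness(ind):
            return 1.0 / (1.0 + miss(ind))

        def roulette(scores):
            total = sum(scores)
            pick, acc = rng.uniform(0, total), 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= pick:
                    return ind
            return pop[-1]

        best = min(pop, key=miss)
        for _ in range(generations):
            scores = [fitness(ind) for ind in pop]
            new_pop = []
            while len(new_pop) < pop_size:
                a, b = roulette(scores)[:], roulette(scores)[:]
                if rng.random() < p_cross:          # single point crossover
                    cut = rng.randrange(1, length)
                    a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
                for ind in (a, b):
                    for i in range(length):         # bitwise mutation
                        if rng.random() < p_mut:
                            ind[i] = 1 - ind[i]
                    new_pop.append(ind)
            pop = new_pop[:pop_size]
            best = min(pop + [best], key=miss)
            if miss(best) == 0:                     # stop once training error reaches zero
                break
        return best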
5.3. Comparison with other classifiers

The performance of the SA classifier is compared with the Bayes maximum likelihood classifier, the Multilayered Perceptron (MLP) and the k-NN rule. Both the MLP (with hard delimiters) and the k-NN rule are known to provide piecewise linear boundaries, which is the underlying philosophy of the SA classifier. The k-NN algorithm is executed taking k equal to $\sqrt{n}$, where n is the number of training data points. It can be proved that for such a form of k, the error probability of the k-NN rule approaches the Bayes error probability.

For the Bayes maximum likelihood classifier, unequal dispersion matrices and unequal a priori probabilities ($= n_i/n$ for $n_i$ patterns from class i) are considered. In each case, we assume a multivariate normal distribution of the samples. For the MLP, the learning rate and momentum factor are 0.9 and 0.1, respectively. Online connection weight updating, i.e., updating after the presentation of each training data point, is performed. A maximum of 10 000 iterations is allowed. The network architectures for the ADS 1, Vowel and Iris data sets are 2-5-2, 3-8-6 and 4-5-3, respectively, where the first and last numbers represent the number of nodes in the input and output layers, and the intermediate number(s) represent the number of nodes in the hidden layer(s).

The results in Table 5 show that the SA classifier provides superior performance to all the other classifiers for both ADS 1 (where k-NN is known to perform well) and Iris. For the Vowel data, the result of the Bayes classifier is the best; in fact, the Bayes classifier is known to perform well for this data [14]. In this case also, the recognition score of the SA classifier is found to be closer to the Bayes score than those of the MLP and k-NN.

Table 5
Comparative test performance with 10% training data (recognition score, %)

Data set    SA classifier (H = 6)    Bayes max. like. class.    MLP      k-NN (k = sqrt(n))
ADS 1       93.02                    85.65                      82.47    90.23
Vowel       74.55                    77.73                      60.30    70.35
Iris        93.33                    83.22                      74.81    90.37

5.4. Empirical verification of the theoretical result

As a consequence of the theorem in Section 4, the boundary provided by the SA classifier approaches the Bayes boundary under limiting conditions. Fig. 5(a)-(c) demonstrates that this is indeed the case for Data Set 1. The Bayes boundary is the straight line x = 1.4. Figs. 5(a)-(c) show the SA lines (marked with an arrow) obtained for n = 100, 1000 and 4000, respectively. Only 100 data points are plotted in the figures for clarity.

Fig. 5. (a) Data Set 1 for n = 100 and the boundary provided by the SA classifier (marked with an arrow) along with the Bayes decision boundary. (b) Data Set 1 for n = 1000 and the boundary provided by the SA classifier (marked with an arrow) along with the Bayes decision boundary. (c) Data Set 1 for n = 4000 and the boundary provided by the SA classifier (marked with an arrow) along with the Bayes decision boundary (class 1 is represented by '+' and class 2 by 'o').

It is obvious from the figures that as n increases from 100 to 4000, the SA line approaches the Bayes line, so much so that for n = 4000 they lie very close to each other.
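The following Python sketch (ours) generates data resembling Data Set 1 and checks the boundary of Eq. (6) empirically; it assumes the two coordinates are drawn independently from triangular distributions with the stated ranges and peaks, which is our reading of the description in Section 5:

    import random

    def generate_data_set_1(n: int, p1: float = 0.4, seed: int = 0):
        """Sketch of Data Set 1 (our reading of Section 5): each coordinate is drawn
        independently from a triangular distribution, class 1 on [0,2]x[0,2] with peak
        (1,1) and class 2 on [1,3]x[0,2] with peak (2,1); p1 is the prior of class 1."""
        rng = random.Random(seed)
        points, labels = [], []
        for _ in range(n):
            if rng.random() < p1:
                x, y, c = rng.triangular(0, 2, 1), rng.triangular(0, 2, 1), 1
            else:
                x, y, c = rng.triangular(1, 3, 2), rng.triangular(0, 2, 1), 2
            points.append((x, y))
            labels.append(c)
        return points, labels

    # Empirical error rate of the Bayes rule x < 1 + P1 (Eq. (6)) on a large sample:
    points, labels = generate_data_set_1(n=100000, p1=0.4)
    boundary = 1 + 0.4
    errors = sum((x < boundary) != (c == 1) for (x, _), c in zip(points, labels))
    print(errors / len(points))   # empirical error of the Bayes boundary x = 1.4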
6. Discussion and conclusions

A pattern classification methodology in $\mathbb{R}^N$, using simulated annealing for the search and placement of a number of hyperplanes in order to approximate the class boundaries of a given training data set, has been described. An extensive comparison of the methodology with other classifiers, namely the Bayes classifier (which is well known for discriminating overlapping classes), and the k-NN classifier and MLP (which are well known for discriminating non-overlapping, non-linear regions by generating piecewise linear boundaries), is also presented. The results of the proposed algorithm are seen to be comparable to, and sometimes better than, theirs in discriminating both overlapping and non-overlapping, non-convex regions. A distinguishing feature of this approach is that the boundaries (approximated by piecewise linear segments) are generated explicitly for making decisions. This is unlike the conventional methods or the MLP based approaches, where the generation of boundaries is a consequence of the respective decision making processes.

A theoretical analysis of the aforesaid classifier establishes that, under the limiting condition of infinitely large training data sets, the error rate of the SA classifier during training is less than or equal to that of the Bayes classifier. This, in turn, indicates that when the size of the training data set goes to infinity, the boundary provided by the SA classifier approaches the Bayes boundary. This finding is also experimentally verified for a data set, generated using a triangular distribution, for which the Bayes boundary is known exactly.

A comparison of SA with GA for this classification problem shows that both perform comparably in terms of the test recognition scores. This is expected, since both are stochastic optimization techniques working on the same principle of approximating the class boundaries using a number of hyperplanes. In terms of the string evaluations required to obtain the optimal performance, SA appears to score over GA. However, one must note that it is very difficult to obtain the actual number of distinct string evaluations in GA, since strings are often replicated. The actual number of distinct evaluations will, in fact, be a small fraction of the quantity (population size x number of iterations).

Although SA is found to perform comparably to GA, there appear to be several factors contributing to the predominance of GAs in the literature. In SA, two main control parameters are to be selected appropriately in order to obtain good performance. These are the values of r (which controls the sequence of T) and k (the number of iterations executed at each temperature). On the other hand, in GA, only the maximum number of iterations (or stopping time) must be appropriately selected. Other than this, both methods need proper tuning of several other parameters, e.g., Tmax and Tmin in SA, and the probabilities for crossover and mutation in GA. Additionally, in the advanced stages of the SA algorithm, the temperature values should be smaller than the smallest difference of the energy values in order to provide good performance. Since for the pattern classification problem this value is 1 (the minimum non-zero difference of the number of misclassified points) and Tmin = 0.01, this requirement is met. GA, with roulette wheel selection, is, on the other hand, immune to this difference. Finally, since SA is inherently sequential in nature, not much improvement can be derived on parallel computing platforms, while there is scope for such improvement in GA. One must note that very basic versions of both SA and GA are used here.
Use of enhanced models and improved operators for both SA and GA may provide better performance. For example, in the case of SA, other cooling schedules [16,17] may be used. Similarly, modified versions of GA, like the genetic algorithm with chromosome differentiation (GACD), may be applied, which has been found to improve the classification performance [18].

Acknowledgements

This work was carried out when Ms. Sanghamitra Bandyopadhyay held a fellowship awarded by the Department of Atomic Energy, Govt. of India.

References

[1] P.J.M. van Laarhoven, E.H.L. Aarts, Simulated Annealing: Theory and Applications, Reidel, Dordrecht, 1987.
[2] S. Kirkpatrick, C.D. Gelatt Jr., M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671-680.
[3] L. Davis (Ed.), Genetic Algorithms and Simulated Annealing, Pitman Publishing, London, 1987.
[4] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, Berlin, 1992.
[5] S. Bandyopadhyay, C.A. Murthy, S.K. Pal, Pattern classification using genetic algorithms, Pattern Recognition Letters 16 (8) (1995) 801-808.
[6] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[7] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, MA, 1974.
[8] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, 1991.
[9] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. 21 (1953) 1087-1091.
[10] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. PAMI-6 (1984) 721-741.
[11] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[12] T.M. Apostol, Mathematical Analysis, Narosa, New Delhi, 1985.
[13] R.B. Ash, Real Analysis and Probability, Academic Press, New York, 1972.
[14] S.K. Pal, S. Mitra, Multilayer perceptron, fuzzy sets and classification, IEEE Trans. Neural Networks 3 (5) (1992) 683-697.
[15] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179-188.
[16] S. Shinomoto, Y. Kabashima, Finite time scaling of energy in simulated annealing, J. Phys. A 24 (1991) L141-L144.
[17] E.H.L. Aarts, P.J.M. van Laarhoven, A pedestrian review of the theory and applications of the simulated annealing algorithm, in: J. van Hemmen, I. Morgenstern (Eds.), Heidelberg Colloquium on Glassy Dynamics, Springer, Berlin, 1987, pp. 288-307.
[18] S. Bandyopadhyay, S.K. Pal, Pattern classification with genetic algorithms: Incorporation of chromosome differentiation, Pattern Recognition Letters 18 (1997) 119-131.