Combining Classifiers by Constructive Induction
João Gama
LIACC, FEP - University of Porto
Rua Campo Alegre, 823
4150 Porto, Portugal
Phone: (+351) 2 678830 Fax: (+351) 2 6003654
Email: [email protected]
WWW: http://www.up.pt/liacc/ML
Abstract. Using multiple classifiers to increase learning accuracy is an active research area. In this paper we present a new general method for merging classifiers. The basic idea of Cascade Generalization is to run the set of classifiers sequentially, at each step performing an extension of the original data set by adding new attributes. The new attributes are derived from the probability class distribution given by a base classifier. This constructive step extends the representational language of the high-level classifiers, relaxing their bias. Cascade Generalization produces a single but structured model for the data that combines the model class representation of the base classifiers. We have performed an empirical evaluation of Cascade compositions of three well-known classifiers: Naive Bayes, Linear Discriminant, and C4.5. Composite models show an increase in performance, sometimes impressive, when compared with the corresponding single models, with significant statistical confidence levels.
1 Introduction
Given a learning task, which algorithm should we use? Previous empirical studies have shown that there is no overall best algorithm. The ability of a chosen algorithm to induce a good generalization depends on how appropriate the class model underlying the algorithm is for the given task. An algorithm's class model is the representation language it uses to express a generalization of the examples. The representation language of a standard decision tree is the DNF formalism, which splits the instance space by axis-parallel hyper-planes, while the representation language of a linear discriminant function is a set of linear functions that split the instance space by oblique hyper-planes.
In statistics, Henery [12] refers to Rescaling as a method used when some classes are over-predicted, leading to a bias. Rescaling consists of applying the algorithms in sequence, the output of one algorithm being used as input to another algorithm. The aim is to use the estimated probabilities W_i = P(C_i | X) derived from a learning algorithm as input to a second learning algorithm, the purpose of which is to produce an unbiased estimate Q(C_i | W) of the conditional probability for class C_i.
Since different learning algorithms employ different knowledge representations and search heuristics, different search spaces are explored and diverse results are obtained. The problem of finding the appropriate bias for a given task is an active research area. We can consider two main lines: on one side, methods that select the most appropriate algorithm for the given task, for example Schaffer's selection by Cross-Validation, and on the other side, methods that combine the predictions of different algorithms, for example Stacked Generalization [21].
The work that we present here follows the second research line. Instead of looking for methods that fit the data using a single representation language, we present a family of algorithms, under the generic name of Cascade Generalization, whose search space contains models that use different representation languages. Cascade Generalization performs an iterative composition of classifiers. At each iteration a classifier is generated. The input space is extended by the addition of new attributes. These new attributes are obtained in the form of a probability class distribution given, for each example, by the generated base classifier. The language of the final classifier is the language used by the high-level generalizer, but it uses terms that are expressions from the language of the low-level classifiers. In this sense, Cascade Generalization generates a unified theory from the base theories. The experimental work shows that this methodology usually improves the accuracy with significant statistical levels.
The next section of the paper presents the framework of Cascade Generalization. In Section 3, we present an illustrative example. In Section 4 we review previous work in the area of multiple models. In Section 5, we perform an empirical study using UCI data sets. The last section presents an analysis of the results and concludes the paper.
2 Cascade Generalization
Consider a learning set D = {(x_n, Y_n)}, n = 1,...,N, where x_n = [x_1,...,x_m] is a multidimensional input vector and Y_n is the output variable. Since the focus of this paper is on classification problems, Y_n takes values from a set of predefined values, that is, Y_n ∈ {Cl_1,...,Cl_c}, where c is the number of classes. A classifier ℑ is a function that is applied to the training set in order to construct a predictor ℑ(x, D) of the y values. This is the traditional framework for classification tasks. Nevertheless, our framework requires that the predictor ℑ(x, D) outputs a vector of conditional probabilities [p_1,...,p_c], where p_i represents the probability that the example x belongs to class i, that is, P(y = Cl_i | x). The class assigned to the example x is the one that maximizes this last expression.
Most of the commonly used classifiers, such as Naive Bayes and the Linear Discriminant, classify each example in this way. Other classifiers, for example C4.5, have a different strategy for classifying an example, but only small changes are required in order to obtain a probability class distribution.
We define a constructive operator Φ(D', ℑ(x, D)). This operator has two input parameters: a data set D' and a classifier ℑ(x, D). The classifier ℑ generates a theory from the training data D. For each example x ∈ D', the generated theory outputs a probability class distribution. The operator Φ concatenates the input vector x with the output probability class distribution. The output of Φ(D', ℑ(x, D)) is a new data set D''. The cardinality of D'' is equal to the cardinality of D' (they have the same number of examples). Each example x ∈ D'' has an equivalent example in D', but augmented with c new attributes. The new attributes are the elements of the vector of probability class distribution obtained when applying the classifier ℑ(x, D) to the example x.
Cascade Generalization is a sequential composition of classifiers that at each generalization level applies the Φ operator. Given a training set L, a test set T, and two classifiers ℑ1 and ℑ2, Cascade Generalization proceeds as follows. Using classifier ℑ1, generate the Level1 data:

Level1_train = Φ(L, ℑ1(x, L))
Level1_test = Φ(T, ℑ1(x, L))

Classifier ℑ2 learns on the Level1 training data and classifies the Level1 test data:

ℑ2(x, Level1_train) for each x ∈ Level1_test

These steps perform the basic sequence of a cascade generalization of classifier ℑ2 after classifier ℑ1. We represent the basic sequence by the symbol ∇. The previous composition can be written compactly as:

ℑ2∇ℑ1 = ℑ2(x, Φ(L, ℑ1(x', L))) for each x ∈ Φ(T, ℑ1(x'', L))
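The following sketch illustrates the Φ operator and the basic sequence in Python, assuming scikit-learn-style estimators that expose fit and predict_proba; the names phi and cascade are illustrative and are not part of the paper.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def phi(X_prime, fitted_base):
    # Phi(D', classifier): extend every example in D' with the class
    # probability distribution predicted by the already fitted base classifier.
    return np.hstack([X_prime, fitted_base.predict_proba(X_prime)])

def cascade(clf1, clf2, X_train, y_train, X_test):
    # Basic sequence clf2 after clf1: clf1 builds the Level1 data,
    # clf2 learns and predicts on the extended representation.
    clf1.fit(X_train, y_train)
    level1_train = phi(X_train, clf1)
    level1_test = phi(X_test, clf1)
    clf2.fit(level1_train, y_train)
    return clf2.predict(level1_test)

For instance, cascade(GaussianNB(), DecisionTreeClassifier(), X_train, y_train, X_test) approximates the composition C4.5∇Naive Bayes used later in the paper, with scikit-learn estimators as stand-ins for the original implementations.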
This is the simplest formulation of Cascade Generalization. Possible extensions include the composition of n classifiers and the parallel composition of classifiers.
A composition of n classifiers is represented by:

ℑn∇ℑn−1∇ℑn−2∇...∇ℑ1

In this case, Cascade Generalization generates n−1 levels of data. The high-level theory is the one given by the classifier ℑn.
A variant of Cascade Generalization, which includes several algorithms in parallel, can be represented in this formalism as:

ℑn+1∇[ℑ1,...,ℑn] = ℑn+1(x, Φ(L, [ℑ1(x', L),...,ℑn(x', L)])) for each x ∈ Φ(T, [ℑ1(x', L),...,ℑn(x', L)])

The algorithms ℑ1,...,ℑn run in parallel. The operator Φ(L, [ℑ1(x', L),...,ℑn(x', L)]) returns a new data set L', which contains the same number of examples as L. Each example in L' contains n × c new attributes, where c is the number of classes. Each algorithm in the set [ℑ1,...,ℑn] contributes c new attributes.
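Under the same assumptions as the sketch above, the parallel variant only changes the constructive operator: the probability vectors of the n level-0 classifiers are concatenated before the high-level learner is applied. The name phi_parallel is again illustrative.

def phi_parallel(X_prime, fitted_bases):
    # Each of the n fitted classifiers contributes c new attributes,
    # giving n * c constructed attributes in total.
    extra = [clf.predict_proba(X_prime) for clf in fitted_bases]
    return np.hstack([X_prime] + extra)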
3 An Illustrative Example
In this example we consider the UCI data set Monks-2 [20]. The Monk's problems are an artificial robot domain, well known in the Machine Learning community. The robots are described by six different attributes and classified into one of two classes. We have chosen the Monks-2 problem because it is known to be a difficult task for systems that learn decision trees in an attribute-value logic formalism. The decision rule for the problem is: "The robot is O.K. if exactly two of the six attributes have their first value". This problem is similar to parity problems. It combines different attributes in a way which makes it complicated to describe in DNF or CNF using the given attributes only.
Using ten-fold Cross Validation, the error rate of C4.5 is 32.9%, and that of Naive Bayes is 49.5%. The composite model C4.5 after Naive Bayes, C4.5∇Naive Bayes, operates as follows: the Level1 data was generated using Naive Bayes as the classifier, and C4.5 was then applied to the Level1 data. The composition C4.5∇Naive Bayes obtains an error rate of 17.8%, which is substantially lower than the error rates of both C4.5 and Naive Bayes. Neither of the algorithms in isolation can capture the underlying structure of the data. In this case, Cascade was able to achieve a notable increase in performance. Figure 1 presents one of the trees generated by C4.5∇Naive Bayes.

[Fig. 1. Tree generated by C4.5∇Naive Bayes]

The tree contains a mixture of the original attributes (a3, a6) and the new attributes constructed by Naive Bayes (p0). At the root of the tree appears the attribute p0. This attribute is the conditional probability p(Class = False | x) given by Naive Bayes. The classification rule used by Naive Bayes is: choose the Class_i that maximizes p(Class_i | x). The decision tree generated by C4.5 uses the constructed attributes given by Naive Bayes, but redefines different decision surfaces. Because this is a two-class problem, the Bayes rule uses p0 with threshold 0.5, while the decision tree chose the threshold 0.6. These decision nodes are a kind of function given by the Bayes strategy. For example, the attribute p0 can be seen as a function that computes p(Class = False | x) using the Bayes theorem. The decision tree performs a sequence of tests based on the conditional probabilities given by the Bayes theorem. In a certain sense, this decision tree combines both representation languages: Bayes and Trees. The constructive step performed by Cascade inserts new axes that incorporate knowledge provided by Naive Bayes. It is this new knowledge that allows the significant increase in performance verified with the Decision Tree, despite the limitations of Naive Bayes in fitting complex spaces. It is this kind of synergy between classifiers that Cascade Generalization explores.
4 Related Work
We can analyze previous work in the area of multiple models along two dimensions. One dimension is related to the different methods used for combining classifications. The other dimension is related to the methods used for generating different models.
4.1 Combining Classifications
Combining classifications usually occurs at classification time. We can consider two main lines of research. One group includes methods where all base classifiers are consulted in order to classify a query example; the other, methods that characterize the area of expertise of the base classifiers and, for a query point, only ask the opinion of the experts. Voting is the most common method used to combine classifiers. As pointed out by Ali and Pazzani [1], this strategy is motivated by Bayesian learning theory, which stipulates that, in order to maximize predictive accuracy, instead of using just a single learning model one should ideally use all hypotheses (models) in the hypothesis space. The vote of each hypothesis should be weighted by the posterior probability of that hypothesis given the training data. Several variants of the voting method can be found in the machine learning literature: from uniform voting, where the opinions of all base classifiers contribute to the final classification with the same strength, to weighted voting, where each base classifier has an associated weight, which may change over time, and which strengthens the classification given by that classifier.
4.2 Generating Different Models
Buntine's Ph.D. thesis [5] refers to at least two different ways of generating multiple classifiers. The first one involves a single tree that is generated from the training set and then pruned back in different ways. The second method is referred to as Option Trees. These trees are in effect an ensemble of trees. Each decision node contains not only a univariate test, but also stores information about other promising tests. When using an Option Tree as a classifier the different options are consulted and the final classification is given by voting. He shows that, if the goal is to obtain an increase in performance, the second method outperforms the first, basically because it produces different syntactic models. Breiman [2] proposes Bagging, which produces replications of the training set by sampling with replacement, as sketched below. Each replication of the training set has the same size as the original data, but some examples do not appear in it, while others may appear more than once. From each replication of the training set a classifier is generated. All classifiers are used to classify each example in the test set, usually using a uniform voting scheme.
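A minimal sketch of Bagging under assumed conveniences that do not come from the paper (scikit-learn decision trees as the base learner, integer-encoded class labels, and a fixed random seed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_replicates=10):
    # Each replicate is a bootstrap sample with the same size as the
    # training set; predictions are combined by uniform voting.
    rng = np.random.default_rng(0)
    votes = []
    for _ in range(n_replicates):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    votes = np.vstack(votes)
    # Majority vote per test example (labels assumed non-negative integers).
    return np.array([np.bincount(col).argmax() for col in votes.T])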
The Boosting algorithm of Freund and Schapire [9] maintains a weight for each example in the training set that reflects its importance. Adjusting the weights causes the learner to focus on different examples, leading to different classifiers. Boosting is an iterative algorithm. At each iteration the weights are adjusted in order to reflect the performance of the corresponding classifier. The weight of the misclassified examples is increased. The final classifier aggregates the classifiers learned at each iteration by weighted voting. The weight of each classifier is a function of its accuracy.
Wolpert [21] proposes Stacked Generalization, a technique that uses learning at two levels. A learning algorithm is used to determine how the outputs of the base classifiers should be combined. The original data set constitutes the level-zero data. All the base classifiers run at this level. The level-one data are the outputs of the base classifiers. Another learning process occurs using the level-one data as input and the final classification as output. This is a more sophisticated form of cross-validation that can reduce the error due to bias.
Brodley [4] presents MCS, a hybrid algorithm that combines, in a single tree, nodes that are univariate tests, multivariate tests generated by linear machines, and instance-based learners. At each node MCS uses a set of If-Then rules in order to perform a hill-climbing search for the best hypothesis space and search bias for the given partition of the dataset. The set of rules incorporates knowledge from domain experts. Gama [10, 11] presents Ltree, also a hybrid algorithm, which combines a decision tree with a linear discriminant by means of constructive induction.
Chan and Stolfo [6] present two schemes for classifier combination: arbiter and combiner. Both schemes are based on meta-learning, where a meta-classifier is generated from training data built from the predictions of the base classifiers. An arbiter is also a classifier and is used to arbitrate among the predictions generated by the different base classifiers. Later [7], they extended this framework using arbiters/combiners in a hierarchical fashion, generating arbiter/combiner binary trees.
4.3 Discussion
Reported results for Boosting and Bagging are quite impressive. Using 10 iterations (that is, generating 10 classifiers), Quinlan [16] reports reductions of the error rate between 10% and 19%. Quinlan argues that these techniques are mainly applicable to unstable classifiers. Both techniques require that the learning system not be stable, in order to obtain different classifiers when there are small changes in the training set.
Under a bias-variance decomposition of the error of a classifier [13], the reduction of the error observed when using Boosting or Bagging is mainly due to the reduction in variance. Ali and Pazzani [1] note that "the number of training examples needed by Boosting increases as a function of the accuracy of the learned model. Boosting could not be used to learn many models on the modest training set sizes used in this paper".
Wolpert [21] states that successful implementation of Stacked Generalization for classification tasks is a "black art", and the conditions under which Stacking works are still unknown. Recently, Ting and Witten [18] have shown that successful stacked generalization requires the use of output class distributions rather than class predictions. In their experiments, only the MLR algorithm (a linear discriminant) was suitable as the level-1 generalizer.
Tumer and Ghosh [19] present analytical results showing that the combined error rate depends on the error rates of the individual classifiers and the correlation among them. This was confirmed in the empirical study presented in [1].
The main point of Cascade Generalization is its ability to merge different models. As such, we get a single model whose components are terms of the base models' languages. The bias restriction imposed by using single model classes is relaxed in the directions given by the base classifiers. Cascade Generalization gives a single structured model for the data, and this is a strong advantage over the methods that combine classifiers by voting. Another advantage of Cascade Generalization is related to the use of probability class distributions. The usual learning algorithms produced by the Machine Learning community use categories when classifying examples. Combining classifiers by means of categorical classes loses the strength of each classifier in its prediction. The use of probability class distributions allows us to exploit that information.
5 Experiments
5.1 The Algorithms
Ali and Pazzani [1] and Tumer and Ghosh [19], among other authors, suggest the use of "radically different types of classifiers" as a way to reduce correlated errors. This was the criterion we used to select the algorithms for the experimental work. We use three classifiers: a Naive Bayes, a Linear Discriminant, and a Decision Tree.
Naive Bayes. The Bayesian approach to classifying a new example E is to use Bayes' theorem to compute the probability of each class C_i given the example. The chosen class is the one that maximizes p(C_i | E) = p(C_i) p(E | C_i) / p(E). If the attributes are independent, p(E | C_i) can be decomposed into the product p(v_1 | C_i) × ... × p(v_k | C_i). Domingos and Pazzani [8] show that this procedure has a surprisingly good performance in a wide variety of domains, including many where there are clear dependencies between attributes. The required probabilities are computed from the training set. In the case of nominal attributes we use counts. Continuous attributes were discretized. The number of bins that we use is a function of the number of different values observed in the training set: k = min(10, number of different values). This heuristic was used in [8] and elsewhere with good overall results. Missing values were treated as another possible value for the attribute, both in the training and in the test data. Naive Bayes uses all the attributes in order to classify a query point. Langley [14] notes that Naive Bayes relies on an important assumption: that the variability of the dataset can be summarized by a single probabilistic description, and that this is sufficient to distinguish between classes. From a bias-variance analysis, this implies that Naive Bayes uses a reduced set of models to fit the data. The result is low variance, but if the data cannot be adequately represented by the set of models, we obtain a large bias.
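A minimal sketch of the two estimation details just described: discretization with k = min(10, number of distinct values) and count-based posteriors under the independence assumption. Equal-width bins and the helper names are assumptions, not taken from the paper.

import numpy as np

def discretize(column):
    # k = min(10, number of different values observed in the training set);
    # equal-width binning is assumed here.
    k = min(10, len(np.unique(column)))
    edges = np.linspace(column.min(), column.max(), k + 1)[1:-1]
    return np.digitize(column, edges)

def naive_bayes_posterior(x, priors, cond_tables):
    # p(Ci|E) is proportional to p(Ci) * prod_j p(v_j|Ci); the conditional
    # tables are assumed to hold count-based estimates per attribute and class.
    scores = {c: p * np.prod([cond_tables[j][c][v] for j, v in enumerate(x)])
              for c, p in priors.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}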
Linear Discriminant. A linear discriminant function is a linear composition of the attributes for which the sum of the squared differences between class means is maximal relative to the within-class variance. It is assumed that the attribute vectors for examples of class C_i are independent and follow a certain probability distribution with probability density function f_i. A new point with attribute vector x is then assigned to the class for which the probability density function f_i(x) is maximal. This means that the points of each class are distributed in a cluster centered at μ_i. The boundary separating two classes is a hyper-plane that passes through the mid-point of the two centers. If there are only two classes, one hyper-plane is needed to separate them. In the general case of q classes, q − 1 hyper-planes are needed to separate the classes. By applying the linear discriminant procedure described below, we get q − 1 hyper-planes.
The equation of each hyper-plane is given by [12]:

H_i = α_i + Σ_j β_ij x_j, where α_i = −(1/2) μ_i^T S^{-1} μ_i and β_i = S^{-1} μ_i
We use a Singular Value Decomposition (SVD) to compute S^{-1}. SVD is numerically stable and is a tool for detecting sources of collinearity. This last aspect is used as a method for reducing the features used in each linear combination. Discrim uses all, or almost all, of the attributes in order to classify a query point. Breiman [3] notes that, from a bias-variance analysis, the Linear Discriminant is a stable classifier, although it can fit only a small number of models. It achieves its stability by having a limited set of models to fit the data. The result is low variance, but if the data cannot be adequately represented by the set of models, then the result is a large bias.
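A minimal sketch of these hyper-planes, computing S^{-1} through an SVD-based pseudo-inverse; using the pooled within-class covariance as the estimate of S, and the function names themselves, are assumptions for illustration.

import numpy as np

def linear_discriminant(X, y):
    # One hyper-plane score per class: H_i(x) = alpha_i + beta_i . x,
    # with beta_i = S^-1 mu_i and alpha_i = -0.5 * mu_i^T S^-1 mu_i.
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes) / (len(y) - len(classes))
    S_inv = np.linalg.pinv(pooled)          # SVD-based pseudo-inverse
    betas = means @ S_inv                   # one coefficient vector per class
    alphas = -0.5 * np.einsum('ij,ij->i', betas, means)
    return classes, alphas, betas

def discriminant_predict(x, classes, alphas, betas):
    # Assign x to the class with the highest hyper-plane score.
    return classes[np.argmax(alphas + betas @ x)]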
Decision Tree. We have used C4.5 (release 8) [17]. This is a well-known decision tree generator, widely used by the Machine Learning community. In order to obtain a probability class distribution, we needed to modify C4.5. C4.5 stores a distribution of the examples that fall at each leaf. From this distribution, and using m-estimates [15], we obtain a probability class distribution at each leaf. (In all the experiments reported, m was set to 0.5.) A decision tree uses only a subset of the available attributes in order to classify a query point. Breiman [3], among other researchers, notes that decision trees are unstable classifiers: small variations in the training set can cause large changes in the resulting predictors. These classifiers have high variance, but they can fit any kind of data: the bias of a decision tree is low.
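A minimal sketch of the m-estimate smoothing of the leaf counts, with m = 0.5 as reported; using the training-set class frequencies as the priors is an assumption.

def m_estimate_distribution(leaf_counts, priors, m=0.5):
    # Smooth the class frequencies observed at a leaf:
    # P(Ci | leaf) = (n_i + m * prior_i) / (n + m).
    n = sum(leaf_counts.values())
    return {c: (leaf_counts.get(c, 0) + m * priors[c]) / (n + m)
            for c in priors}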
5.2 The Datasets
We have chosen 17 data sets from the UCI repository. All of them are well known and have previously been used in other comparative studies. In order to evaluate the proposed methodology we performed a 10-fold Cross Validation (CV) on the chosen datasets. Datasets were permuted once before the CV procedure. All algorithms were used with their default settings. In each iteration of CV, all algorithms were trained on the same training partition of the data. Classifiers were also evaluated on the same test partition of the data. Comparisons between algorithms were performed using paired t-tests with the significance level set at 95%.
Table 1 presents the data set characteristics and the error rate and standard deviation of each base classifier. For each algorithm, a + (-) sign in the first column means that the error rate of this algorithm is significantly better (worse) than that of C4.5. These results provide evidence, once more, that no single algorithm is better overall.
Dataset      Classes  Examples  Types              Bayes          Discrim        C4.5
Australian      2        690    7 Nom, 6 Cont      13.8±3.5       14.1±6.0     15.3±6.3
Balance         3        625    4 Cont           - 28.8±6.3     + 13.3±4.4     22.3±5.3
Breast          2        699    9 Cont              2.4±1.9     +  4.1±6.0      6.1±6.1
Diabetes        2        768    8 Cont             25.7±5.5     + 22.7±5.0     24.8±6.6
German          2       1000    24 Cont            27.7±4.4     + 24.0±6.2     29.1±3.7
Glass           6        213    9 Cont           - 41.8±12.0   - 41.3±11.0     32.3±9.6
Heart           2        270    6 Nom, 7 Cont    + 16.7±5.0       16.7±3.6     19.9±7.2
Ionosphere      2        351    33 Cont             9.1±6.3     - 13.4±5.4      9.1±5.8
Iris            3        150    4 Cont              6.0±4.9        2.0±3.2      4.7±4.5
Monks-1         2        432    6 Nom            - 25.0±3.9     - 33.3±11.3     2.3±4.4
Monks-2         2        432    6 Nom            - 49.6±9.0       34.0±5.9     32.9±5.9
Monks-3         2        432    6 Nom            -  2.8±2.4     - 22.5±8.7      0.0±0.0
Satimage        6       6435    36 Cont          - 18.8±1.5     - 16.1±1.5     13.9±1.3
Segment         7       2310    18 Cont          -  9.5±2.1     -  8.3±2.5      3.3±1.3
Vehicle         4        846    18 Cont          - 41.4±3.9     + 22.2±5.1     28.8±3.9
Waveform        3       2581    21 Cont          + 18.8±1.5     + 15.3±2.0     24.0±2.2
Wine            3        178    13 Cont             2.8±4.0        1.7±3.8      6.7±8.2
Average of error rates                              20.1           17.9         16.2

Table 1. Data characteristics and results of the base classifiers (error rate ± standard deviation).
5.3 Cascade Generalization

We have run all possible two-level combinations of the base classifiers. Table 2(a) presents the results of using C4.5 at the top level. Each column corresponds to a Cascade Generalization combination. For each combination, the significance of a t-test comparing the composite model with its individual components is presented, in the same order as they appear in the header.

Dataset       C4.5∇Bayes      C4.5∇C4.5     C4.5∇Disc       C4∇Dis∇Bay     Stacked Gen
Australian       14.3±3.1      15.2±6.2       14.8±6.1        13.6±6.0       14.3±5.0
Balance      ++   6.1±2.8      22.1±5.2   ++   5.4±2.0         6.6±2.0 +     12.5±5.0
Breast(W)         2.8±1.7       5.6±4.8    +   4.1±6.0         2.7±2.0        2.4±2.0
Diabetes         25.2±6.9      24.4±6.9       24.2±5.8        24.8±9.0       22.4±6.0
German       ++  24.9±4.4      29.1±3.7       26.2±6.0        28.4±4.0       25.0±5.0
Glass            38.1±9.6      32.3±12.0      36.1±10.9       37.6±12.0      34.3±12.0
Heart         -  21.1±4.9      20.0±7.2       17.8±5.5        17.0±5.0       16.3±8.0
Iono          -  13.1±6.6       8.9±5.8       13.1±5.1        10.6±6.0       10.6±6.0
Iris              6.7±5.4       4.7±4.5        3.4±3.4         3.3±4.0        4.0±3.0
Monks-1       +   1.6±3.1       2.3±4.4    +   2.3±4.4         1.4±3.0        2.3±4.0
Monks-2      ++  17.9±9.9      32.9±5.9       32.9±5.9        16.7±9.0 +     32.9±6.0
Monks-3       +   0.4±0.9       0.0±0.0    +   0.0±0.0         0.2±1.0 +      2.1±2.0
Satimage     ++  13.0±1.4      13.7±1.3   ++  12.4±1.5        13.0±1.0       13.5±1.0
Segment       +   4.0±0.8       3.2±1.4    +   3.4±1.5         3.0±1.0        3.4±1.0
Vehicle       +  27.4±5.9      28.2±4.3    -  28.2±4.3        22.1±3.0 +     29.2±4.0
Waveform      +  17.2±2.3    -  24.4±2.0   +  16.6±1.4        16.9±2.0       16.5±2.0
Wine              3.9±4.6       6.7±8.2        2.2±3.9         3.4±4.0        2.8±4.0
Mean             13.9          16.1           14.3            13.0           14.4

Table 2. (a) Results of Cascade Generalization. (b) Comparison with Stacked Generalization. A + after the C4∇Dis∇Bay column marks a significant win over Stacked Generalization.

The trend in these results is a clear improvement over the base classifiers. We never observe an error-rate degradation of a composite model in relation to its individual components. Using C4.5 as the high-level classifier, the performance is improved, with a significance level of 95%, 22 times over one of the components, and it is degraded 5 times. Using Naive Bayes at the top, there are 21 cases against 9. Using Discrim at the top, there are 22 cases against 7. In some cases, there is a significant increase in performance when compared with all the components. For example, the composition C4.5∇Naive Bayes improves, with statistical significance, on both components on 4 datasets, C4.5∇Discrim and Naive Bayes∇Discrim on 2 datasets, and Discrim∇C4.5 on 1 dataset. The most promising combinations are C4.5∇Discrim and C4.5∇Naive Bayes. The new attributes built by Discrim or Naive Bayes set relations between attributes that are outside the scope of DNF algorithms like C4.5. These new attributes systematically appear at the root of the composite models. A particularly successful application of Cascade is on the Balance dataset.

In another experiment, we compared C4∇Discrim∇Bayes against Stacked Generalization, which was reimplemented following the method of Ting and Witten [18]. In this scheme Discrim is the level-1 algorithm; C4.5 and Bayes are the level-0 algorithms. The attributes of the level-1 data are the conditional probabilities P(Ci|x) given by the level-0 classifiers. The level-1 data is built using an internal 5-fold stratified cross validation. On these datasets, C4∇Discrim∇Bayes performs significantly better on 4 datasets and worse on one. Cascade competes well with the Stacked method, with the advantage that it does not use the internal cross validation.
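A minimal sketch of this stacking baseline, assuming scikit-learn estimators as stand-ins for C4.5, Naive Bayes, and Discrim, and using cross_val_predict for the internal 5-fold stratified cross validation; none of these names come from the original implementation.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def stacked_generalization(X_train, y_train, X_test):
    level0 = [DecisionTreeClassifier(), GaussianNB()]
    cv = StratifiedKFold(n_splits=5)
    # Level-1 training attributes: out-of-fold P(Ci|x) from each level-0 learner.
    z_train = np.hstack([cross_val_predict(clf, X_train, y_train, cv=cv,
                                           method='predict_proba')
                         for clf in level0])
    # Level-1 test attributes: level-0 models refit on the full training set.
    z_test = np.hstack([clf.fit(X_train, y_train).predict_proba(X_test)
                        for clf in level0])
    # The linear discriminant is the level-1 generalizer.
    return LinearDiscriminantAnalysis().fit(z_train, y_train).predict(z_test)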
5.4 How Far from the Best?

Error rates are not comparable across datasets. Although the paired t-test procedure is commonly used to determine whether two means are statistically different, it only allows the comparison of two algorithms, while we are interested in comparisons that involve several algorithms. As such, for each dataset we identify the classifier with the lowest error rate; call it E_low. Denote the error rate of algorithm_i on the given dataset as E_alg_i. We compute the error margin as the standard deviation of a Bernoulli distribution: Em = sqrt(E_low (1 − E_low) / N), where N is the number of examples in the test set. For each algorithm in the comparison, we compute its distance to the best algorithm in terms of Em. That is, a low value of Distance_i means that algorithm_i has an error rate similar to the best algorithm, whilst a high value means that the performance of algorithm_i is far from that of the best algorithm. The goal of this analysis is to compare algorithm performance across datasets. Em is a criterion that can give insights about the difficulty of the problem. Table 3 summarizes the average distances of all models.

C4∇Dis  C4∇Bay  SG   C4∇C4  C4   Bay∇C4  Dis∇Bay  Disc  Bay∇Dis  Dis∇C4  Dis∇Dis  Bay∇Bay  Bayes
 1.9     1.9    2.5   4.8   4.9   5.5     5.9      7.0   7.1      7.1     7.5      8.3      8.9

Table 3. Average of distances to the best.
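A minimal sketch of this measure, assuming error rates are given as fractions and interpreting the distance as the difference in error rates divided by Em (the paper does not spell out the exact distance formula).

import math

def distances_to_best(error_rates, n_test):
    # error_rates: mapping from algorithm name to its error rate on one dataset.
    e_low = min(error_rates.values())
    em = math.sqrt(e_low * (1 - e_low) / n_test)   # Bernoulli standard deviation
    # Note: the measure is undefined when e_low == 0.
    return {alg: (e - e_low) / em for alg, e in error_rates.items()}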
6 Conclusions
This work presents a new methodology for classifier combination. The basic idea of Cascade Generalization consists of a reformulation of the input space by means of the insertion of new attributes. The new attributes are obtained by applying a base classifier. The number of new attributes is equal to the number of classes and, for each example, they are computed as the conditional probability of the example belonging to class C_i, given by the base classifier. The new attributes are terms, or functions, in the representational language of the base classifier. This constructive step acts as a way of extending the description language bias of the high-level classifiers.
There are two main points that differentiate Cascade Generalization from previous methods for multiple models. The first is related to its ability to merge different models. We get a single model whose components are terms of the base models' languages. The bias restrictions imposed by using single model classes are relaxed in the directions given by the base classifiers. This aspect is exploited by combinations like C4.5∇Discrim or C4.5∇Naive Bayes. The new attributes built by Discrim or Naive Bayes set relations between attributes that are outside the scope of DNF algorithms like C4.5. These new attributes systematically appear at the root of the composite models.
Cascade Generalization gives a single structured model for the data, and in this way it is better suited to capture insights about problem structure. The second point is related to the use of probability class distributions, which allows us to exploit information about the strength of each classifier. This is very useful information, especially when combining the predictions of classifiers. We have shown that this methodology can improve the accuracy of the base classifiers, while preserving the ability to provide a single and structured model for the data.
Acknowledgments: Gratitude is expressed for the support given by the FEDER and PRAXIS XXI projects and the Plurianual support attributed to LIACC, and also to P. Brazdil and the anonymous reviewers for useful comments.
References
1. Ali, K. and Pazzani, M. (1996) "Error Reduction through Learning Multiple Descriptions", Machine Learning, Vol. 24, No. 1, Kluwer Academic Publishers
2. Breiman, L. (1996) "Bagging Predictors", Machine Learning, 24, Kluwer Academic Publishers
3. Breiman, L. (1996) "Bias, Variance, and Arcing Classifiers", Technical Report 460, Statistics Department, University of California
4. Brodley, C. (1995) "Recursive Automatic Bias Selection for Classifier Construction", Machine Learning, 20, Kluwer Academic Publishers
5. Buntine, W. (1990) "A Theory of Learning Classification Rules", PhD Thesis, University of Sydney
6. Chan, P. and Stolfo, S. (1995) "A Comparative Evaluation of Voting and Meta-learning on Partitioned Data", in Machine Learning: Proceedings of the 12th International Conference, Ed. L. Saitta
7. Chan, P. and Stolfo, S. (1995) "Learning Arbiter and Combiner Trees from Partitioned Data for Scaling Machine Learning", KDD 95
8. Domingos, P. and Pazzani, M. (1996) "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", in Machine Learning: Proceedings of the 12th International Conference, Ed. L. Saitta
9. Freund, Y. and Schapire, R. (1996) "Experiments with a New Boosting Algorithm", in Machine Learning: Proceedings of the 13th International Conference, Ed. L. Saitta
10. Gama, J. (1997) "Probabilistic Linear Tree", in Machine Learning: Proceedings of the 14th International Conference, Ed. D. Fisher
11. Gama, J. (1997) "Oblique Linear Tree", in Advances in Intelligent Data Analysis: Reasoning about Data, Ed. X. Liu, P. Cohen, M. Berthold, Springer Verlag LNCS
12. Henery, R. (1997) "Combining Classification Procedures", in Machine Learning and Statistics: The Interface, Ed. Nakhaeizadeh, C. Taylor, John Wiley & Sons, Inc.
13. Kohavi, R. and Wolpert, D. (1996) "Bias plus Variance Decomposition for Zero-One Loss Functions", in Machine Learning: Proceedings of the 13th International Conference, Ed. L. Saitta
14. Langley, P. (1993) "Induction of Recursive Bayesian Classifiers", in Machine Learning: ECML-93, Ed. P. Brazdil, LNAI 667, Springer Verlag
15. Mitchell, T. (1997) Machine Learning, McGraw-Hill Companies, Inc.
16. Quinlan, R. (1996) "Bagging, Boosting and C4.5", in Proceedings of the 13th National Conference on Artificial Intelligence, AAAI Press
17. Quinlan, R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, Inc.
18. Ting, K.M. and Witten, I.H. (1997) "Stacked Generalization: When Does It Work?", in Proceedings of the International Joint Conference on Artificial Intelligence
19. Tumer, K. and Ghosh, J. (1995) "Classifier Combining: Analytical Results and Implications", in Proceedings of the Workshop on Induction of Multiple Learning Models
20. Thrun, S., et al. (1991) The Monk's Problems: A Performance Comparison of Different Learning Algorithms, CMU-CS-91-197
21. Wolpert, D. (1992) "Stacked Generalization", Neural Networks, Vol. 5, Pergamon Press