
Combining Models to Improve Classifier Accuracy and Robustness

Dean W. Abbott
Abbott Consulting
P.O. Box 22536
San Diego, CA 92192-2536 USA
Email: [email protected]

Presented at the 1999 International Conference on Information Fusion (Fusion99), Sunnyvale, CA, July 6-8, 1999.

Abstract

Recent years have shown an explosion in research related to the combination of predictions from individual classification or estimation models, and results have been very promising. By combining predictions, more robust and accurate models are almost guaranteed to be generated without the need for the high degree of fine-tuning required for single-model solutions. Typically, however, the models for the combination process are drawn from the same model family, though this need not be the case. This paper summarizes the current direction of research in combining models, and then demonstrates a process for combining models from diverse algorithm families. Results for two datasets are shown and compared with the most popular methods for combining models within algorithm families.

Key Words: Data mining, model combining, classification, boosting

1. Introduction

Many terms have been used to describe the concept of model combining in recent years. Elder and Pregibon [1] used the term Blending to describe "the ancient statistical adage that 'in many counselors there is safety'". Elder later called this technique, particularly applied to combining models from different classifier algorithm families, Bundling [2]. The same concept has been described as Ensemble of Classifiers by Dietterich [3], Committee of Experts by Steinberg [4], and Perturb and Combine (P&C) by Breiman [5]. The concept is actually quite simple: train several models from the same dataset, or from samples of the same dataset, and combine the output predictions, typically by voting for classification problems and by averaging output values for estimation problems. The improvements in model accuracy have been so significant that Friedman et al. [6] stated that one form of model combining (boosting) "is one of the most important recent developments in classification methodology."

There is a growing base of support in the literature for model combining providing improved model performance. Wolpert [7] used regression to combine neural network models (Stacking). Breiman [8] introduced Bagging, which combines outputs from decision tree models generated from bootstrap samples (with replacement) of a training data set; the models are combined by simple voting. Freund and Schapire [9] introduced Boosting, an iterative process of weighting more heavily cases classified incorrectly by decision tree models, and then combining all the models generated during the process. ARCing by Breiman [5] is a form of boosting that, like boosting, weights incorrectly classified cases more heavily, but instead of the Freund and Schapire formula for weighting, weighted random samples are drawn from the training data. These are just a few of the most popular algorithms currently described in the literature, and researchers have developed many more methods as well.
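To make the voting and averaging step concrete, the following minimal Python sketch (added here for illustration; it is not part of the original paper) combines predictions that are assumed to come from already-trained models scored on the same cases:

    import numpy as np

    def combine_by_vote(predictions):
        """Majority vote across classifiers.

        predictions: list of 1-D integer (or label) arrays, one per model,
        each holding the predicted class for the same cases.
        """
        stacked = np.vstack(predictions)          # shape: (n_models, n_cases)
        combined = []
        for case_preds in stacked.T:              # one row per case
            labels, counts = np.unique(case_preds, return_counts=True)
            combined.append(labels[np.argmax(counts)])
        return np.array(combined)

    def combine_by_average(predictions):
        """Average the outputs of estimation (regression) models."""
        return np.mean(np.vstack(predictions), axis=0)

For a classification problem, one would call combine_by_vote([model.predict(X) for model in models]) on the same evaluation cases; the estimation case simply averages the stacked outputs.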
While the reasons combining models works so well are not rigorously understood, there is ample evidence that improvements over single models are typical. Breiman [5] demonstrates bagging and arcing improving single CART models on 11 machine learning datasets in every case. Additionally, he documents that arcing, using no special data preprocessing or classifier manipulation (just reading the data and creating the model), often achieves the performance of handcrafted classifiers that were tailored specifically for the data.

However, it seems that producing relatively uncorrelated output predictions in the models to be combined is necessary to reduce error rates. If output predictions are highly correlated, little reduction in error is possible, as the "committee of experts" has no diversity to draw from, and therefore no means to overcome erroneous predictions. Decision trees are very unstable in this regard, as small perturbations in the training data set can produce large differences in the structure (and predictions) of a model. Neural networks are sensitive to the data used to train the models and to the many training parameters and random number seeds that need to be specified by the analyst. Indeed, many researchers merely train neural network models changing nothing but the random seed for weight initialization to find models that have not converged prematurely in local minima. Polynomial networks have considerable structural instability, as different datasets can produce significantly different models, though many of the differences in models produce correlated results; there are many ways to achieve nearly the same solution.

While most of the combining algorithms described above were used to improve decision tree models, combining can be used more broadly. Trees often show benefits from combining because the performance of individual trees is typically worse than that of other data mining methods such as neural networks and polynomial networks, and because trees tend to be structurally unstable. In other words, small perturbations in the training data set for decision trees can result in very different model structures and splits. Nevertheless, results for any data mining algorithm that can produce significant model variations can be improved through model combining, including neural networks and polynomial networks. Regression, on the other hand, is not easily improved through combining models because it produces very stable and robust models. It is difficult, through sampling of the training data or model input selection, to change the behavior of regression models significantly enough to provide the diversity needed for combining to improve on single models.

A strong case can be made for combining models across algorithm families as a means of providing uncorrelated output estimates because of the differences in the basis functions used to build the models. For example, decision trees produce staircase decision boundaries via rules that act on one variable at a time, neural networks produce smooth decision boundaries from linear basis functions and a squashing function, and polynomial networks use cubic polynomials to produce an even smoother decision boundary. Abbott [10] showed considerable differences in classifier performance class by class; information that is clear to one classifier is obscure to another. Since it is difficult to gauge a priori which algorithm(s) will produce the lowest error for each domain (on unseen data), combining models across algorithm families mitigates that risk by including contributions from all the families.
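One informal way to check whether candidate models are diverse enough for combining to help, in the spirit of the argument above, is to measure how often their predictions disagree on held-out cases. The sketch below is an illustration added here (not from the paper); the model names and prediction arrays are assumed inputs.

    import numpy as np
    from itertools import combinations

    def pairwise_disagreement(predictions):
        """Fraction of cases on which each pair of models disagrees.

        predictions: dict mapping model name -> 1-D array of predicted labels
        for the same cases. Returns {(name_a, name_b): rate in [0, 1]}.
        """
        rates = {}
        for a, b in combinations(sorted(predictions), 2):
            rates[(a, b)] = float(np.mean(predictions[a] != predictions[b]))
        return rates

Pairs with near-zero disagreement offer little to a committee; pairs that disagree often, yet are individually accurate, are the more promising candidates for combining.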
2. Method for Combining Models

The model combining done here expands on the bundling research done by Elder [2]. Models from six algorithm families were trained for each dataset. To determine which model to use for each algorithm family, dozens to hundreds of models were trained and only the single best was retained; only the best model from each algorithm family was represented. For each model, a search for the best model parameters was performed first, increasing the likelihood that the best model for each algorithm was found.

2.1. Algorithms and Combining Method

The six algorithms used were neural networks, decision trees, k-nearest neighbor, Gaussian mixture models, radial basis functions, and nearest cluster models. Five of the six models for each dataset were created using the PRW by Unica Technologies [11], and the sixth model (C5 decision trees) was created using Clementine by SPSS [12]. Full descriptions of the algorithms can be found in Kennedy, Lee, et al. [13].

Once the six models were obtained, they were combined in every unique combination possible, including all two-, three-, four-, five-, and six-way combinations. Each of the combinations was achieved by a simple voting mechanism, with each algorithm model having one vote. To break ties, however, a slight weighting factor was used, with the models having the best performance during training given slightly larger weight (Table 2.1). For example, if an example in the evaluation dataset had one vote from a first-ranked model and another vote from a second-ranked model, the first-ranked model would win the vote 1.28 to 1.22. The numbers themselves are arbitrary, and only need to provide a means to break ties.

Table 2.1: Model Combination Voting Weights to Break Ties

    Model Rank on Training Data    Weight
    First                          1.28
    Second                         1.22
    Third                          1.16
    Fourth                         1.10
    Fifth                          1.05
    Sixth                          1.00
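As an illustration of this tie-breaking vote (a sketch added here, not code from the paper), the rank-to-weight mapping follows Table 2.1 and the per-model predictions are assumed inputs:

    # Weights from Table 2.1, indexed by each model's rank on the training data.
    RANK_WEIGHTS = {1: 1.28, 2: 1.22, 3: 1.16, 4: 1.10, 5: 1.05, 6: 1.00}

    def weighted_vote(votes):
        """Combine one case's predictions by rank-weighted voting.

        votes: list of (training_rank, predicted_class) pairs, one per model
        in the combination. Each model adds its Table 2.1 weight to the class
        it predicts, and the class with the largest total wins. Because the
        weights are all close to 1, they only matter when an equal number of
        models back each class, i.e., they break ties.
        """
        totals = {}
        for rank, predicted_class in votes:
            totals[predicted_class] = totals.get(predicted_class, 0.0) + RANK_WEIGHTS[rank]
        return max(totals, key=totals.get)

    # Example from the text: a first-ranked model outvotes a second-ranked one, 1.28 to 1.22.
    print(weighted_vote([(1, "class_a"), (2, "class_b")]))   # -> "class_a"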
2.2. Datasets

The two datasets used are the glass data from the UCI machine learning data repository [14] and the satellite data used in the Statlog project [15]. Characteristics of the datasets are shown in Table 2.2.

Table 2.2: Dataset Characteristics

                 Number of Examples        Inputs/Outputs
    Dataset      Train    Test    Eval     Vars    Classes
    Glass          150       0      64        9          6
    Satellite     3105    1330    2000       36          6

Training data refers to the cases that were used to find model weights and parameters. Testing data was used to check the training results on independent data, and was ultimately used to select which model to retain from those trained. Training and testing data were split randomly, with 70% of the data used for training and 30% for testing. No testing data was used for the glass dataset because so few examples were available; models were trained and pruned to reduce the risk of overfitting the data.

A third, separate dataset, the evaluation dataset, was used to report all results shown in this paper. The evaluation data was not used during the model selection process, only to score the individual and combined models, so that bias would not be introduced. The glass data was split in such a way as to retain the relative class representation in both the training and evaluation datasets. A breakdown of the number of cases per class is shown in Tables 2.3 and 2.4.

Table 2.3: Glass Data Class Breakdown

    Class    Train    Eval
    1           49      21
    2           54      22
    3           12       5
    5            9       4
    6            6       3
    7           20       9

Note that there are no examples for class 4.

Table 2.4: Satellite Data Class Breakdown

    Class    Train    Test    Eval
    1          752     320     461
    2          323     156     224
    3          649     312     397
    4          283     132     212
    5          343     127     237
    7          755     283     470

Note that there are no examples for class 6.

3. Results

Results are compiled for the single models from each of the six algorithms and for all possible model combinations; the number of combinations at each size is shown in Table 3.1. The emphasis here is on minimizing classifier error without going through the process of fine-tuning the classifiers with domain knowledge to improve performance, a necessary step for real-world applications.

Table 3.1: Number of Model Combinations

    Number of Models Combined    Number of Combinations
    1                             6
    2                            15
    3                            20
    4                            15
    5                             6
    6                             1
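The enumeration of every one- through six-way combination (and the per-size counts in Table 3.1) can be reproduced with a few lines of Python. This sketch is illustrative only; scoring each subset with the weighted vote of Section 2.1 is assumed to follow separately.

    from itertools import combinations

    MODELS = ["Neural Network", "Decision Tree (C5)", "k-Nearest Neighbor",
              "Gaussian Mixture", "Radial Basis Function", "Nearest Cluster"]

    # Enumerate every unique subset of the six models (sizes 1 through 6).
    all_subsets = []
    for size in range(1, len(MODELS) + 1):
        subsets = list(combinations(MODELS, size))
        all_subsets.extend(subsets)
        print(size, len(subsets))   # prints 6, 15, 20, 15, 6, 1, as in Table 3.1

    # Each subset would then be scored on the evaluation data with the
    # rank-weighted vote described in Section 2.1 (see the earlier sketch).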
3.1. Glass Dataset Results

The single best models for each algorithm family are shown in Figure 3.1 below. Results are presented in terms of classification errors, so smaller numbers (shorter bars) are better. Nearest neighbor had perfect training results (by definition), and the best remaining algorithms were, in order, neural networks, decision trees, Gaussian mixture, nearest cluster, and radial basis functions, ranging from 28.1% error to 37.5% error. Interestingly, nearest cluster and Gaussian mixture models, both using PDF measures, performed best on the evaluation data.

[Figure 3.1: Single Model Results on Glass Data (percent classification error by algorithm).]

Model combinations produced the results shown in Figure 3.2. Not all data points can be seen, as model combinations sometimes produce identical error scores. Two interesting trends can be seen in the figure. First, the trend is for the percent classification error to decrease as the number of models combined increases, though the very best (lowest classification error) case occurs with 3 or 4 models. The lowest error rate (23.4%) occurs for the combinations in Table 3.2 below. Amazingly, radial basis functions occur in all four of the best combinations, even though it was clearly the single worst classifier. Each of the other classifiers was represented exactly twice, except the Gaussian mixture, which occurred once. Radial basis functions also appeared in two of the four worst combinations of more than 3 classifiers (Table 3.3), so it appears this algorithm is a wild card, and one cannot tell from the training results alone whether or not it will combine well. The worst 2-way models always include neural networks or k-nearest neighbor, and in these cases the models were not improved compared to the single model results (34.4% for k-nearest neighbor, 31.3% for neural networks).

[Figure 3.2: Combined Model Results on Glass Data (percent classification error versus number of models combined).]

[Table 3.2: Combinations Yielding Lowest Error Rate (23.4%).]

[Table 3.3: Greater than 3-way Combinations Yielding Highest Error Rate (29.7%).]

When the combination model results are represented only by the error summary statistics (minimum, maximum, and average), the trends become clearer, as seen in Figure 3.3.

[Figure 3.3: Combined Model Trend on Glass Data (minimum, maximum, and average percent classification error versus number of models combined).]

The average model error never gets worse as more models are added to the combinations. Additionally, the spread between the best and the worst shrinks as the number of models combined increases: both bias and variance are reduced. The error was reduced by 4.7 percentage points, a 16.7% error reduction compared to the best Gaussian mixture model, which had 28.1% error. However, the reduction found here is not as good as the reduction found by Breiman [5] using boosting (Figure 3.4), which brought the error down to 21.6%.

[Figure 3.4: Comparison of Model Combination with Breiman Arcing (glass data).]

3.2. Satellite Dataset Results

Results for the satellite data are similar to those for the glass data. Figure 3.5 shows the training, testing, and evaluation results for the single models. Results are more uniform than for the glass data, but radial basis functions and decision trees are the worst performers on evaluation data. The best are nearest neighbor, neural networks, and Gaussian mixture models.

[Figure 3.5: Single Model Results on Satellite Data (percent classification error by algorithm).]

The trends are shown in Figure 3.6. Once again the errors and the spread between maximum and minimum errors are both reduced as the number of combined models increases, though once again the very best models occur for the 3-way combination of k-Nearest Neighbor, Neural Network, and Radial Basis Function, and for the same three with Decision Trees added. Once again, radial basis functions are involved in the best combinations, and again are also involved in the worst combination models.

[Figure 3.6: Combined Model Trend on Satellite Data (minimum, maximum, and average percent classification error versus number of models combined).]

Comparing the model combination results to Breiman's results using arcing (boosting) shows once again the boosting algorithm performing better, though the combination betters bagging by a small amount here (Figure 3.7).

[Figure 3.7: Comparison of Model Combination with Breiman Arcing (satellite data).]
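The trend summaries plotted in Figures 3.3 and 3.6 (minimum, maximum, and average error versus the number of models combined) amount to a simple grouping of the per-combination scores. A minimal sketch, assuming a list of (models_in_combination, error_rate) results such as the one produced by the enumeration sketch above:

    def error_trend(results):
        """Summarize combination errors by combination size.

        results: list of (models_in_combination, error_rate) pairs.
        Returns {size: (min_error, max_error, average_error)}, the quantities
        plotted in Figures 3.3 and 3.6.
        """
        by_size = {}
        for models, error in results:
            by_size.setdefault(len(models), []).append(error)
        return {size: (min(errs), max(errs), sum(errs) / len(errs))
                for size, errs in sorted(by_size.items())}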
4. Conclusions and Discussion

Clearly, combining models improves model accuracy and reduces model variance, and the more models combined (up to the number investigated in this paper), the better the result. However, determining which individual models combine best from training results alone is difficult; there is no clear trend. Simply selecting the best individual models does not necessarily lead to a better combined result.

While combining models across algorithm families reduces error compared to the best single models, it does not perform as well as boosting. The advantage of boosting over simple model combining is that boosting acts directly to reduce error cases, whereas combining works indirectly. The model combining voting methods are not tuned to take into account the confidence that a classification decision is made correctly, nor do they concentrate more heavily on the difficult cases. More research is necessary to confirm these suggested explanations.

References

[1] Elder, J.F., and Pregibon, D., "A Statistical Perspective on Knowledge Discovery in Databases", in Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, 1995.
[2] Elder, J.F. IV, and Abbott, D.W., "Fusing Diverse Algorithms", 29th Symposium on the Interface, Houston, TX, May 14-17, 1997.
[3] Dietterich, T., "Machine-Learning Research: Four Current Directions", AI Magazine, 18(4): 97-136, 1997.
[4] Steinberg, D., CART Users Manual, Salford Systems, 1997.
[5] Breiman, L., "Arcing Classifiers", Technical Report, July 1996.
[6] Friedman, J., Hastie, T., and Tibshirani, R., "Additive Logistic Regression: A Statistical View of Boosting", Technical Report, Stanford University, 1998.
[7] Wolpert, D.H., "Stacked Generalization", Neural Networks, 5: 241-259, 1992.
[8] Breiman, L., "Bagging Predictors", Machine Learning, 24: 123-140, 1996.
[9] Freund, Y., and Schapire, R.E., "Experiments with a New Boosting Algorithm", Proceedings of the Thirteenth International Conference on Machine Learning, July 1996.
[10] Abbott, D.W., "Comparison of Data Analysis and Classification Algorithms for Automatic Target Recognition", Proceedings of the 1994 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, October 1994.
[11] Unica Technologies, Inc., http://www.unica-usa.com
[12] SPSS, Inc., http://www.spss.com/software/clementine
[13] Kennedy, R.L., Lee, Y., Van Roy, B., Reed, C.D., and Lippmann, R.P., Solving Data Mining Problems Through Pattern Recognition, Prentice Hall PTR, Upper Saddle River, NJ, 1997.
[14] UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
[15] Michie, D., Spiegelhalter, D., and Taylor, C., Machine Learning, Neural and Statistical Classification, Ellis Horwood, London, 1994.