Appl Intell, DOI 10.1007/s10489-017-1008-y
© Springer Science+Business Media, LLC 2017

A tree-based algorithm for attribute selection

José Augusto Baranauskas (1) · Oscar Picchi Netto (1) · Sérgio Ricardo Nozawa (2) · Alessandra Alaniz Macedo (1)

(1) Department of Computer Science and Mathematics, Faculty of Philosophy, Sciences and Languages at Ribeirao Preto, University of Sao Paulo (USP), Av. Bandeirantes, 3900, Ribeirão Preto, SP, 14040-901, Brazil
(2) Dow AgroSciences (Seeds, Traits, Oils), Av. Antonio Diederichsen, 400, Ribeirão Preto, SP, 14020-250, Brazil

José Augusto Baranauskas [email protected] · Oscar Picchi Netto [email protected] · Sérgio Ricardo Nozawa [email protected] · Alessandra Alaniz Macedo [email protected]

Abstract  This paper presents an improved version of a decision tree-based filter algorithm for attribute selection. The algorithm can be seen as a pre-processing step for induction algorithms in machine learning and data mining tasks. The filter was evaluated on thirty medical datasets with respect to its execution time, data compression ability, and AUC (Area Under the ROC Curve) performance. On average, our filter was faster than Relief-F but slower than both CFS and Gain Ratio. However, for low-density (high-dimensional) datasets, our approach selected less than 2% of all attributes while producing no performance degradation in a further evaluation based on five different machine learning algorithms.

Keywords  Attribute selection · Filter · Decision tree · High dimensional data · Data pre-processing

1 Introduction

Data Mining (DM) is an interdisciplinary field that brings together techniques from Machine Learning, statistics, pattern recognition, databases and visualization to address the issue of extracting high-level knowledge from low-level data in large databases [1]. When Machine Learning (ML) techniques are used for DM, where the number of records (instances) is very large, several representative samples are usually taken from the database and presented to an ML algorithm; the knowledge extracted from these samples by the ML algorithms is then combined in some way [2]. The exponential growth in the amount of available biological data raises two problems: efficient information storage and management, and extraction of useful information from these data [3].

Regarding the use of ML in DM, one important issue to consider is reducing the dimensionality of database records, which can be achieved by reducing the number of record attributes (i.e., deleting columns of tables in the database literature, or features/attributes in the Machine Learning literature). The data subset resulting from these deletions maintains the same number of instances, but only a subset of features, with predictive performance comparable to the full set of features, remains. This process of attribute elimination is known as the Feature Subset Selection (FSS) problem, where one of the central issues is the selection of relevant features and/or the elimination of irrelevant ones.

Using ML and DM algorithms is a strategy to extract information more efficiently. However, when the amount of data is huge, the use of an efficient FSS algorithm is sometimes essential not only to speed up algorithms but also to reduce the data that can be benchmark tested. This is why FSS, initially an illustrative example, has become a real prerequisite for building models [1].
In the particular case of medical or biological data analysis, or even in text mining, the amount of data is huge, and an FSS algorithm can help to reduce it. There are several reasons to conduct FSS. First, FSS generally improves accuracy because many ML algorithms perform poorly when given too many features. Second, FSS may improve comprehensibility, which is the ability of humans to understand the data and the classification rules induced by symbolic ML algorithms, such as rules and decision trees. Finally, FSS can reduce measurement cost, because measuring features may be expensive in some domains. In this study, we present an approach to FSS that employs decision trees within a filter algorithm [4].

This work is organized as follows: Section 2 presents the basic concepts of the FSS problem; Section 3 describes the algorithm proposed in this study; Section 4 shows the experimental setup used to evaluate the proposed algorithm; Section 5 presents the experiments and discusses the results; and Section 6 presents the conclusions of this study.

2 Feature subset selection

Supervised learning is the process of automatically creating a classification model from a set of instances (records or examples), called the training set, which belong to a set of classes. There are two aspects to consider in this process: the features that should be used to describe the concept and the combination of these features. Once a model (classifier) is created, it can help to predict the class of other unclassified examples automatically. In other words, in supervised learning an inducer is given a set of N training examples containing A attributes. Each example x is an element of the set F_1 × F_2 × ... × F_A, where F_j is the domain of the j-th feature. Training examples are tuples (x, y), where y is the label, output or class. The y values are typically drawn from a discrete set of c classes {1, ..., c} in the case of classification, or from the real values in the case of regression. In this work, we refer to classification. Given a set of training examples, the learning algorithm (inducer) outputs a classifier such that, given a new instance, it accurately predicts the label y.

One of the central problems in supervised learning is the selection of useful features. Although most learning methods attempt to either select features or assign them degrees of importance, both theoretical analyses and experimental studies indicate that many algorithms scale poorly to domains with large numbers of irrelevant features. For example, the number of training cases necessary for the simple nearest-neighbor classifier to reach a given level of accuracy appears to grow exponentially with the number of irrelevant features, independently of the target concept. Even methods that induce univariate decision trees, which explicitly select some attributes in favor of others, exhibit this behavior for some target concepts. Some techniques, like the Naïve Bayes classifier, are robust with respect to irrelevant features, but they can be very sensitive to domains with correlated features, even if those features are relevant. Since this sort of technique relies on independence among features, additional methods might be necessary to select a useful subset of features when many features are available [5]. For instance, biological and medical domains often impose difficult obstacles to learning algorithms, such as high dimensionality, a huge or very small number of instances, several possible class values, and unbalanced classes.
This may explain why researchers are still proposing a variety of algorithms, although research on FSS is not new in the ML community [6–10]. According to [11], approaches to feature selection developed in the research literature can be grouped into three classes: (i) approaches that embed the selection within the basic induction algorithm, (ii) approaches that use feature selection to filter features during a pre-processing step while ignoring the induction algorithm, and (iii) approaches that treat feature selection as a wrapper around the induction process, using the induction algorithm as a black box (see also [12–15]). Another possible approach is to use a hybrid (filter and wrapper) method to try to optimize the efficiency of the feature selection process [15–18].

2.1 The filter approach

In the FSS filter approach, which is of special interest within the scope of this work, features are filtered regardless of the induction algorithm. In this approach, FSS is accomplished as a pre-processing step in which the effect of the selected feature subset on the performance of the induction algorithm is completely ignored. For example, a simple decision tree algorithm can be used as an FSS filter to select features in a large feature space for other inducers that take longer to search their solution space. The set of features selected by the tree is the output of the filter FSS process, and the tree itself is discarded. The remaining unused features are then deleted from the training set, reducing the training set dimension. Any other inducer can use this training set to extract a classifier. Still, features that are good for decision trees are not necessarily useful for another family of algorithms that may have an entirely different inductive bias.

Filtering algorithms can be grouped based on whether they evaluate the relevance of features individually or through feature subsets [19]. Algorithms in the first group assign some relevancy score to features individually and rank them based on their relevance to the target class concept. A feature is selected if its relevance is greater than a certain threshold. These algorithms can only capture the relevance of features w.r.t. the target concept, but cannot find redundancy among features. Two well-known algorithms that rely on individual relevance evaluation are Relief [20] and Gain Ratio [21]. Algorithms in the second group search through feature subsets, guided by some relevancy score computed for each subset. The subset is selected when the search stops. In this group, different algorithms are designed by changing the relevancy score as well as the search strategy. The CFS algorithm uses heuristic search and a correlation relevancy score [22]. The correlation score assumes that good feature subsets contain features highly correlated with the target class, yet uncorrelated with (not predictive of) each other. In the experiments reported in Section 4, algorithms from both groups of filters were used: Gain Ratio, Relief-F, and CFS.

As mentioned earlier, the main disadvantage of the filter approach is that it totally ignores the effect of the selected feature subset on the performance of the induction algorithm. However, an interesting property of filters is that once a dataset is filtered it can be used and evaluated by several inducers and/or paradigms, thus saving computational time. The next section describes the filter approach proposed in this study.
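To make the single-tree filter idea concrete, the sketch below selects the attributes that appear in one induced decision tree and discards the rest; this corresponds to what the paper later calls the UT filter. It is our own illustration, not the paper's implementation: the paper uses Weka's J48 (C4.5), while here scikit-learn's CART-style DecisionTreeClassifier is used as a stand-in, and the synthetic data are invented for the example.

```python
# Illustrative sketch only: a single decision tree used as an FSS filter.
# The paper's implementation is in Weka (J48); scikit-learn is a stand-in here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def single_tree_filter(X, y, random_state=0):
    """Return the indices of the attributes used in one induced decision tree."""
    tree = DecisionTreeClassifier(random_state=random_state).fit(X, y)
    # tree_.feature holds the attribute tested at each internal node (-2 marks leaves)
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    return used

# Usage: keep only the selected columns and hand the reduced data to any inducer.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))           # 100 instances, 50 attributes
    y = (X[:, 3] + X[:, 7] > 0).astype(int)  # class depends on attributes 3 and 7
    selected = single_tree_filter(X, y)
    X_reduced = X[:, selected]
    print("selected attributes:", selected)
```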
3 A tree-based filter

As mentioned, fast filter algorithms generally evaluate each attribute individually for some degree of relevance related to the target concept class. Sometimes two or more attributes can be considered at a time, but at a high computational cost [23]. Our approach differs from fast filter algorithms in the sense that a decision tree may be able to capture relationships among several attributes w.r.t. the class at a time. Besides that, inducing a decision tree is fast, which allows one to perform this process on high-dimensional datasets commonly found in gene expression profiles, massive medical databases, or text mining tasks.

Our filter approach iteratively builds a decision tree, selects the attributes appearing on that tree (based on a threshold derived from the performance of the first tree), and removes them from the training set. These steps are repeated until (a) there are no more attributes left in the training set, (b) the induced decision tree is a leaf (which means no attribute can separate the class concepts), or (c) the filter reaches a maximum number of iteration steps. In the end, the filter outputs the selected attributes. The idea behind using the performance of the first tree as a threshold is based on the wrapper heuristic, where the simplest FSS uses the performance of some classifier as the relevancy score. In this sense, only good features, i.e., those with performance greater than a threshold value, are selected.

Algorithm 1 shows the high-level code of our attribute selection approach, where N represents the number of instances in the training set, xi and yi, i = 1, ..., N, represent the vector of attribute values and the class label for instance i, respectively, and A represents the number of attributes.

1. First, a bootstrap sample [24] of all instances is taken, creating the training set (Line 2). Instances that do not appear in the training set (Bag) are set apart as the test set, also known as the out-of-bag (OutOfBag) set (Line 3).
2. The first decision tree is induced by using Bag as the training set (Line 6), and its AUC value is computed from the out-of-bag set, multiplied by the Θ parameter, and then stored in the threshold θ (Line 7). In other words, the threshold θ is the percentage Θ of the AUC of the first tree, computed from the out-of-bag set.
3. Next, attributes are selected in the following way. At every iteration l, the AUC obtained by the decision tree Tl on the out-of-bag set, AUC(Tl, OutOfBag), is compared to the threshold θ, which determines whether the attributes appearing on that tree are selected or not. All attributes on the tree (AttrOnClassifier) are then removed from the training (Bag) (Line 13) and test (OutOfBag) sets (Line 14), and a new tree is grown (Line 16). As already mentioned, this process is repeated until (a) a leaf is induced, (b) all attributes have been used, or (c) the maximum number of steps L is reached (Line 17). Finally, all the selected attributes are returned (Line 18).

Therefore, in the loop (Lines 8–17), if Θ = 0 then all attributes appearing on the induced trees are selected by the filter, regardless of the AUC values. If Θ = 1, only attributes appearing on induced trees with an AUC greater than or equal to that of the first tree are selected by the filter.
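The following Python sketch mirrors the steps just described. It is our illustrative reconstruction, not the authors' Weka implementation: scikit-learn's DecisionTreeClassifier replaces J48, roc_auc_score (restricted here to binary problems for simplicity) replaces Weka's AUC computation, and the default L = ceil(log2(A)) follows the complexity analysis given later in the text.

```python
# Illustrative sketch of the iterative tree-based filter (Algorithm 1 in the text).
# Assumptions: binary class for the AUC computation, scikit-learn trees instead of
# Weka's J48, and default L = ceil(log2(A)) as suggested by the complexity analysis.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def tree_filter(X, y, theta_pct=1.0, L=None, random_state=0):
    """Return the list of selected attribute indices."""
    rng = np.random.default_rng(random_state)
    N, A = X.shape
    if L is None:
        L = int(np.ceil(np.log2(A)))          # default maximum number of iterations

    # Bootstrap sample -> Bag; instances left out -> OutOfBag
    bag_idx = rng.integers(0, N, size=N)
    oob_idx = np.setdiff1d(np.arange(N), bag_idx)

    remaining = np.arange(A)                   # attributes still in the training set
    selected = []

    def grow_tree(cols):
        tree = DecisionTreeClassifier(random_state=random_state)
        tree.fit(X[np.ix_(bag_idx, cols)], y[bag_idx])
        scores = tree.predict_proba(X[np.ix_(oob_idx, cols)])[:, 1]
        auc = roc_auc_score(y[oob_idx], scores)
        used_local = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        return cols[used_local], auc           # attributes on the tree, OOB AUC

    attrs_on_tree, first_auc = grow_tree(remaining)
    theta = theta_pct * first_auc              # threshold = Θ * AUC of the first tree
    auc = first_auc

    for _ in range(L):
        if len(attrs_on_tree) == 0:            # the induced tree is a leaf: stop
            break
        if auc >= theta:                       # tree good enough: keep its attributes
            selected.extend(attrs_on_tree.tolist())
        remaining = np.setdiff1d(remaining, attrs_on_tree)  # remove them either way
        if len(remaining) == 0:                # no attributes left: stop
            break
        attrs_on_tree, auc = grow_tree(remaining)
    return selected

# Usage on synthetic data: two informative attributes among fifty.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 50))
    y = (X[:, 3] + X[:, 7] > 0).astype(int)
    print("Θ = 1.00 ->", tree_filter(X, y, theta_pct=1.00))
    print("Θ = 0.95 ->", tree_filter(X, y, theta_pct=0.95))
```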
The filter approach proposed in this study can be seen as an extension of two previous studies [25, 26]. In [25], we induced ten decision trees from a microarray dataset (#3 in the Appendix), at each iteration removing the attributes that had appeared on the previous trees. The AUC values of the first three trees were 0.91, 0.68, and 0.94, respectively, indicating that the first tree does not always provide the best performance. At that time we had not yet conceived Algorithm 1, but that experiment corresponds to setting Θ = 1 and L = 10 in Algorithm 1. In [26], we conceived a preliminary version of Algorithm 1 without the parameter L (or, equivalently, with L = ∞). We also evaluated three Θ values (100%, 95%, and 75%). In general, the latter value produced worse results, sometimes significantly, than the original dataset. That experiment motivated us to include the parameter L in the present study (for performance reasons) and to use Θ = 100% and Θ = 95%.

Table 1 shows a running example of Algorithm 1 using Θ = 100%. Consider a dataset containing A = 10 attributes {a1, a2, ..., a10} and a class attribute. Assume that a decision tree containing attributes a1, a5 and a9, with AUC = 90%, is induced. Note that all trees induced with an AUC larger than or equal to θ = 90% will have their attributes selected by Algorithm 1 in the next steps. The first iteration starts by analyzing the tree (T1) that has already been built, and because AUC(T1) = 90% its attributes are selected. Still in the first iteration (as in the subsequent iterations), the attributes appearing on the first tree are removed and the second tree is grown. Assume now that this second tree, T2, contains attributes a4, a2, a10 and a8. The second iteration begins with the analysis of the second tree, which has AUC(T2) = 75%, lower than θ = 90%. Therefore, Algorithm 1 does not select the attributes appearing on this second tree. However, these attributes are removed from the dataset as before. The third tree, T3, is then induced. This time, assume that the attributes a6, a7 and a3 are within this tree. The third iteration starts and tests whether the tree has an AUC larger than or equal to θ = 90%. Since the third tree has AUC(T3) = 95%, the attributes on T3 are selected and then removed from the dataset. At the end of the third iteration, the fourth tree is induced, but all attributes have already been removed from the dataset. Therefore, the built tree is a leaf, and the stop criterion is reached. The selected attributes {a1, a5, a9, a6, a7, a3} are now returned, in this order, as the filter output. In this example, the default maximum number of steps, L = 4, is never reached.

Table 1  A running toy-example of Algorithm 1 for Θ = 100% and A = 10 attributes {a1, ..., a10}

Iteration  Tree  Attributes on tree Tj   AUC(Tj)  θ    Selected
1          T1    {a1, a5, a9}            90%      90%  {a1, a5, a9}
2          T2    {a4, a2, a10, a8}       75%      90%  {a1, a5, a9}
3          T3    {a6, a7, a3}            95%      90%  {a1, a5, a9, a6, a7, a3}
           T4    ∅                       End

Because each decision tree takes at most A·N·log2(N) steps (the worst case, where all attributes are continuous and have different values [27]), and since at most L decision trees are induced by Algorithm 1, its worst case is O(L·A·N·log2(N)). Using the default value L = log2(A), the worst case is O(A·log2(A)·N·log2(N)).
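The selection rule in the toy example can be replayed in a few lines; this is only a data-free restatement of Table 1 (the tree contents and AUC values are the assumed ones from the running example above), not part of the paper.

```python
# Replaying the Table 1 toy example: select the attributes of every tree whose
# out-of-bag AUC is at least θ = Θ * AUC(T1). Trees and AUCs are the assumed
# values from the running example, not computed from data.
trees = [
    (["a1", "a5", "a9"], 0.90),        # T1
    (["a4", "a2", "a10", "a8"], 0.75), # T2
    (["a6", "a7", "a3"], 0.95),        # T3
]

def replay(trees, theta_pct=1.0):
    theta = theta_pct * trees[0][1]    # threshold from the first tree
    selected = []
    for attrs, auc in trees:
        if auc >= theta:
            selected.extend(attrs)
        # attributes are removed from the dataset either way (implicit here)
    return selected, theta

selected, theta = replay(trees, theta_pct=1.0)
print(theta)     # 0.9
print(selected)  # ['a1', 'a5', 'a9', 'a6', 'a7', 'a3'], matching the example
```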
4 Experimental setup

We used 30 datasets, all of which represent real medical data, such as gene expressions, surveys, and diagnostics, to evaluate Algorithm 1. The Appendix presents the dataset descriptions. Because the number of attributes and instances in each dataset can influence the results, we used the density metric D3 proposed by [28] to partition the datasets into eight low-density (Density ≤ 1) and 22 high-density (Density > 1) datasets. The density is computed as Density = log_A((N + 1)/(c + 1)), where N represents the number of instances, A is the number of attributes, and c represents the number of classes.

For each dataset, we evaluated three different aspects in the experiments, performed by using the Weka machine learning library [29]:

1. Filter runtime. We computed the running time (in seconds) of each filter mentioned in Section 4.1 and transformed it into logarithmic decimal scale.
2. Filter compression capacity. The compression capacity can be defined as how much the filter can compact a dataset, or as how many attributes the filter can remove from the original dataset, hopefully without removing significant information. For example, for an original dataset containing 100 attributes, the filter is said to have achieved a compressibility (compression capacity) of 75% when the original dataset has been passed through the filter to create a filtered dataset containing only 25 attributes as output.
3. Filter impact over the inducer's performance. Because filters ignore the effects of the selected feature subset on the performance of the induction algorithm, we analyzed how filtering impacts the performance of the five inducers from different machine learning paradigms mentioned in Section 4.2.

Fig. 1  Evaluating filter impact over inducer's performance

Fig. 2  Runtime (upper, logarithmic decimal scale) and Percentage of Selected Attributes (lower) for CFS, Gain Ratio, Relief-F, Θ = 1, Θ = 0.95, and UT, over all, low-density, and high-density datasets

Ten-fold stratified cross-validation aided the evaluation in all three aspects; results were averaged. Specifically, for the third aspect, the baseline for comparisons is the AUC (Area Under the ROC Curve) value obtained by the classifier induced (I) with all attributes (no filtering) through ten-fold stratified cross-validation. For the filter impact over the inducer (F + I), we also used ten-fold stratified cross-validation, but the filter never saw each test fold, as shown in Fig. 1. In other words, the filter only sees nine folds as the full training set and finds an attribute subset. This subset is used to filter attributes from both the nine training folds and the remaining test fold. The nine filtered training folds are then fed to one of the inducers mentioned in Section 4.2, and its accuracy is evaluated on the filtered test fold. Again, this process was repeated ten times and the results were averaged.
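A minimal sketch of this F + I evaluation protocol, under our own assumptions (scikit-learn cross-validation, any filter exposing the interface of tree_filter from the earlier sketch, a binary class, and an arbitrary inducer), makes explicit that the filter is fitted on the nine training folds only:

```python
# Illustrative sketch of the F + I evaluation protocol: in each fold the filter is
# fitted on the training part only, and the resulting attribute subset is applied
# to both the training and the test part before the inducer is trained/evaluated.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def evaluate_filter_plus_inducer(X, y, fss, inducer_factory, n_splits=10, seed=0):
    aucs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        cols = fss(X[train_idx], y[train_idx])          # filter sees training folds only
        if len(cols) == 0:
            cols = np.arange(X.shape[1])                 # degenerate case: keep everything
        model = inducer_factory().fit(X[np.ix_(train_idx, cols)], y[train_idx])
        scores = model.predict_proba(X[np.ix_(test_idx, cols)])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))

# Usage, assuming tree_filter from the earlier sketch and Naive Bayes as the inducer:
# auc = evaluate_filter_plus_inducer(X, y, tree_filter, GaussianNB)
```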
To analyze the results, we applied the Friedman test [30] considering a confidence level of 95%; the null hypothesis assumes that all algorithms have equal performance. In the case of null hypothesis rejection, we employed the Benjamini-Hochberg post-hoc test [31] to detect any significant difference among algorithms. Tables in Section 5 show the results of the post-hoc test, where the symbol △ (▲) indicates that the algorithm in the row is better (significantly better) than the algorithm in the column; the symbol ▽ (▼) indicates that the algorithm in the row is worse (significantly worse) than the algorithm in the column. The symbol ◦ indicates that there is no difference whatsoever between the row and the column.

4.1 Filters

We evaluated Algorithm 1 by using two Θ values, Θ = 1.00 and Θ = 0.95, and the default value for L. In the results, we also incorporated the attributes selected by a single, unique decision tree (which corresponds to setting the parameters Θ = 1.00 and L = 1 in Algorithm 1), designated the 'UT' (Unique Tree) filter hereafter. We implemented Algorithm 1 as a novel Weka class, and the method buildDecisionTree in Algorithm 1 uses the algorithm J48 [29], a Java implementation of C4.5 [27]. We also used three additional filter algorithms, all of which employed Weka's default settings [29]: (i) CFS (Correlation-based Feature Selection), which uses the correlation in subsets to assess the predictive ability of each attribute in the subset together with the degree of redundancy among the attributes; this filter considers a subset good if the attributes contained therein correlate well with the class and are uncorrelated with each other [32]. (ii) Relief-F, whose basic idea is to choose a subset of instances randomly, calculate their nearest neighbors, and adjust a weight vector to give greater values to attributes that can differentiate an instance from its neighbors of different classes [33]. (iii) Gain Ratio, which uses the namesake metric to rank all attributes; it is based on the number of outcomes of each attribute [21].

Table 2  Benjamini-Hochberg post-hoc test (all / low-density / high-density datasets): Runtime and Percentage of Selected Attributes of Algorithm 1 (Θ = 1.00 and Θ = 0.95) and the other filters (CFS, Gain Ratio, Relief-F, and UT)

Fig. 3  AUC values on filter impact over inducer's performance, per inducer (J48, IBk3, NB, PART, and SMO), over all, low-density, and high-density datasets
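For readers who want to reproduce the Friedman/Benjamini-Hochberg analysis described at the beginning of this section, the sketch below shows one way to do it with standard Python libraries. It is a generic illustration under our own assumptions, not the authors' exact procedure: in particular, the choice of Wilcoxon signed-rank tests for the pairwise step and the synthetic scores are ours.

```python
# Generic sketch: Friedman test across filters over datasets, then pairwise
# Wilcoxon signed-rank tests adjusted with Benjamini-Hochberg (FDR) correction.
# The Wilcoxon choice for the pairwise step is an assumption, not the paper's.
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# results[name] = one score (e.g., AUC) per dataset, same dataset order everywhere
rng = np.random.default_rng(0)
results = {name: rng.uniform(0.6, 0.9, size=30)
           for name in ["CFS", "GainRatio", "ReliefF", "Theta1.00", "Theta0.95", "UT"]}

stat, p = friedmanchisquare(*results.values())
print(f"Friedman: chi2 = {stat:.3f}, p = {p:.4f}")

if p < 0.05:                                   # null hypothesis rejected
    pairs = list(combinations(results, 2))
    pvals = [wilcoxon(results[a], results[b]).pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    for (a, b), pa, r in zip(pairs, p_adj, reject):
        print(f"{a} vs {b}: adjusted p = {pa:.4f}, significant = {r}")
```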
4.2 Inducers

We used five different machine learning paradigms: (i) the PART rule learning algorithm, (ii) decision-tree learning, represented by the J48 algorithm, (iii) statistical learning, using Naïve Bayes (NB), (iv) support vector machines, with Sequential Minimal Optimization (SMO), and (v) lazy instance-based learning, using the IBk algorithm. We applied all of them with their default settings, except for IBk, for which k = 3; it will be designated IBk3 hereafter.
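The paper uses the Weka implementations of these inducers. As a rough, non-equivalent illustration, the scikit-learn analogues below could be assembled into a comparable experiment; they are stand-ins chosen by us, and PART has no direct scikit-learn counterpart, so a decision tree is reused as a placeholder.

```python
# Rough scikit-learn analogues of the five Weka inducers used in the paper.
# These are NOT the same implementations (e.g., J48 is C4.5, while scikit-learn
# trees are CART; PART has no scikit-learn counterpart), only stand-ins.
from sklearn.tree import DecisionTreeClassifier       # stand-in for J48 and PART
from sklearn.naive_bayes import GaussianNB             # stand-in for Naive Bayes
from sklearn.svm import SVC                            # stand-in for SMO
from sklearn.neighbors import KNeighborsClassifier     # stand-in for IBk

def inducer_factories():
    return {
        "J48":  lambda: DecisionTreeClassifier(),
        "PART": lambda: DecisionTreeClassifier(),      # placeholder: no PART in sklearn
        "NB":   lambda: GaussianNB(),
        "SMO":  lambda: SVC(probability=True),         # probability=True for AUC scoring
        "IBk3": lambda: KNeighborsClassifier(n_neighbors=3),
    }

# Each factory can be passed to evaluate_filter_plus_inducer from the earlier sketch.
```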
5 Results & discussion

Runtime. According to the top boxplots in Fig. 2 and to Table 2, Gain Ratio gave the shortest time among all the filters, followed by CFS, UT, Θ = 1.00, and Θ = 0.95. Relief-F provided the worst time in all the analyzed cases. Although Algorithm 1 using Θ = 1.00 or Θ = 0.95 was significantly worse than UT in all the high-density datasets, the average running time of Algorithm 1 was 1.39%, 14.68%, and 7.60% slower than UT for all, low-, and high-density datasets, respectively.

Compression capacity. According to the bottom boxplots in Fig. 2 and to Table 2, Relief-F selected 95.49%, 78.63%, and 98.45% of attributes for all, low-, and high-density datasets, on average. Relief-F was significantly worse than CFS, Θ = 1.00, Θ = 0.95, and UT in all three cases. Gain Ratio selected 87.44%, 87.80% and 49.55% of attributes in all, low-, and high-density datasets, respectively. CFS afforded the best average result (46.24%) and selected fewer attributes than UT (48.63%) for high-density datasets. For low-density datasets, the average percentages of attributes selected by CFS, Gain Ratio, Relief-F, Θ = 1.00, Θ = 0.95, and UT were 4.73%, 49.55%, 78.63%, 1.61%, 1.61%, and 0.14%, respectively. UT was significantly better than Gain Ratio and Relief-F in all three cases; UT was significantly better than Θ = 1.00 and Θ = 0.95 on all and high-density datasets, but not significantly better on low-density datasets.

Impact over inducer's performance. Figure 3 and Table 3 show the AUC values for all datasets and filter settings. UT was always worse, in many cases significantly, than any other filter or even the inducer without any filter. No filter (except for UT) caused a significant loss of accuracy when used in conjunction with an inducer. Except for CFS+NB, which was significantly better than Gain Ratio+NB for low-density datasets, there were no significant differences between the filters in terms of AUC. However, previous research found that the attribute subsets selected by different FSS algorithms were quite different [34, 35]. Thus, for data mining and knowledge extraction, applying multiple filters to a high-dimensional dataset seems to be interesting, since this application may yield complementary views of the problem at hand.

Table 3  Benjamini-Hochberg post-hoc test (all / low-density / high-density datasets) for AUC values on filter impact over inducer's performance: Algorithm 1 (Θ = 1.00 and Θ = 0.95) and the other filters (CFS, Gain Ratio, Relief-F, and UT) applied with five machine learning inducers (J48, IBk3, NB, PART, and SMO). The notation F + I indicates that the dataset was filtered by using filter F, then the inducer I was applied to the filtered dataset and the AUC metric was measured; the notation I indicates that the inducer was evaluated without any filter, as explained in Section 4.

To understand the impact of the filter on the performance of the inducer, we analyzed the improvement or reduction in AUC values, comparing the performance of the inducer without any filter, AUC(I), to the performance of the inducer using filtered features, AUC(F + I). This comparison was expressed as the ratio AUC(F + I)/AUC(I) − 1. By using this ratio, improvements in AUC values are expressed as positive figures (filters increased the performance) and reductions are expressed as negative ones (filters decreased the performance). We used the average of these ratios to summarize the results presented in Table 4.

Table 4  Median AUC improvement/reduction in filter impact over all inducers' performance. Positive figures mean AUC improvements when using filters; negative figures mean AUC reductions. Figures are expressed as percentages.

Filter      Low-density  High-density   All
CFS                0.00          4.19    1.82
Gain Ratio        -6.83          1.80    0.61
Relief-F          -5.59          1.80    0.61
Θ = 1.00          -0.62          1.80    0.61
Θ = 0.95          -0.62          1.80    0.61
UT               -19.88         -8.98  -10.30

The last column of Table 4 shows that UT was the worst filter in all cases; comparing the results of this table with the data in Table 3 (column UT+I), the behavior of UT as a filter is clearly often significantly worse than using the inducer I without this sort of filter. This degradation in performance led us to suggest that the UT filter should not be considered in real-world applications.
Excluding the UT filter and considering the remaining filters over all datasets, CFS showed almost 2% performance gain on average, versus almost 1% for Gain Ratio, Relief-F, Θ = 1.00, and Θ = 0.95. For high-density datasets, these gains almost doubled. On the other hand, for low-density datasets the performance degradation obtained for Θ = 1.00 and Θ = 0.95 (less than 1%) was similar to the performance degradation obtained for the CFS filter (0%), but smaller than the performance degradation obtained for the Gain Ratio and Relief-F filters (greater than 5%).

In summary, the results showed that the runtime of Algorithm 1 is very close to, but sometimes significantly slower than, the runtime of a unique tree. The compression capacity of Algorithm 1 is also significantly worse than the compression capacity of inducing a unique tree. In contrast, Algorithm 1 performs significantly better than a unique decision tree, and it performs as well as the other existing filters (except for UT). It is noteworthy that, although Algorithm 1 uses trees to select attributes (its bias), these selected attributes do not degrade the performance of algorithms with different learning biases.

6 Conclusion

In this paper, we proposed an iterative decision tree-based filter for feature subset selection. Although the proposed filter can use any inducer with embedded feature selection and any metric to determine whether the selection of an attribute is desirable, we fixed J48 as the filter inducer and AUC as the selection metric. Using several medical datasets, we evaluated our filter in terms of running time, compression capacity and performance over five machine learning paradigms. Overall, our approach took about as long as a simpler filter that generates a single decision tree and uses its attributes (UT), but it performed better, and its performance was comparable to that of the other filters. The compression capacity of our algorithm on the 30 datasets was less than 80%, whereas the filters Gain Ratio and Relief-F selected 87.44% and 95.49% of the attributes, respectively; the filter CFS selected 34.81% of the attributes. However, considering high-dimensional datasets or, equivalently, low-density datasets, our algorithm selected less than 2% of the attributes without harming performance.

Considering performance, UT was always worse, in many cases significantly, than any other filter or even the inducer without any filter. Hence, we do not recommend the use of UT as a filter in daily practice. No filter (except for UT), including our approach, caused a significant loss of accuracy when used in conjunction with an inducer. This reinforces the fact that machine learning and data mining practitioners should consider these filters, particularly for large databases. Because the attribute subsets selected by different FSS algorithms are generally quite distinct, our approach constitutes a good alternative for knowledge discovery in high-dimensional datasets, typically found in the medical, biomedical or biological domains. Our algorithm is also suitable for problem transformation and algorithm adaptation and has potential for use in low-density datasets.
Acknowledgements  This work was partially funded by a joint grant between the National Research Council of Brazil (CNPq) and the Amazon State Research Foundation (FAPEAM) through the Program National Institutes of Science and Technology, INCT ADAPTA Project (Centre for Studies of Adaptations of Aquatic Biota of the Amazon). We are thankful to Cynthia M. Campos Prado Manso for thoroughly reading the draft of this paper.

Appendix: Datasets

The experiments reported here used 30 datasets, all of them representing real medical data, such as gene expressions, surveys, and diagnoses. The medical domain often imposes difficult obstacles to learning algorithms: high dimensionality, a huge or very small number of instances, several possible class values, unbalanced classes, etc. This sort of data is well suited to filters, not only because of its large dimension but also because filters have a computational efficiency advantage over wrappers [36]. Table 5 shows a summary of the datasets, none of which have missing values for the class attribute. Since the number of attributes and instances in each dataset can influence the results, we used the density metric D3 proposed by [28], partitioning the datasets into 8 low-density (Density ≤ 1) and 22 high-density (Density > 1) datasets. We computed the density as Density = log_A((N + 1)/(c + 1)), where N represents the number of instances, A is the number of attributes, and c represents the number of classes.

Table 5  Summary of the datasets used in the experiments

#   Dataset           N     c   A     MISS    Density
1   Lymphoma          96    9   4026  5.09%   0.27
2   CNS               60    2   7129  0.00%   0.34
3   Leukemia          72    2   7129  0.00%   0.36
4   Leukemia nom.     72    2   7129  0.00%   0.36
5   Colon             62    2   2000  0.00%   0.40
6   Lung Cancer       32    3   56    0.28%   0.52
7   C. Arrhythmia     452   16  279   0.32%   0.58
8   Ecoli             482   13  280   1.07%   0.63
9   Dermatology       366   6   34    0.06%   1.12
10  Lymphography      148   4   18    0.00%   1.17
11  HD Switz.         123   5   13    17.07%  1.18
12  Hepatitis         155   2   19    5.67%   1.34
13  P. Patient        90    3   8     0.42%   1.50
14  HD Hungarian      294   5   13    20.46%  1.52
15  HD Cleveland      303   5   13    0.18%   1.53
16  WDBC              569   2   30    0.00%   1.54
17  Splice Junction   3190  3   60    0.00%   1.63
18  Thyroid 0387      9172  32  29    5.50%   1.67
19  Heart Statlog     270   2   13    0.00%   1.76
20  Allhyper          3772  5   29    5.54%   1.91
21  Allhypo           3772  4   29    5.54%   1.97
22  Breast Cancer     286   2   9     0.35%   2.08
23  Sick              3772  2   29    5.54%   2.12
24  Hypothyroid       3163  2   25    6.74%   2.16
25  ANN Thyroid       7200  3   21    0.00%   2.46
26  WBC               699   2   9     0.25%   2.48
27  Liver Disorders   345   2   6     0.00%   2.65
28  Pima Diabetes     768   2   8     0.00%   2.67
29  C. Method         1473  3   9     0.00%   2.69
30  H. Survival       306   2   3     0.00%   4.21

N, A and c stand for the number of instances, number of attributes, and number of classes, respectively; MISS represents the percentage of attributes with missing values, not considering the class attribute. Datasets are in ascending order of Density.

Next, we provide a brief description of each dataset. Breast Cancer, Lung Cancer, CNS (Central Nervous System Tumour Outcome), Colon, Lymphoma, Leukemia, Leukemia nom., WBC (Wisconsin Breast Cancer), WDBC (Wisconsin Diagnostic Breast Cancer), Lymphography and H. Survival (H. stands for Haberman's) are all related to cancer, and their attributes consist of clinical, laboratory and gene expression data. Leukemia and Leukemia nom. represent the same data, but the second one had its attributes discretized [25]. C. Arrhythmia (C. stands for Cardiac), Heart Statlog, HD Cleveland, HD Hungarian and HD Switz. (Switz. stands for Switzerland) are related to heart diseases, and their attributes represent clinical and laboratory data.
Allhyper, Allhypo, ANN Thyroid, Hypothyroid, Sick and Thyroid 0387 are a series of datasets related to thyroid conditions. Hepatitis and Liver Disorders are related to liver diseases, whereas C. Method (C. stands for Contraceptive), Dermatology, Pima Diabetes (Pima Indians Diabetes) and P. Patient (P. stands for Postoperative) are other datasets related to human conditions. Splice Junction is related to the task of predicting boundaries between exons and introns. Ecoli is related to protein localization sites. Datasets were obtained from the UCI Repository [37]; Leukemia and Leukemia nom. were obtained from [38].

References

1. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507
2. Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence, Menlo Park, pp 1–30
3. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I et al (2006) Machine learning in bioinformatics. Brief Bioinform 7(1):86–112
4. Foithong S, Pinngern O, Attachoo B (2011) Feature subset selection wrapper based on mutual information and rough sets. Expert Systems with Applications
5. Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques. Morgan Kaufmann
6. Ditzler G, Morrison J, Lan Y, Rosen G (2015) Fizzy: feature subset selection for metagenomics. BMC Bioinformatics 16(1):358. Available from: http://www.biomedcentral.com/1471-2105/16/358
7. Mandal M, Mukhopadhyay A, Maulik U (2015) Prediction of protein subcellular localization by incorporating multiobjective PSO-based feature subset selection into the general form of Chou's PseAAC. Med Biol Eng Comput 53(4):331–344. doi:10.1007/s11517-014-1238-7
8. Purkayastha P, Rallapalli A, Bhanu Murthy NL, Malapati A, Yogeeswari P, Sriram D (2015) Effect of feature selection on kinase classification models. In: Muppalaneni NB, Gunjan VK (eds) Computational intelligence in medical informatics. SpringerBriefs in Applied Sciences and Technology. Springer, Singapore, pp 81–86. doi:10.1007/978-981-287-26098
9. Devaraj S, Paulraj S (2015) An efficient feature subset selection algorithm for classification of multidimensional dataset. The Scientific World Journal 2015, Article ID 821798, 9 pp. doi:10.1155/2015/821798
10. Govindan G, Nair AS (2014) Sequence features and subset selection technique for the prediction of protein trafficking phenomenon in Eukaryotic non membrane proteins. International Journal of Biomedical Data Mining 3(2):1–9. Available from: http://www.omicsonline.com/open-access/sequence-features-and-subset-selection-technique-for-the-prediction-of-protein-trafficking-phenomenon-in-eukaryotic-non-membrane-proteins-2090-4924.1000109.php?aid=39406
11. Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1–2):245–271
12. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324. Available from: http://www.sciencedirect.com/science/article/pii/S000437029700043X
13. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422. doi:10.1023/A:1012487302797
14. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182. Available from: http://dl.acm.org/citation.cfm?id=944919.944968
15. Uncu Ö, Türkşen IB (2007) A novel feature selection approach: combining feature wrappers and filters. Inf Sci 177(2):449–466. Available from: http://www.sciencedirect.com/science/article/pii/S0020025506000806
16. Min H, Fangfang W (2010) Filter-wrapper hybrid method on feature selection. In: 2010 2nd WRI Global Congress on Intelligent Systems (GCIS), vol 3. IEEE, pp 98–101
17. Lan Y, Ren H, Zhang Y, Yu H, Zhao X (2011) A hybrid feature selection method using both filter and wrapper in mammography CAD. In: Proceedings of the 2011 International Conference on Image Analysis and Signal Processing (IASP). IEEE, pp 378–382
18. Estévez PA, Tesmer M, Perez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20(2):189–201
19. Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Machine Learning International Conference, vol 20, p 856. Available from: http://www.public.asu.edu/~huanliu/papers/icml03.pdf
20. Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the 10th National Conference on Artificial Intelligence (AAAI'92). AAAI Press, pp 129–134. Available from: http://dl.acm.org/citation.cfm?id=1867135.1867155
21. Hall MA, Smith LA (1998) Practical feature subset selection for machine learning. In: McDonald C (ed) Proceedings of the 21st Australasian Computer Science Conference (ACSC'98), Perth, 4–6 February. Springer, Berlin, pp 181–191
22. Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the 17th International Conference on Machine Learning (ICML '00). Morgan Kaufmann Publishers Inc., San Francisco, pp 359–366. Available from: http://dl.acm.org/citation.cfm?id=645529.657793
23. Gao K, Khoshgoftaar T, Van Hulse J (2010) An evaluation of sampling on filter-based feature selection methods. In: Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, pp 416–421
24. Efron B, Tibshirani R (1997) Improvements on cross-validation: the 632+ bootstrap method. J Am Stat Assoc 92(438):548–560
25. Netto OP, Nozawa SR, Mitrowsky RAR, Macedo AA, Baranauskas JA, Lins CUN (2010) Applying decision trees to gene expression data from DNA microarrays: a Leukemia case study. In: XXX Congress of the Brazilian Computer Society, X Workshop on Medical Informatics, p 10
26. Netto OP, Baranauskas JA (2012) An iterative decision tree threshold filter. In: XXXII Congress of the Brazilian Computer Society, X Workshop on Medical Informatics, p 10
27. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
28. Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest? In: Proceedings of the 8th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM'12). Springer, Berlin Heidelberg, pp 154–168. doi:10.1007/978-3-642-31537-4_13
29. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann
30. Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
31. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300
32. Hall MA, Smith LA (1997) Feature subset selection: a correlation based filter approach. In: 1997 International Conference on Neural Information Processing and Intelligent Information Systems. Springer, pp 855–858
33. Wang Y, Makedon F (2004) Application of Relief-F feature filtering algorithm to selecting informative genes for cancer classification using microarray data. In: Proceedings of the Computational Systems Bioinformatics Conference (CSB 2004). IEEE, pp 497–498
34. Baranauskas JA, Monard MC (1999) The MLC++ wrapper for feature subset selection using decision tree, production rule, instance based and statistical inducers: some experimental results. Technical Report 87, ICMC-USP. Available from: http://dcm.ffclrp.usp.br/augusto/publications/rt_87.pdf
35. Lee HD, Monard MC, Baranauskas JA (1999) Empirical comparison of wrapper and filter approaches for feature subset selection. Technical Report 94, ICMC-USP. Available from: http://dcm.ffclrp.usp.br/augusto/publications/rt_94.pdf
36. Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms. Wiley-IEEE Press
37. Frank A, Asuncion A (2010) UCI machine learning repository. Available from: http://archive.ics.uci.edu/ml
38. Broad Institute (2010) Cancer program data sets. Available from: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi