Appl Intell
DOI 10.1007/s10489-017-1008-y
A tree-based algorithm for attribute selection
José Augusto Baranauskas¹ · Oscar Picchi Netto¹ · Sérgio Ricardo Nozawa² · Alessandra Alaniz Macedo¹
© Springer Science+Business Media, LLC 2017
Abstract This paper presents an improved version of a
decision tree-based filter algorithm for attribute selection.
This algorithm can be seen as a pre-processing step for induction algorithms in machine learning and data mining tasks. The filter was evaluated on thirty medical
datasets considering its execution time, data compression
ability and AUC (Area Under ROC Curve) performance. On
average, our filter was faster than Relief-F but slower than
both CFS and Gain Ratio. However, for low-density (high-dimensional) datasets, our approach selected less than 2% of all attributes while causing no performance degradation in its further evaluation with five different machine learning algorithms.
Keywords Attribute selection · Filter · Decision tree ·
High dimensional data · Data pre-processing
José Augusto Baranauskas
[email protected]
Oscar Picchi Netto
[email protected]
Sérgio Ricardo Nozawa
[email protected]
Alessandra Alaniz Macedo
[email protected]
¹ Department of Computer Science and Mathematics, Faculty of Philosophy, Sciences and Languages at Ribeirao Preto, University of Sao Paulo (USP), Av. Bandeirantes, 3900, Ribeirão Preto, SP, 14040-901, Brazil

² Dow AgroSciences (Seeds, Traits, Oils), Av. Antonio Diederichsen, 400, Ribeirão Preto, SP, 14020-250, Brazil
1 Introduction
Data Mining (DM) is an interdisciplinary field that brings
together techniques from Machine Learning, statistics, pattern recognition, databases, and visualization
to address the issue of extracting high-level knowledge
from low-level data in large databases [1]. When Machine
Learning (ML) techniques are used for DM, where the
number of records (instances) is very large, several representative samples from the database are usually taken and
presented to an ML algorithm. Then, knowledge extracted
from these samples by ML algorithms is combined in some
way [2].
The exponential growth in the amount of available biological data raises two problems: efficient information storage and management, and extraction of useful information from these data [3]. Regarding the use of ML in DM, one important issue to consider is reducing the dimensionality of database records, which can be achieved by reducing the number of record attributes (i.e., deleting columns of tables, in the database literature, or features/attributes, in the Machine Learning literature). The data subset resulting from these deletions maintains the same number of instances but retains only a subset of features whose predictive performance is comparable to that of the full set. This process of attribute elimination is known as the Feature Subset Selection (FSS) problem, where one of the central issues is the selection of relevant features and/or the elimination of irrelevant ones.
Using ML and DM algorithms is a strategy to extract information more efficiently. However, when the amount of data is huge, the use of an efficient FSS algorithm is sometimes essential, not only to speed up the algorithms but also to reduce the amount of data to be processed. This is why
FSS, initially an illustrative example, has become a real prerequisite for building models [1]. In the particular case of
medical or biological data analysis or even in text mining,
the amount of data is huge, and an FSS algorithm can help
to reduce it.
There are several reasons to conduct FSS. First, FSS
generally improves accuracy because many ML algorithms
perform poorly when given too many features. Second,
FSS may improve comprehensibility, which is the ability
of humans to understand the data and the classification
rules induced by symbolic ML algorithms, such as rules
and decision trees. Finally, FSS can reduce measurement
cost because measuring features may be expensive in some
domains. In this study, we present an approach to FSS that
employs decision trees within a filter algorithm [4].
This work is organized as follows: Section 2 presents
the basic concepts of the FSS problem; Section 3 describes
the algorithm proposed in this study; Section 4 shows the
experimental setup used to evaluate the proposed algorithm;
Section 5 shows the experiments and discusses the results;
and Section 6 presents the conclusion of this study.
2 Feature subset selection
Supervised learning is the process of automatically creating a classification model from a set of instances (records or examples), called the training set, each of which belongs to one of a set of classes. There are two aspects to consider in this process: the
features that should be used to describe the concept and the
combination of these features. Once a model (classifier) is
created, it can help to predict the class of other unclassified
examples automatically.
In other words, in supervised learning, an inducer is given
a set of N training examples containing A attributes. Each
example x is an element of the set F1 × F2 × . . . × FA ,
where Fj is the domain of the j th feature. Training examples are tuples (x, y) where y is the label, output or class.
The y values are typically drawn from a discrete set of
c classes {1, . . . , c} in the case of classification or from
the real values in the case of regression. In this work, we
will refer to classification. Given a set of training examples, the learning algorithm (inducer) outputs a classifier
such that, given a new instance, it accurately predicts the
label y.
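As a concrete, purely illustrative sketch of this representation, a training set can be stored as an N × A attribute matrix plus a length-N label vector; the values below are made up and are not taken from any of the paper's datasets:

```python
import numpy as np

# N examples, A attributes: X[i] is the attribute vector x_i in F1 x ... x FA,
# and y[i] is its class label drawn from {1, ..., c}.
N, A, c = 4, 3, 2
X = np.array([[5.1, 0.2, 7.0],
              [4.9, 0.4, 6.3],
              [6.2, 0.1, 5.8],
              [5.5, 0.3, 6.1]])
y = np.array([1, 1, 2, 2])

assert X.shape == (N, A) and y.shape == (N,)
```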
One of the central problems in supervised learning is
selection of useful features. Although most learning methods attempt to either select features or assign them degrees
of importance, both theoretical analysis and experimental studies indicate that many algorithms scale poorly to
domains with large numbers of irrelevant features. For
example, the number of training cases that are necessary
for the simple nearest neighbor to reach a given level of
accuracy appears to grow exponentially with the number
of irrelevant features, independently of the target concept.
Even methods that induce univariate decision trees, which
explicitly select some attributes in favor of others, exhibit
this behavior for some target concepts. Some techniques,
like the Naïve Bayes classifier, are robust with respect to irrelevant features but can be very sensitive to domains with correlated features, even if those features are relevant. Because this sort of technique relies on the assumption of independence among features, additional methods might be necessary to select a useful subset of features when many features are available [5].
For instance, biological and medical domains often
impose difficult obstacles to learning algorithms such as
high dimensionality, a huge or very small amount of
instances, several possible class values, and unbalanced
classes. This may explain why researchers are still proposing a variety of algorithms although research on FSS is not
new in the ML community [6–10].
According to [11], approaches to feature selection developed in the research literature can be grouped into three
classes: (i) approaches that embed the selection within the
basic induction algorithm, (ii) approaches that use feature
selection to filter features during a pre-processing step while
ignoring the induction algorithm, and (iii) approaches that
treat feature selection as a wrapper around the induction
process, using the induction algorithm as a black box (see
also [12–15]). Another possible approach is to use a hybrid
(filter and wrapper) method to try to optimize the efficiency
of the feature selection process [15–18].
2.1 The filter approach
In the FSS filter approach, which is of special interest within
the scope of this work, features are filtered regardless of
the induction algorithm. In this approach, FSS is accomplished as a pre-processing step in which the effects of the selected feature subset on the performance of the induction algorithm are completely ignored. For example, a simple
decision tree algorithm can be used as an FSS filter to select
features in large feature space for other inducers that take
longer to search their solution space. The set of features selected by the tree is the output of the FSS filter process, and the tree itself is discarded. The remaining unused features are then deleted from the training set, reducing the training set dimension. Any other inducer can use this training set to extract a classifier. Still, features that are good for
decision trees are not necessarily useful for another family
of algorithms that may have an entirely different inductive
bias.
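The following sketch illustrates this idea with scikit-learn as a stand-in for the Weka tools used later in the paper; the dataset, the tree settings, and the Naive Bayes inducer are arbitrary choices for illustration, not the paper's setup:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Grow a decision tree only to find out which features it splits on.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
selected = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])

# The tree itself is discarded; only the selected columns are kept,
# and any other inducer can be trained on the reduced training set.
X_reduced = X[:, selected]
other_inducer = GaussianNB().fit(X_reduced, y)
print("features kept by the tree filter:", selected.tolist())
```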
Filtering algorithms can be grouped based on whether
they evaluate the relevance of features individually or
through feature subsets [19]. Algorithms in the first group
assign some relevancy score to features individually and
rank them based on their relevance to the target class concept. A feature is selected if its relevance is greater than a
certain threshold. These algorithms can only capture the relevance of features w.r.t. the target concept, but cannot find
redundancy among features. Two well-known algorithms
that rely on individual relevance evaluation are Relief [20]
and Gain Ratio [21].
Algorithms in the second group search through feature
subsets, guided by some relevancy score computed for each
subset. The subset is selected when the search stops. In this
group, different algorithms are designed by changing the
relevancy score as well as the search strategy. The algorithm
CFS uses heuristic search and a correlation relevancy score
[22]. The correlation score assumes that good feature subsets contain features highly correlated with the target class,
yet uncorrelated with (not predictive of) each other. In the
experiments reported in Section 4 algorithms from these two
groups of filters were used: Gain Ratio, Relief-F, and CFS.
As mentioned earlier, the main disadvantage of the filter
approach is that it totally ignores the effects of the selected
feature subset on the performance of the induction algorithm. However, an interesting property of filters is that
once a dataset is filtered it can be used and evaluated by
several inducers and/or paradigms, thus saving computational time. The next section describes the filter approach
proposed in this study.
3 A tree-based filter
As mentioned, in general, fast filter algorithms evaluate
each attribute individually for some degree of relevance
related to the target concept class. Sometimes two or more
attributes can be considered at a time but at a high computational cost [23]. Our approach differs from fast filter
algorithms in the sense that a decision tree may be able
to capture relationships among several attributes w.r.t. the
class at a time. Besides that, inducing a decision tree is fast,
which allows one to perform this process on high dimensional datasets commonly found in gene expression profiles,
massive medical databases or text mining tasks.
Our filter approach iteratively builds a decision tree,
selects attributes appearing on that tree (based on a threshold
from the first tree performance), and removes them from the
training set. These steps are repeated until (a) there
are no more attributes left in the training set, (b) the induced
decision tree is a leaf (which means no attributes can separate class concepts), or (c) the filter reaches a maximum
number of iteration steps. In the end, the filter outputs the
selected attributes.
The idea of using the performance of the first tree as a threshold is based on the wrapper heuristic, where the simplest FSS approach uses the performance of some classifier as the relevancy score. In this sense, only good features are selected: a feature is considered good, and is thus selected, if its relevance weight is greater than a threshold value.
Algorithm 1 shows the high-level code of our attribute
selection approach, where N represents the number of
instances in the training set, and xi and yi , i = 1, . . . , N,
represent a vector containing the attribute values and the
class label for instance i, respectively. A represents the
number of attributes.
1. First, a bootstrap sample [24] from all instances is
taken, creating the training set (Line 2). Instances that
do not appear in the training set (Bag) are set apart as
the test set, also known as the out-of-bag (OutOfBag)
set (Line 3).
2. The first decision tree is induced by using Bag as the
training set (Line 6) and its AUC value is computed from
the out-of-bag set, multiplied by the Θ parameter, and
then stored in the threshold θ (Line 7). In other words,
the threshold θ is the percentage Θ of the AUC from
the first tree, computed from the out-of-bag set.
3. Next, attributes are selected in the following way. At
every iteration l, the AUC obtained by the decision
tree Tl from the out-of-bag, AUC(Tl ,OutOfBag), is
compared to the threshold θ, which determines whether or not the attributes appearing on that tree are selected. All attributes on the
tree (AttrOnClassifier) are now removed from
the training (Bag) (Line 13) and test (OutOfBag)
sets (Line 14), and a new tree is grown (Line 16). As
already mentioned, this process is repeated until (a) a leaf is induced, (b) all attributes have been used, or (c)
the maximum number of steps L is reached (Line 17).
Finally, all the selected attributes are returned (Line 18).
Therefore, in the loop (Lines 8–17), if Θ = 0 then all attributes appearing on induced trees will be selected by the filter, regardless of their AUC values. If Θ = 1, only attributes appearing on induced trees with AUC greater than or equal to that of the first tree will be selected by the filter.
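The sketch below is one possible reading of Algorithm 1 in Python. It is not the authors' Weka implementation: it uses scikit-learn's DecisionTreeClassifier in place of J48 and roc_auc_score for the AUC, assumes a binary class (the paper uses AUC more generally), and defaults L to ⌈log2 A⌉ as mentioned in the text; details such as tie handling may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score


def tree_based_filter(X, y, theta_frac=1.0, max_steps=None, random_state=0):
    """Iterative tree-based attribute selection (a sketch of Algorithm 1).

    theta_frac plays the role of Theta and max_steps the role of L
    (defaulting to ceil(log2(A)) as in the paper).
    """
    rng = np.random.RandomState(random_state)
    n, a = X.shape
    if max_steps is None:
        max_steps = max(1, int(np.ceil(np.log2(a))))

    # Bootstrap sample (Bag); instances left out form the OutOfBag test set.
    bag = rng.choice(n, size=n, replace=True)
    oob = np.setdiff1d(np.arange(n), bag)
    remaining = np.arange(a)            # attributes still in the training set
    selected = []                       # filter output, in selection order

    def grow_tree(cols):
        return DecisionTreeClassifier(random_state=random_state).fit(
            X[np.ix_(bag, cols)], y[bag])

    def oob_auc(tree, cols):
        scores = tree.predict_proba(X[np.ix_(oob, cols)])[:, 1]
        return roc_auc_score(y[oob], scores)

    tree = grow_tree(remaining)
    threshold = theta_frac * oob_auc(tree, remaining)       # theta

    for _ in range(max_steps):
        used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        if used.size == 0:              # the induced tree is a leaf: stop
            break
        if oob_auc(tree, remaining) >= threshold:
            selected.extend(remaining[used].tolist())        # select attributes
        remaining = np.delete(remaining, used)               # remove them anyway
        if remaining.size == 0:         # no attributes left: stop
            break
        tree = grow_tree(remaining)

    return selected
```

With theta_frac = 1.0 and max_steps = 1 this sketch degenerates to keeping the attributes of a single tree, which corresponds to the UT baseline described later in Section 4.1.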
The filter approach proposed in this study can be seen as
an extension of two previous studies [25, 26]. In [25] we
have induced ten decision trees from a microarray dataset (#3 in the Appendix), at each iteration removing the attributes that appeared on previous trees. In the first three
trees, their AUC values were 0.91, 0.68, and 0.94, respectively, indicating that the first tree does not always provide
the best performance. At that time, we had not conceived
Algorithm 1, but that experiment corresponds to setting
Θ = 1 and L = 10 in Algorithm 1 nowadays. In [26], we
conceived a preliminary version of Algorithm 1 without the
parameter L (or equivalently, L = ∞). We also evaluated
three Θ values (100%, 95%, and 75%). In general, the latter value produced worse results, sometimes significantly so, than the original dataset. That experiment motivated us to include the parameter L in the present study (for
performance reasons) and to use Θ = 100% and Θ = 95%.
Table 1 shows a running example of Algorithm 1 using Θ = 100%. Consider a dataset containing A = 10 attributes {a1, a2, . . . , a10} and a class attribute. Assume that a
decision tree containing attributes a1 , a5 and a9 and AUC
= 90% is induced. Note all the trees induced with an AUC
larger than or equal to θ = 90% will have attributes selected
by Algorithm 1 in the next steps.
The first iteration starts by analyzing the tree (T1 ) that
has already been built and because its AUC(T1 ) = 90% it
will have its attributes selected. Still in the first iteration (as
well as in the subsequent iterations), the attributes appearing on the first tree are removed and the second tree is
grown.
Assume now that this second tree T2 contains attributes
a4 , a2 , a10 and a8 . The second iteration begins with analysis of the second tree, which has AUC(T2 ) = 75%, which
is lower than θ = 90%. Therefore, Algorithm 1 will not
select attributes appearing in this second tree. However,
these attributes are removed from the dataset as before.
The third tree T3 is then induced. This time, assume that
the attributes a6 , a7 and a3 are within this tree. The third iteration starts and tests whether the tree has an AUC larger than
θ = 90%. Assuming that the third tree has an AUC(T3 )
= 95%, the attributes on T3 will be selected and then
removed from the dataset. At the end of the third iteration,
the fourth tree is induced, but all attributes had already been
removed from the dataset. Therefore, the built tree is a leaf,
and the stop criterion is achieved. The selected attributes
{a1, a5, a9, a6, a7, a3} are now returned, in this order, as the filter output. In this example, the default maximum number of steps, L = 4, is never reached.

Table 1 A running toy-example of Algorithm 1 for Θ = 100% and A = 10 attributes {a1, . . . , a10}

Iteration  Tree  Attributes on Tree Tj    AUC(Tj)  θ    Selected
1          T1    {a1, a5, a9}             90%      90%  {a1, a5, a9}
2          T2    {a4, a2, a10, a8}        75%      90%  {a1, a5, a9}
3          T3    {a6, a7, a3}             95%      90%  {a1, a5, a9, a6, a7, a3}
           T4    ∅                        End
Because each decision tree takes at most AN log2 (N)
steps (the worst case where all attributes are continuous
and have different values [27]), and since at most L decision trees are induced by Algorithm 1, its worst case
is O(LAN log2 (N)). Using the default value for L =
log2 (A), the worst case is O(A log2 (A)N log2 (N)).
4 Experimental setup
We used 30 datasets, all of which represent real medical
data, such as gene expressions, surveys, and diagnostics, to
evaluate Algorithm 1. The Appendix presents the dataset descriptions. Because the number of attributes and instances in each dataset can influence the results, we used the density metric D3 proposed by [28] to partition the datasets into 8 low-density (Density ≤ 1) and 22 high-density (Density > 1) datasets. The density is computed as Density = log_A((N + 1)/(c + 1)), where N represents the number of instances, A is the number of attributes, and c represents the number of classes.
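As a quick check of this formula, the Leukemia dataset (#3 in the Appendix, with N = 72, A = 7129 and c = 2) gives Density = log_7129(73/3) ≈ 0.36, matching Table 5; the snippet below simply reproduces that arithmetic:

```python
import math

def density(n_instances, n_attributes, n_classes):
    # Density = log_A((N + 1) / (c + 1))
    return math.log((n_instances + 1) / (n_classes + 1), n_attributes)

print(round(density(72, 7129, 2), 2))   # Leukemia (#3): 0.36, a low-density dataset
```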
For each dataset, we evaluated three different aspects
in the experiments performed by using the Weka machine
learning library [29]:
1. Filter runtime. We computed the running time (in seconds) of each filter mentioned in Section 4.1 and transformed it to a decimal logarithmic scale.
Fig. 1 Evaluating filter impact over inducer's performance

Fig. 2 Runtime (upper, decimal logarithmic scale) and Percentage of Selected Attributes (lower) for CFS, Gain Ratio, Relief-F, Θ = 1, Θ = 0.95, and UT over all, low-density, and high-density datasets
2. Filter compression capacity. The compression capacity can be defined as how much the filter can compact a dataset, that is, how many attributes the filter can remove from the original dataset, hopefully without removing significant information. For example, for an original database containing 100 attributes, the filter is said to have achieved a compressibility (compression capacity) of 75% when the original dataset has been passed through the filter to create a filtered dataset containing only 25 attributes as output.
3. Filter impact over inducer's performance. Because filters ignore the effects of the selected feature subset on the performance of the induction algorithm, we analyzed how filtering impacts the performance of the five inducers from different machine learning paradigms mentioned in Section 4.2.
We used ten-fold stratified cross-validation to evaluate all three aspects; results were averaged. Specifically, for the third aspect, the baseline for comparisons is the AUC (Area Under the ROC Curve) value obtained by the classifier induced (I) with all attributes (no filtering) through ten-fold stratified cross-validation. For the filter impact over the inducer (F + I), we also used ten-fold stratified cross-validation, but the filter never saw the test fold, as shown in Fig. 1. In other words, the filter only sees nine folds as the full training set and finds an attribute subset. This subset is used to filter attributes from both the nine training folds and the remaining test fold. The nine filtered training folds are then fed to one of the inducers mentioned in Section 4.2, and its performance is evaluated on the filtered test fold. Again, this process was repeated ten times and results were averaged.
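A sketch of this protocol is shown below, again with scikit-learn stand-ins rather than the Weka components used in the paper; filter_fn is any attribute-selection function with the signature of the tree_based_filter sketch above, and a binary class, a probabilistic inducer, and a non-empty selected attribute list are assumed so that the AUC can be computed directly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score


def auc_of_filter_plus_inducer(X, y, filter_fn, make_inducer, n_folds=10, seed=0):
    """AUC of F + I: the filter only ever sees the nine training folds."""
    aucs = []
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        cols = filter_fn(X[train_idx], y[train_idx])    # filter never sees the test fold
        clf = make_inducer().fit(X[np.ix_(train_idx, cols)], y[train_idx])
        scores = clf.predict_proba(X[np.ix_(test_idx, cols)])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))

# e.g. auc_of_filter_plus_inducer(X, y, tree_based_filter, GaussianNB)
```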
To analyze the results, we applied the Friedman test [30] considering a confidence level of 95%; the null hypothesis assumes that all algorithms have equal performance. In the case of null hypothesis rejection, we employed the Benjamini-Hochberg post-hoc test [31] to detect any significant difference among algorithms. Tables in Section 5 show the results of the post-hoc test, where the symbol △ (▲) indicates that the algorithm in the row is (significantly) better than the algorithm in the column; the symbol ▽ (▼) indicates that the algorithm in the row is (significantly) worse than the algorithm in the column. The symbol ◦ indicates that there is no difference whatsoever between the row and the column.
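For reference, this statistical machinery can be reproduced with SciPy and statsmodels roughly as sketched below; the AUC matrix is a random placeholder, and pairwise Wilcoxon signed-rank tests are used here merely as an example of post-hoc comparisons to which the Benjamini-Hochberg correction [31] is applied (the paper does not specify this particular pairwise test).

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
results = rng.random((30, 6))    # placeholder: AUC of 6 algorithms on 30 datasets

# Friedman test: null hypothesis of equal performance across the algorithms
stat, p = friedmanchisquare(*(results[:, j] for j in range(results.shape[1])))

if p < 0.05:                     # null hypothesis rejected: run post-hoc comparisons
    pairs, pvals = [], []
    for i in range(results.shape[1]):
        for j in range(i + 1, results.shape[1]):
            pairs.append((i, j))
            pvals.append(wilcoxon(results[:, i], results[:, j]).pvalue)
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```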
4.1 Filters
We evaluated Algorithm 1 by using two Θ values, Θ =
1.00 and Θ = 0.95, and default values for L. In the
results, we also incorporated the attributes selected by a single decision tree (which corresponds to setting the parameters Θ = 1.00 and L = 1 in Algorithm 1), designated the ‘UT’ (Unique Tree) filter hereafter. We implemented Algorithm 1 as a new Weka class, and the method
buildDecisionTree in Algorithm 1 uses the algorithm
J48 [29], a Java implementation of C4.5 [27].
We also used three additional filter algorithms, all of which employed Weka's default settings [29]: (i) CFS (Correlation-based Feature Selection), which uses the correlation within subsets to assess the predictive ability of each attribute in the subset together with the degree of redundancy among the attributes; this filter considers a subset good if the attributes contained therein correlate well with the class and are uncorrelated with each other [32]; (ii) Relief-F, whose basic idea is to choose a subset of instances randomly, calculate their nearest neighbors, and adjust a weight vector to give greater values to attributes that can differentiate an instance from its neighbors of different classes [33]; and (iii) Gain Ratio, which uses the namesake metric to rank all attributes, taking the number of outcomes of each attribute into account [21].
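For illustration, the Gain Ratio of a single nominal attribute can be computed from its standard information-theoretic definition (information gain divided by split information); the sketch below is not Weka's implementation and the toy data are invented:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    """Gain Ratio = Information Gain / Split Information for one nominal attribute."""
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    cond_entropy = sum(w * entropy(labels[attribute == v])
                       for v, w in zip(values, weights))
    info_gain = entropy(labels) - cond_entropy
    split_info = -np.sum(weights * np.log2(weights))
    return info_gain / split_info if split_info > 0 else 0.0

# toy example: an attribute with three values and a binary class
a = np.array(["x", "x", "y", "y", "z", "z"])
y = np.array([0, 0, 1, 1, 0, 1])
print(round(gain_ratio(a, y), 3))   # -> 0.421
```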
Table 2 Benjamini-Hochberg post-hoc test (all / low-density / high-density datasets): runtime and percentage of selected attributes of Algorithm 1 (Θ = 1.00 and Θ = 0.95) and the other filters (CFS, Gain Ratio, Relief-F, and UT)
Fig. 3 AUC values on filter impact over inducer's performance: boxplots for each inducer (J48, IBk3, NB, PART, and SMO) over all, low-density, and high-density datasets, comparing CFS, Gain Ratio, Relief-F, Θ = 1, Θ = 0.95, UT, and the inducer without filtering
4.2 Inducers
We used five different machine learning paradigms: (i)
PART rule learning algorithm, (ii) decision-tree learning
represented by the J48 algorithm, (iii) statistical learning using Naïve Bayes (NB), (iv) support vector machines
with Sequential Minimal Optimization (SMO), and (v) lazy
instance-based learning using the IBk algorithm. We applied
all of them in their default settings, except for IBk, where k = 3; this configuration will be designated IBk3 hereafter.
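The experiments use Weka implementations; for readers who prefer Python, rough scikit-learn analogues of four of the five inducers (PART has no direct scikit-learn counterpart and is omitted; kernel and other settings are illustrative, not Weka's defaults) would be:

```python
from sklearn.tree import DecisionTreeClassifier        # ~ J48 (C4.5-style tree)
from sklearn.naive_bayes import GaussianNB             # ~ NB
from sklearn.svm import SVC                            # ~ SMO (SVM)
from sklearn.neighbors import KNeighborsClassifier     # ~ IBk

inducers = {
    "J48":  DecisionTreeClassifier(),
    "NB":   GaussianNB(),
    "SMO":  SVC(kernel="linear", probability=True),
    "IBk3": KNeighborsClassifier(n_neighbors=3),
}
```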
5 Results & discussion
Runtime According to the top boxplots in Fig. 2 and to
Table 2, Gain Ratio gave the shortest time among all the
filters, followed by CFS, UT, Θ = 1.00, and Θ = 0.95.
Relief-F provided the worst time in all the analyzed cases.
Although Algorithm 1 using Θ = 1.00 or Θ = 0.95
was significantly worse than UT in all the high-density
datasets, the average running time of Algorithm 1 was
1.39%, 14.68%, and 7.60% slower than UT for all, low- and
high-density datasets, respectively.
Compression capacity According to the bottom boxplots
in Fig. 2 and to Table 2, Relief-F selected 95.49%, 78.63%,
and 98.45% of attributes for all, low-, and high-density
datasets, on average. Relief-F was significantly worse than
CFS, Θ = 1.00, Θ = 0.95, and UT in all three cases. Gain
Ratio selected 87.44%, 87.80% and 49.55% of attributes
in all, low-, and high-density datasets, respectively. CFS
afforded the best average result (46.24%) and selected fewer
attributes than UT (48.63%) for high-density datasets. For
low-density datasets, the average percentage of attributes
selected by CFS, Gain Ratio, Relief-F, Θ = 1.00, Θ =
0.95, and UT were 4.73%, 49.55%, 78.63%, 1.61%, 1.61%,
and 0.14%, respectively. UT was significantly better than
Gain Ratio and Relief-F in all three cases; UT was significantly better than Θ = 1.00 and Θ = 0.95 in all
and high-density datasets, but not significantly better in
low-density datasets.
Impact over inducer’s performance Figure 3 and Table 3
show the AUC values for all datasets and filter settings. UT
was always worse, in many cases significantly, than any
other filter or even the inducer without any filter. No filter
(except for UT) caused significant loss of accuracy when
used in conjunction with an inducer. Except for CFS+NB
which was significantly better than Gain Ratio+NB for
low-density datasets, there were no significant differences
between the filters in terms of AUC. However, previous research found that the attribute subsets selected by different FSS algorithms were quite different [34, 35]. Thus, for data mining and knowledge extraction, applying multiple filters to a high-dimensional dataset seems to be interesting, since this may yield complementary views of the problem at hand.

Table 3 Benjamini-Hochberg post-hoc test (all / low-density / high-density datasets) for AUC values on filter impact over inducer's performance: Algorithm 1 (Θ = 1.00 and Θ = 0.95) and the other filters (CFS, Gain Ratio, Relief-F, and UT) applied with five machine learning inducers (J48, IBk3, NB, PART, and SMO). The notation F + I indicates that the dataset was filtered by filter F and the inducer I was then applied to the filtered dataset, with the AUC metric measured; the notation I indicates that the inducer was evaluated without any filter, as explained in Section 4
To understand the impact of the filter on the performance of the inducer, we analyzed the improvement or
reduction in AUC values and compared the performance of
the inducer without any filter, AUC(I ), to the performance
of the inducer using filtered features, AUC(F + I). This comparison was expressed as the ratio AUC(F + I)/AUC(I) − 1.
By using this ratio, improvements in AUC values are
expressed as positive figures (filters increased the performance); reductions are expressed as negative ones (filters decreased the performance). We used the average of these ratios to summarize the results presented in Table 4. The last column of this table shows that UT was the worst filter in all cases; comparing the results of this table with the data in Table 3 (column UT+I), the performance of UT as a filter is clearly often significantly lower than that of the inducer I without this sort of filter. This degradation in performance suggests that the UT filter
should not be considered in real-world applications. Excluding the UT filter and considering the remaining filters in all
datasets, CFS showed almost 2% performance gain on average versus almost 1% for Gain Ratio, Relief-F, Θ = 1.00,
and Θ = 0.95. For high-density datasets, these gains almost
doubled. On the other hand, for low-density datasets the performance degradation obtained for Θ = 1.00 and Θ = 0.95
(less than 1%) was similar to the performance degradation obtained for the CFS filter (0%), but smaller than the performance degradation obtained for the Gain Ratio and Relief-F filters (greater than 5%).

Table 4 Median AUC improvement/reduction in filter impact over all inducers' performance

Filter      Low-density  High-density      All
CFS                0.00          4.19     1.82
Gain Ratio        −6.83          1.80     0.61
Relief-F          −5.59          1.80     0.61
Θ = 1.00          −0.62          1.80     0.61
Θ = 0.95          −0.62          1.80     0.61
UT               −19.88         −8.98   −10.30

Positive figures mean AUC improvements when using filters; negative figures mean AUC reduction. Figures are expressed as percentages
In summary, the results showed that the runtime of Algorithm 1 is very close to, but sometimes significantly slower
than the runtime of a unique tree. The compression capacity
of Algorithm 1 is also significantly worse than the compression capacity of inducing a unique tree. In contrast,
Algorithm 1 performs significantly better than a unique
decision tree. Algorithm 1 performs as well as other existing
filters (except for UT). It is noteworthy that while Algorithm 1 uses trees to select attributes (its bias), these selected
attributes do not degrade the performance of algorithms
with different learning biases.
6 Conclusion
In this paper, we proposed an iterative decision tree-based
filter for feature subset selection. Although the proposed
filter can use any inducer with embedded feature selection and any metric to determine whether selection of an
attribute is desirable, we fixed J48 as the filter inducer and
AUC as the selection metric. Using several medical datasets
we evaluated our filter in terms of running time, compression capacity and performance over five machine learning
paradigms.
Overall, our approach took roughly as long as the simpler filter that generates a single decision tree (UT) and uses its attributes, but it performed better, and its performance was comparable to that of the other filters.
The compression capacity of our algorithm on 30
datasets was less than 80%, whereas the filters Gain Ratio
and Relief-F selected 87.44% and 95.49% of attributes,
respectively; filter CFS selected 34.81% of attributes. However, considering high-dimensional datasets or, equivalently,
low-density datasets, our algorithm selected less than 2% of
attributes without harming performance.
Considering performance, UT was always worse, in
many cases significantly, than any other filter or even the
inducer without any filter. Hence, we do not recommend
the use of UT as a filter in daily practice. No filter (except
for UT), including our approach, caused a significant loss
of accuracy when used in conjunction with an inducer.
This reinforces the fact that machine learning and data
mining practitioners should consider these filters, particularly for large databases. Because attribute subsets selected
by different FSS algorithms are generally quite distinct,
our approach constitutes a good alternative for knowledge
discovery in high-dimensional datasets, typically found in the
medical, biomedical or biological domains. Our algorithm
is also suitable for problem transformation and algorithm
adaptation and has potential for use in low-density datasets.
Acknowledgements This work was partially funded by a joint grant
between the National Research Council of Brazil (CNPq), and the
Amazon State Research Foundation (FAPEAM) through the Program
National Institutes of Science and Technology, INCT ADAPTA Project
(Centre for Studies of Adaptations of Aquatic Biota of the Amazon).
We are thankful to Cynthia M. Campos Prado Manso for thoroughly
reading the draft of this paper.
Appendix: Datasets
The experiments reported here used 30 datasets, all of them
representing real medical data, such as gene expressions,
surveys, and diagnoses. The medical domain often imposes
difficult obstacles to learning algorithms: high dimensionality, a huge or very small amount of instances, several
possible class values, unbalanced classes, etc. This sort of data is well suited to filters, not only because of its large dimension but also because filters are computationally more efficient than wrappers [36]. Table 5 shows a summary of the datasets, none of which have missing values for the class attribute.

Table 5 Summary of the datasets used in the experiments

#   Dataset          N     c   A     MISS    Density
1   Lymphoma         96    9   4026  5.09%   0.27
2   CNS              60    2   7129  0.00%   0.34
3   Leukemia         72    2   7129  0.00%   0.36
4   Leukemia nom.    72    2   7129  0.00%   0.36
5   Colon            62    2   2000  0.00%   0.40
6   Lung Cancer      32    3   56    0.28%   0.52
7   C. Arrhythmia    452   16  279   0.32%   0.58
8   Ecoli            482   13  280   1.07%   0.63
9   Dermatology      366   6   34    0.06%   1.12
10  Lymphography     148   4   18    0.00%   1.17
11  HD Switz.        123   5   13    17.07%  1.18
12  Hepatitis        155   2   19    5.67%   1.34
13  P. Patient       90    3   8     0.42%   1.50
14  HD Hungarian     294   5   13    20.46%  1.52
15  HD Cleveland     303   5   13    0.18%   1.53
16  WDBC             569   2   30    0.00%   1.54
17  Splice Junction  3190  3   60    0.00%   1.63
18  Thyroid 0387     9172  32  29    5.50%   1.67
19  Heart Statlog    270   2   13    0.00%   1.76
20  Allhyper         3772  5   29    5.54%   1.91
21  Allhypo          3772  4   29    5.54%   1.97
22  Breast Cancer    286   2   9     0.35%   2.08
23  Sick             3772  2   29    5.54%   2.12
24  Hypothyroid      3163  2   25    6.74%   2.16
25  ANN Thyroid      7200  3   21    0.00%   2.46
26  WBC              699   2   9     0.25%   2.48
27  Liver Disorders  345   2   6     0.00%   2.65
28  Pima Diabetes    768   2   8     0.00%   2.67
29  C. Method        1473  3   9     0.00%   2.69
30  H. Survival      306   2   3     0.00%   4.21

N, A and c stand for the number of instances, number of attributes, and number of classes, respectively; MISS represents the percentage of attributes with missing values, not considering the class attribute. Datasets are in ascending order of Density
Since the number of attributes and instances in each dataset can influence the results, we used the density metric D3 proposed by [28], partitioning the datasets into 8 low-density (Density ≤ 1) and 22 high-density (Density > 1) datasets. We computed the density as Density = log_A((N + 1)/(c + 1)), where N represents the number of instances, A is the number of attributes, and c represents the number of classes.
Next we provide a brief description of each dataset.
Breast Cancer, Lung Cancer, CNS (Central Nervous System
Tumour Outcome), Colon, Lymphoma, Leukemia, Leukemia
nom., WBC (Wisconsin Breast Cancer), WDBC (Wisconsin
Diagnostic Breast Cancer), Lymphography and H. Survival
(H. stands for Haberman’s) are all related to cancer and their
attributes consist of clinical, laboratory and gene expression data. Leukemia and Leukemia nom. represent the same
data, but the second one had its attributes discretized [25].
C. Arrhythmia (C. stands for Cardiac), Heart Statlog, HD
Cleveland, HD Hungarian and HD Switz. (Switz. stands for
Switzerland) are related to heart diseases and their attributes
represent clinical and laboratory data. Allhyper, Allhypo,
ANN Thyroid, Hypothyroid, Sick and Thyroid 0387 are a
series of datasets related to thyroid conditions. Hepatitis
and Liver Disorders are related to liver diseases, whereas C.
Method (C. stands for Contraceptive), Dermatology, Pima
Diabetes (Pima Indians Diabetes) and P. Patient (P. stands
for Postoperative) are other datasets related to human conditions. Splice Junction is related to the task of predicting
boundaries between exons and introns. E.Coli is related to
protein localization sites. The datasets were obtained from the UCI Repository [37]; Leukemia and Leukemia nom. were obtained from [38].
References
1. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection
techniques in bioinformatics. Bioinformatics 23(19):2507
2. Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996). In: Fayyad
UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) From
data mining to knowledge discovery: an overview. American
Association for Artificial Intelligence, Menlo Park, pp 1–30
3. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I
et al (2006) Machine learning in bioinformatics. Brief Bioinform
7(1):86–112
4. Foithong S, Pinngern O, Attachoo B (2011) Feature subset selection wrapper based on mutual information and rough sets. Expert
Systems with Applications
5. Han J, Kamber M, Pei J (2011) Data mining: concepts and
techniques. Morgan Kaufmann
6. Ditzler G, Morrison J, Lan Y, Rosen G (2015) Fizzy: feature subset selection for metagenomics. BMC Bioinformatics 16(1):358. Available from: http://www.biomedcentral.com/1471-2105/16/358
7. Mandal M, Mukhopadhyay A, Maulik U (2015) Prediction of protein subcellular localization by incorporating multiobjective PSObased feature subset selection into the general form of Chou’s
PseAAC. Med Biol Eng Comput 53(4):331–344. Available from:
doi:10.1007/s11517-014-1238-7
8. Purkayastha P, Rallapalli A, Bhanu Murthy NL, Malapati A,
Yogeeswari P, Sriram D (2015) Effect of feature selection
on kinase classification models. In: Muppalaneni NB, Gunjan
VK (eds) Computational intelligence in medical informatics
springerbriefs in applied sciences and technology. Springer, Singapore, pp 81–86. Available from: doi:10.1007/978-981-287-26098
9. Devaraj S, Paulraj S (2015) An efficient feature subset selection
algorithm for classification of multidimensional dataset. The Scientific World Journal. 2015. (Article ID 821798):9 p Available
from: doi:10.1155/2015/821798
10. Govindan G, Nair AS (2014) Sequence features and subset
selection technique for the prediction of protein trafficking phenomenon in Eukaryotic non membrane proteins. International
Journal of Biomedical Data Mining 3(2):1–9. Available from:
http://www.omicsonline.com/open-access/sequence-features-andsubset-selection-technique-for-the-prediction-of-protein-traffickingphenomenon-in-eukaryotic-non-membrane-proteins-2090-4924.100
0109.php?aid=39406
11. Blum AL, Langley P (1997) Selection of relevant features and
examples in machine learning. AI 97(1–2):245–271
12. Kohavi R, John GH (1997) Wrappers for feature subset selection.
Artif Intell 97(1–2):273–324. Relevance. Available from: http://
www.sciencedirect.com/science/article/pii/S000437029700043X
13. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for
cancer classification using support vector machines. Mach Learn
46(1-3):389–422. Available from: doi:10.1023/A:1012487302797
14. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182. Available from:
http://dl.acm.org/citation.cfm?id=944919.944968
15. Uncu Ö, Türkşen IB (2007) A novel feature selection approach: combining feature wrappers and filters. Inf Sci 177(2):449–466. Available from: http://www.sciencedirect.com/science/article/pii/S0020025506000806
16. Min H, Fangfang W (2010) Filter-wrapper hybrid method on feature selection. In: 2010 2nd WRI global congress on intelligent systems (GCIS), vol 3. IEEE, pp 98–101
17. Lan Y, Ren H, Zhang Y, Yu H, Zhao X (2011) A hybrid feature
selection method using both filter and wrapper in mammography CAD. In: Proceedings of the 2011 international conference
on IEEE image analysis and signal processing (IASP), pp 378–
382
18. Estévez PA, Tesmer M, Perez CA, Zurada JM (2009) Normalized
mutual information feature selection. IEEE Transn Neural Netw
20(2):189–201
19. Yu L, Liu H (2003) Feature selection for high-dimensional data:
a fast correlation-based filter solution. In: Machine learning international conference, vol 20, p 856. Available from: http://www.
public.asu.edu/∼huanliu/papers/icml03.pdf
20. Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the 10th
national conference on artificial intelligence. AAAI’92. AAAI
Press, pp 129–134. Available from: http://dl.acm.org/citation.cfm?
id=1867135.1867155
21. Hall MA, Smith LA (1998) Practical feature subset selection for
machine learning. In: McDonald C (ed) J Comput S ’98 Proceedings of the 21st Australasian computer science conference
ACSC98, Perth, 4-6 February. Springer, Berlin, pp 181–191
22. Hall MA (2000) Correlation-based feature selection for discrete
and numeric class machine learning. In: Proceedings of the 17th
international conference on machine learning. ICML ’00. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; pp 359–
366. Available from: http://dl.acm.org/citation.cfm?id=645529.
657793
23. Gao K, Khoshgoftaar T, Van Hulse J (2010) An evaluation of
sampling on filter-based feature selection methods. In: Proceedings of the 23rd international florida artificial intelligence research
society conference, pp 416–421
24. Efron B, Tibshirani R (1997) Improvements on cross-validation:
the 632+ bootstrap method. J Am Stat Assoc 92(438):548–560
25. Netto OP, Nozawa SR, Mitrowsky RAR, Macedo AA,
Baranauskas JA, Lins CUN (2010) Applying decision trees to gene
expression data from DNA microarrays: a Leukemia case study.
In: XXX congress of the Brazilian computer society, X workshop
on medical informatics, p 10
26. Netto OP, Baranauskas JA (2012) An iterative decision tree
threshold filter. In: XXXII congress of the Brazilian computer
society, X workshop on medical informatics, p 10
27. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
28. Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest? In: Proceedings of the 8th international conference on machine learning and data mining in pattern recognition. MLDM'12. Springer-Verlag, Berlin Heidelberg, pp 154–168. Available from: doi:10.1007/978-3-642-31537-4_13
29. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann
30. Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
31. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300
32. Hall MA, Smith LA (1997) Feature subset selection: a correlation based filter approach. In: 1997 international conference on neural information processing and intelligent information systems. Springer, pp 855–858
33. Wang Y, Makedon F (2004) Application of Relief-F feature filtering algorithm to selecting informative genes for cancer classification using microarray data. In: Proceedings of the computational systems bioinformatics conference, 2004. CSB 2004, IEEE, pp 497–498
34. Baranauskas JA, Monard MC (1999) The MLL++ wrapper for feature subset selection using decision tree, production rule, instance based and statistical inducers: some experimental results. Technical Report 87, ICMC-USP. Available from: http://dcm.ffclrp.usp.br/augusto/publications/rt 87.pdf
35. Lee HD, Monard MC, Baranauskas JA (1999) Empirical comparison of wrapper and filter approaches for feature subset selection. Technical Report 94, ICMC-USP. Available from: http://dcm.ffclrp.usp.br/augusto/publications/rt 94.pdf
36. Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms. Wiley-IEEE Press
37. Frank A, Asuncion A (2010) UCI machine learning repository. Available from: http://archive.ics.uci.edu/ml
38. Broad Institute (2010) Cancer program data sets. Available from: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi