Impact of preprocessing on medical data classification

Computer Science Department, College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2016
Abstract  The significance of the preprocessing stage in any data mining task is well known. Before attempting medical data classification, characteristics of medical datasets, including noise, incompleteness, and the existence of multiple and possibly irrelevant features, need to be addressed. In this paper, we show that selecting the right combination of preprocessing methods has a considerable impact on the classification potential of a dataset. The preprocessing operations considered include the discretization of numeric attributes, the selection of attribute subset(s), and the handling of missing values. The classification is performed by an ant colony optimization algorithm as a case study. Experimental results on 25 real-world medical datasets show that a significant relative improvement in predictive accuracy, exceeding 60% in some cases, is obtained.

Keywords  classification, ant colony optimization, medical data classification, preprocessing, feature subset selection, discretization

Received May 24, 2015; accepted January 8, 2016
E-mail: [email protected]

1 Introduction

Modern clinical information systems store an extensive amount of data in medical databases. This encourages the extraction of useful knowledge from medical databases, providing valuable insight for medical decision support. A branch of data mining, known as medical data mining, is currently considered one of the most popular research subjects in the data mining community [1]. This, in part, is due to the societal significance of the subject and also to the computational challenges it presents [2].

Normally, there exists a dataset of historic data describing a particular medical disorder. Such datasets consist of patients' records relating to demographic, clinical and pathological data, along with the results of medical investigations that have been collected for the diagnosis and prognosis of a particular medical disorder. Modern medical screening and diagnostic methods generate a high volume of heterogeneous data. These data are continually accumulating. Thus, mining such data requires intelligent methods [3,4].

Medical data classification (MDC) refers to learning classification models from medical datasets and aims to improve the quality of health care [5]. Medical data classification can be used for diagnosis and prognosis purposes. MDC is different from medical classification, or medical coding, which is the process of assigning internationally endorsed classification codes to each medical diagnosis and procedure (the WHO Family of International Classifications1)).

Medical data exhibit unique features including noise resulting from human as well as systematic errors, missing values and even sparseness [4]. Table 1 illustrates medical dataset examples. These datasets are obtained from the University of California Irvine (UCI) repository of machine learning datasets2). For example, some datasets, like dermatology, consist of different types of attributes. As another example, high dimensionality is a feature of the cardiac arrhythmia dataset. The thyroid dataset contains more than 7 000 instances. The hepatitis dataset is imbalanced.

1) http://www.who.int/classifications/en/
2) http://archive.ics.uci.edu/ml/datasets.html
The percentage of missing values in the Hungarian heart dataset exceeds 20%. Due to this nature, Tanwani et al. [4] called for treating the classification of medical data as a separate domain.

Data preprocessing has a profound effect on the performance of the learner. The classification potential of a dataset can be improved to a large extent by selecting the right combination of preprocessing methods. Preprocessing is especially important for medical datasets due to their characteristics. However, each dataset is different, and there is no preprocessing method that is best across all datasets. Deciding the best combination of preprocessing methods for a specific dataset is not possible without trial and comparisons. Technology is advancing rapidly. The advent of various open-source libraries, like Weka [6] and KEEL [7], hosting an extensive set of off-the-shelf preprocessing methods, combined with the convenience of standard formats like the attribute-relation file format (ARFF)3) and advances in computer hardware technology, encourages the integration of automatic tuning for preprocessing operations into the data mining task for each dataset on an individual basis. The idea is suitable for off-line applications.

3) http://weka.wikispaces.com/ARFF

In this research, we investigate the influence of individualized preprocessing on the classification of medical datasets, including the removal of missing values and a variety of discretization and attribute selection methods. Experiments were conducted on 25 real-world medical datasets from the UCI machine learning repository. Datasets are then classified by means of the AntMiner+ algorithm [8]. Numerical results show that a significant improvement in classification performance, as measured by predictive accuracy, with relative improvement exceeding 60% in some cases, is obtained in the majority of datasets in the benchmark through the individualized tuning of the preprocessing operations. Moreover, given a certain classification algorithm, the design of the preprocessing stage can mean the difference between complete failure and the achievement of good results on the same datasets.

The rest of the paper is organized as follows. Section 2 highlights related work in the area. A discussion about AntMiner+ as a classification algorithm from the family of ant colony optimization (ACO) algorithms is presented in Section 3. Next, Section 4 describes the individualized tuning procedure. Experimental results are presented in Section 5 and discussed in Section 6. The paper is concluded in Section 7.

2 Related work

Tanwani and Farooq [9–11] performed an extensive study to present the challenges associated with biomedical data and approximate the classification potential of a biomedical dataset using a qualitative measure of this complexity. The study concludes that the classification accuracy is found to be dependent on the complexity of the biomedical dataset, not on the classifier choice. The number and type of attributes have no noticeable effect on the classification accuracy, as compared to the quality of the attributes. It is shown that biomedical datasets are noisy and that noise is the dominant factor that affects the resulting classification accuracy. Only high percentages of missing values severely degrade the classification accuracy. The study also shows that evolutionary algorithms tend to overfit for small-sized datasets and are not much affected by the class imbalance problem.

The quality of data has a large implication for the quality of the mining results. It is necessary to perform a preprocessing step in order to remove, or at least alleviate, some of the problems associated with medical data. Depending on the characteristics of the data themselves, many preprocessing techniques are pertinent. In this section, methodologies for dimensionality reduction (Section 2.1), discretization algorithms (Section 2.2), and handling missing values (Section 2.3) are described.

2.1 Dimensionality reduction
As medical datasets are generally characterized as having high dimensionality, this section introduces techniques common for dealing with this problem. There are two common dimensionality reduction techniques: feature construction (FC) and feature subset selection (FSS). FC is a transformation technique that constructs a new, reduced feature space with a strong discriminative power, but it also loses the original feature characteristics, which leads to difficulties in the interpretation of the resulting models. FSS finds the minimum subset of features that are useful for the classification process. Although several features are discarded, this method preserves the original physical interpretation of the features. Further, in medical diagnosis, it is desirable to select the clinical tests that have the least cost and risk and that are significantly important for determining the class of the disease.

The first step is the subset search step. A search engine is used to generate candidate feature subsets for evaluation based on a certain search strategy. An exhaustive search strategy is practically prohibitive unless the number of variables is small. Greedy and heuristic methods are more efficient in this respect [12] and include sequential forward selection (SFS) [13] and sequential backward selection (SBS) [14]. Hybrid forms of SFS and SBS have been devised to reduce the nesting effect, such as sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS) [15]. However, all of these methods provide a suboptimal solution [16].

Due to their well-known efficiency in combinatorial optimization problems, optimization-based search methods are also applied to FSS. The use of metaheuristics, along with their hybrids, is well established in the FSS research [16–20].

Several feature subset selection techniques do not employ any search strategy. Individual features are ranked (weighted) according to their relevance, independently of the context of the other features. Feature ranking is not used to build predictors but is a simple, scalable approach with good empirical success [21], which can be used as a baseline method to construct nested subsets for building predictors. Similar to filters, correlation and information-theoretic measures can be used in feature weighting and ranking. Another approach is to use single-variable classifiers, which rank variables according to their discriminative power when used individually to build a classifier.

Next, the candidate feature subset is evaluated in the subset evaluation step. The evaluation criterion is essential in determining the best candidate subset. FSS methods can be classified into two approaches based on their dependency on the learning model: model-free and model-based approaches.

• Model-free approach  In the model-free (filter) approach, features are selected independently of the classification algorithm, as a preprocessing step. The selection employs measures that utilize intrinsic characteristics of the data to determine the relevance between the features and the class. These measures evaluate the class separability. Among the popular measures applied are distance [22], consistency [23] and correlation measures [24,25]. All these measures rely on the actual values of the training dataset and are thus sensitive to noise and outlier values [12]. Information measures [26–28] are popular as well.

• Model-based approach  The model-based approach for FSS applies a specific learning algorithm and uses its predictive accuracy as a measure of the subset effectiveness. Model-based methods fall into two types [29]. Wrappers [30,31] use the learning machine as a black box. Embedded methods incorporate the feature subset selection process as a part of the training process, such as that in genetic programming [32] and decision tree algorithms like CART [33].

Hybrid methods exist as well. For example, as an efficient method, filters can be used initially to eliminate definitely redundant features, thus reducing the search space, and then a wrapper can be used for the search in the second phase [34].
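To make the model-free/model-based distinction concrete, the following sketch contrasts a filter (chi-squared ranking) with a wrapper (greedy sequential forward selection). It is an illustration using scikit-learn rather than the Weka tools used in this paper; the dataset and the choice of keeping k = 10 features are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # UCI wdbc, 30 numeric attributes

# Model-free (filter): rank features by a chi-squared statistic, keep the top 10.
# chi2 expects non-negative values, which holds for this dataset.
filter_fss = SelectKBest(chi2, k=10).fit(X, y)
print("filter keeps:", np.flatnonzero(filter_fss.get_support()))

# Model-based (wrapper): grow a subset greedily (SFS), scoring each candidate
# subset by the cross-validated accuracy of the learner itself.
learner = LogisticRegression(max_iter=5000)
wrapper_fss = SequentialFeatureSelector(
    learner, n_features_to_select=10, direction="forward", cv=5).fit(X, y)
print("wrapper keeps:", np.flatnonzero(wrapper_fss.get_support()))
```

The filter is fast and learner-agnostic; the wrapper is far costlier but tailors the subset to the learner, mirroring the trade-off discussed above.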
2.2 Discretization algorithms

Some learning algorithms cannot deal directly with numeric attributes. Discretization is thus an essential step to transform these attributes into a form that can be handled. The numeric domain is partitioned into a finite number of non-overlapping intervals. An association is then established between each numeric value that belongs to the attribute and the interval to which it belongs. As a side effect, discretization works as a data reduction and simplification method because it converts the huge numeric domain into a much smaller set of nominal intervals [35]. Finding the optimal discretization of a numeric attribute is NP-hard [36]. A recent survey [37] lists over 80 discretization methods that belong to 33 different categories of the discretization taxonomy.

If the class information is considered along with the evaluation measure during discretization, then the method is called a supervised discretization method. Most available discretization methods belong to the supervised type. The class label is not considered in unsupervised methods. EqualWidth and EqualFrequency [38] discretizers are examples of unsupervised discretization methods.

Simple discretization methods do not use any evaluation measure to decide cut points. A predefined number of bins n is established, and the domain is partitioned into n equal-width intervals (EqualWidth) or into n intervals that each contain approximately the same number of values (EqualFrequency).
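A minimal sketch of these two unsupervised schemes, assuming NumPy and a one-dimensional attribute; the bin counts follow the 5–20 range examined later in Section 4.2.

```python
import numpy as np

def equal_width(values, n_bins=10):
    """EqualWidth (eib): split [min, max] into n_bins intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1])  # interval index 0..n_bins-1

def equal_frequency(values, n_bins=10):
    """EqualFrequency (efb): place cut points at quantiles so that each
    interval holds roughly the same number of values."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.digitize(values, edges[1:-1])

ages = np.array([21, 23, 25, 30, 31, 44, 47, 52, 60, 85], dtype=float)
print(equal_width(ages, 4))      # wide gaps can leave some bins nearly empty
print(equal_frequency(ages, 4))  # each bin receives ~len(ages)/4 values
```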
and a default rule related to the majority class. In effect, rule induction focuses on classes other than the majority class. This particular strategy is advantageous in MDC because the majority class instances are normally the negative cases, about which we care less. The sequential-covering strategy helps in handling large-sized datasets; due to the removal of instances already covered by induced rules, a progressive reduction of the training set size is achieved.

The AntMiner+ algorithm cannot handle instances containing missing values. Thus, these instances are removed from the dataset in the first step. To reduce the size of the solution space, the number of attributes is limited to no more than a default value of 10. If the dataset contains a larger number of attributes, then attribute selection takes place prior to induction.

Various attribute types can be handled by the AntMiner+ algorithm. These include nominal and ordinal values, as well as numeric values, including integer and continuous attributes that are discretized. In effect, numeric values are encoded as discrete intervals defined by [lower_bound, upper_bound].

The order of preprocessing steps in the concerned AntMiner+ implementation is as follows:

1) removal of instances with missing values;
2) discretization;
3) attribute selection.

The AntMiner+ configuration is tuned for the following aspects:

1) timing of removing instances having missing values (Section 4.1);
2) discretization algorithm (Section 4.2);
3) feature subset selection algorithm (Section 4.3);
4) rule evaluation function (Section 4.4).

4.1 Timing of removing instances having missing values

In the context of the AntMiner+ algorithm, all instances having missing values are removed in the first step of preprocessing. The next steps of the preprocessing consist of the application of the discretization algorithm and the attribute selection algorithm (if necessary). This procedure might not be the best in some cases. For example, consider datasets with a large number of predictive attributes. If the removal of instances having missing values is delayed until after the attribute selection step, then this would allow more instances to be available for the training and testing subsets, thus perhaps improving the results. Otherwise, some instances would be removed because they include missing values in attributes that will next be removed by the attribute selection step; the removal of these instances is then no longer rationalized. We hypothesize that if the removal of instances with missing values were delayed until after the attribute selection step, then better results would be obtained.
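The effect of this reordering can be illustrated with a small pandas sketch; the toy DataFrame and the pre-selected attribute subset are hypothetical stand-ins for a real dataset and for the output of an FSS method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a1": [1.0, np.nan, 3.0, 4.0, 5.0],   # attribute later discarded by FSS
    "a2": [2.0, 7.0, np.nan, 1.0, 6.0],   # attribute later discarded by FSS
    "a3": [0.5, 0.9, 0.1, 0.7, 0.3],      # attribute retained by FSS
    "class": ["+", "-", "+", "-", "+"],
})
selected = ["a3"]  # hypothetical result of attribute selection

# AM+Original order: drop incomplete instances first, then select attributes.
n_original = len(df.dropna())                   # 3 instances survive
# AM+RML order: select attributes first, then drop rows that are incomplete
# only with respect to the retained attributes.
n_rml = len(df[selected + ["class"]].dropna())  # all 5 instances survive
print(n_original, n_rml)
```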
4.2 Discretization method

Different discretization methods exist, but none can be proven the best across all problems and learners [37]. When dealing with a specific problem or dataset, the choice of the discretization method has a considerable effect on the classification results in terms of both predictive accuracy and model simplicity.

Four discretization methods are selected for the discretization tuning. All of them are classified as static, univariate, splitting methods. A brief description of each, along with the acronym used for it, is presented next.

• Fayyad and Irani discretizer (fay)  The Fayyad and Irani discretizer [26] is one of the most popular discretizers; it obtains a reasonable balance between the number of intervals and the accuracy obtained [37]. It is based on a supervised method that uses an entropy measure to decide split points. Stopping is based on a minimum description length (MDL) [60] criterion that explains the attractive balance between model complexity (the number of intervals in this case) and performance (accuracy).

• Kononenko's MDL discretizer (kon)  This is similar to the Fayyad and Irani discretizer, but it uses Kononenko's MDL criterion [61]. Kononenko's MDL criterion has a lower bias in handling multi-valued attributes and multi-class problems.

• EqualWidth discretizer (eib)  EqualWidth [38], or equal interval binning (eib), partitions the continuous domain into a predefined number of equal-width bins. For each dataset, 5, 10, 15, and 20 intervals are examined. The resulting models are referred to as eib5, eib10, eib15, and eib20, respectively. EqualWidth is an unsupervised discretization method.

• EqualFrequency discretizer (efb)  EqualFrequency [38], or equal frequency binning (efb), partitions the continuous domain into a predefined number of intervals such that the intervals have an equal
number of values. Similar to eib, for each dataset, 5, 10, 15, and 20 intervals are examined. The resulting models are referred to as efb5, efb10, efb15, and efb20, respectively. Also similar to eib, efb is an unsupervised discretization method.

4.3 Feature subset selection method

The FSS (or attribute selection) process is described with respect to four aspects: subset search, subset evaluation, halting criteria, and result validation.

Following is a list of the considered FSS methods. Next, for each method we explain the strategy used for subset search, subset evaluation, halting criteria, and result validation.

1) ReliefF attribute evaluation (rel) [62,63].
2) Correlation-based feature subset selection (cfs) [64].
3) Consistency subset evaluation (con) [65].
4) Chi-squared attribute evaluation (chi).
5) Gain ratio attribute evaluation (gai).
6) Information gain attribute evaluation (inf).
7) OneR attribute evaluation (1R).
8) Symmetrical uncertainty attribute evaluation (sym) [66].
9) No attribute selection employed (0AS).

• Subset search

The subset search methods described in Section 2.1 can be grossly classified into exhaustive, exact, greedy, and heuristic methods. In addition, there is the simple feature-ranking method. Exhaustive methods are computationally intractable. Heuristic methods require high computational time, which is not suitable in combination with AntMiner+. For (cfs) and (con), the best-first search is used with a backward search direction. The rest of the methods use the simple ranking of attributes according to the attribute evaluation used [21]. Ranking is chosen due to its scalability, simplicity, and performance [29].

• Subset evaluation

The methods selected include both model-free and model-based subset evaluation. As for the model-free evaluation methods, the selection includes distance-based (rel), consistency-based (con), correlation-based (cfs), and information theoretic-based evaluation measures (chi, gai, and inf). Attributes are individually evaluated using ReliefF (rel). The (gai) method measures the gain ratio with respect to the class. The (inf) method evaluates each attribute individually according to its measured information gain with respect to the class. The (chi) method uses the chi-squared statistical measure to assess the degree of independence between the attribute and the associated class. The (chi) method works on categorical attributes.

A model-based evaluation is also included (1R). Each attribute is evaluated individually by using the simple OneR classifier [67]. The (1R) method generates a single rule for each attribute and ranks attributes according to the error rate associated with these rules.

• Halting criterion

Datasets are examined as follows. For model-free methods, we use a default number of attributes (10) to retain, as recommended by Minnaert et al. [58]. The best-first search in cfs and con terminates when a default number (5) of non-improving consecutive nodes is reached. Finally, (1R) requires no halting criterion, as it generates a single rule per attribute and then performs the ranking.

• Result validation

The (1R) model-based method uses a 10-fold cross-validation procedure. The no-attribute-selection method (0AS) is included for baseline comparisons.
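As an aside, the single-attribute ranking idea behind OneR can be emulated with a depth-one decision tree ("decision stump") per attribute. The sketch below uses scikit-learn as a stand-in for the Weka OneR implementation; the dataset (UCI wdbc, one of the benchmark datasets) is chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Score each attribute on its own with a one-level tree under 10-fold CV,
# then rank attributes by that accuracy (OneR ranks by error rate).
scores = [cross_val_score(DecisionTreeClassifier(max_depth=1, random_state=0),
                          X[:, [j]], y, cv=10).mean()
          for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
print("top-5 attributes:", ranking[:5])
```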
4.4 Rule evaluation function

A rule evaluation function maps a rule r into a fitness value Q+(r) that quantifies the quality of r. The higher Q+(r) is, the better the quality of r. In the AntMiner+ algorithm, pheromone is reinforced on the best ant's path proportionally to the quality Q+(r) of the resulting rule.

Instances that belong to the current target class are referred to as positive instances. Those that belong to other classes are referred to as negative instances. Let the number of correctly classified positive instances be denoted TP, and the number of positive instances incorrectly classified as negative FN. Similarly, let the number of negative instances correctly classified as negative be TN, and the number falsely classified as positive FP. A tradeoff has to be established between TP and FP so that the coverage is maximized. Also, let D+ denote the total number of positive instances remaining in the training dataset. Similarly, let D− denote the total number of negative instances remaining in the training dataset, with D+ + D− ≠ 0.

• Klösgen (K) measure  The Klösgen measure [68] balances the tradeoff between precision and coverage. The parameter ω controls the weight assigned to coverage in Eq. (1).
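Equation (1) itself is cut off in this excerpt. A standard form of the Klösgen measure, written in the TP/FP notation defined above (an assumed reconstruction, with coverage weighted by ω and precision compared against the class prior), is:

$$Q_K(r) = \left(\frac{TP + FP}{D^{+} + D^{-}}\right)^{\omega} \left(\frac{TP}{TP + FP} - \frac{D^{+}}{D^{+} + D^{-}}\right). \qquad (1)$$

With ω = 0 the measure reduces to precision minus the prior; larger ω increasingly rewards rules that cover many instances.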
Rule evaluation is not considered a preprocessing step but lies at the core of the rule induction process. Its tuning should take place after the preprocessing tuning.

The stratified 10-time, 10-fold cross-validation procedure is used, as it is considered the best error estimation strategy [71,72].

Further, to compare the performance of the different models examined on a solid basis, it is necessary to use statistical tests. In recent studies, the use of non-parametric statistical tests is highly recommended when dealing with evolutionary computation algorithms [73]. The Wilcoxon signed-ranks test [74] is used for pairwise model analysis. The Friedman test [75] is used for multiple comparison tests. Datasets having a statistically significant difference among their different models are marked with an asterisk (*). According to these tests, the winner at a significance level α = 0.05 is stressed in bold typeface. The first model in the ranking is selected as (one of) the best learning models. If the aim is only to locate the best model, then the procedure is as follows. If the difference is found statistically significant, then this means that at least one of the models included in the comparison is significantly higher (or lower) than the rest. Nevertheless, the top-ranked model is selected in most cases, as it represents the (or one of the) best performing models. In some cases, several models are equally good; that is, the difference among their performance is not considered statistically significant. In this case, other factors may be considered, as will be explained in each case.

5.1 Benchmark of the experimentations

In this research, we use 25 medical datasets obtained from the UCI machine learning repository [76]. The list of medical datasets in the benchmark, together with the abbreviations used hereafter, is as follows.

1) Cardiac arrhythmia (arr).
2) Breast cancer Wisconsin original wbcd (bcw).
3) Contraceptive method choice (cmc).
4) Dermatology (derma).
5) Echocardiogram (echo).
6) Ecoli (ecoli).
7) Haberman's survival (haber).
8) Heart disease Cleveland dataset (h_c).
9) Heart disease Hungarian dataset (h_h).
10) Statlog heart (h_stat).
11) Heart disease Swiss dataset (h_swiss).
12) Hepatitis (hep).
13) Horse colic (horse).
14) Thyroid disease hypothyroid dataset (hypo).
15) Liver disorders (liver).
16) Breast cancer (ljub).
17) Lymphography (lymph).
18) Mammographic mass (mammo).
19) Thyroid disease new thyroid dataset (new_thy).
20) Parkinson's disease (park).
21) Pima Indian diabetes (pima).
22) Primary tumor (p_tumor).
23) Thyroid disease sick dataset (sick).
24) Breast cancer Wisconsin diagnostic (wdbc).
25) Breast cancer Wisconsin prognostic (wpbc).

The benchmark used hosts a wide variety of the characteristics listed above. A summary of the main characteristics is presented in Table 3. For each dataset, the number of instances (#Inst.), the number of attributes (#Attr.), including numeric (#Num.) and nominal (#Nom.) attributes, and the number of classes (#Class.) are listed. Also included are the percentage of overall missing values, computed as %MV = (#missing values / (#Inst. × #Attr.)) × 100, and the percentage of instances with missing values, computed as %Inst.MV = (#inst. with missing values / #Inst.) × 100.

As for class imbalance, and as far as we are concerned, there is no consensus in the literature on a quantitative measure for describing a dataset as imbalanced. The majority of studies either limit the domain to binary classification problems, or compute the imbalance ratio as the cardinality of the minority class over the total number of instances, ignoring other classes that may be present in the dataset (for example, Refs. [77,78]). The last two columns in Table 3 report the class noise (Noise) and imbalance ratio (Imb.Ratio) as reported in Ref. [9]. The imbalance ratio used in this study accounts for all classes in the dataset. For those datasets that were not reported in Ref. [9], a dash (—) is placed.

5.2 Timing of removing instances having missing values

Referring to Table 3, among the 25 datasets in the benchmark, 15 datasets contain missing values. To test the hypothesis, we conduct the following experiment. We modify AntMiner+ such that the removal of instances having missing values is delayed until after the attribute selection step. We call this version
AntMiner+ with remove missing last (AM+RML). The aim of this experiment is to decide which version of AntMiner+ (AM+Original or AM+RML) is more suitable for each dataset considered. Results are compared and conclusions are based on the following strategy: if AM+Original has failed, or there is no notable difference in the average predictive accuracy, then AM+RML is used.

• Results

The results of this experiment are summarized in Fig. 1 and Table 4. For each AntMiner+ model, only the predictive accuracy is shown in the comparison (Acc. ± STD). As a first observation from Fig. 1, we note that the AM+Original model fails in five datasets. The reason for failing is that there are not enough instances to generate any folds. In all of these datasets, we note that the percentage of instances containing missing values is very high (98.90%–100.00%). The AM+RML model produced output for all five datasets. However, for those with a relatively low number of attributes (h_h and h_swiss), the results are considerably poor and produced empty rules in several folds. As for the remaining three datasets (horse, hypo, and sick), results are much better. The large number of associated attributes (22–29) has helped in decreasing the percentage of instances with missing values in the remaining attributes after the attribute selection phase.

Fig. 1  Average predictive accuracy for the timing of removal of instances having missing values experiment

For the remaining 10 datasets, we note that the AM+RML model achieved a significant win over AM+Original in the arr dataset. This is despite the fact that the class noise associated with this dataset is not low (11.28%). The AM+Original obtains a significant win in one dataset: derma.
For this particular dataset, the percentage of class noise is not significant. It is expected that the attribute noise is high, which only adds more noise when adding more instances for this dataset, thus deteriorating classification performance. For the rest of the datasets, the difference in performance between the AM+Original and AM+RML models is not considered statistically significant. Table 4 reports the average accuracy (Acc), number of rules (Rules), number of terms per rule (T/R), and rule-set induction time (Time) over all ten datasets for which AM+Original produced output. From this table it can be seen that, overall, the AM+RML model has higher predictive accuracy than AM+Original.

Table 4  Averages for the timing of removal of instances with missing values

AntMiner+ version | Acc | Rules | T/R | Time/s
AM+Original | 70.99 ± 3.67 | 3.94 ± 0.88 | 2.91 ± 0.37 | 19.63 ± 4.26
AM+RML | 73.48 ± 3.02 | 4.10 ± 1.17 | 2.86 ± 0.34 | 18.85 ± 6.02

5.3 Discretization method

The AntMiner+ algorithm cannot directly handle numeric attributes. Among the benchmark datasets, 21 contain numeric attributes. Discretization is an essential step to transform these numeric attributes into a form that the AntMiner+ algorithm can handle. The numeric attributes can then be handled by AntMiner+ during rule induction as ordinal attributes.

The Weka implementation of the four discretization methods is used. This results in ten models, as will be shortly described. The best performing model for each dataset will be outlined. The default discretization method in the implementation adopted is fay. For binning discretization methods, the default number of bins is 10. Only datasets having continuous attributes (Table 5) are included in this experiment. The column AntMiner+ version describes the version used (Original or RML), as concluded in Section 5.2. The Friedman test is used to test whether the difference among the predictive accuracy of the selected models is statistically significant. The discretization method associated with the best rank is usually chosen. In a few cases, a model other than the best ranked is selected, as justified. Reasoning is usually based on the rule set size and/or the computational time associated with these models.

Table 5  Datasets included in discretization method experiment

No. | Dataset | AntMiner+ version
1 | arr | AM+RML
2 | cmc | AM+Original
3 | derma | AM+Original
4 | echo | AM+RML
5 | ecoli | AM+Original
6 | haber | AM+Original
7 | h_c | AM+RML
8 | h_h | AM+RML
9 | h_stat | AM+Original
10 | h_swiss | AM+RML
11 | hep | AM+RML
12 | horse | AM+RML
13 | hypo | AM+RML
14 | liver | AM+Original
15 | mammo | AM+RML
16 | new_thy | AM+Original
17 | park | AM+Original
18 | pima | AM+Original
19 | sick | AM+RML
20 | wdbc | AM+Original
21 | wpbc | AM+RML

• Results

The predictive accuracy with the associated standard deviation obtained by AntMiner+ in combination with each of the used discretization methods is shown in Fig. 2. The discretization method selected for each dataset is shown in Table 9.

In addition, the predictive accuracy, number of rules, number of terms per rule, and computational time per rule set are averaged over all datasets for each discretization method and shown in Figs. 3 and 4. The grand average of all ten discretization methods over the 21 datasets is also displayed.

From Fig. 2, it can be seen that even for the same learner (AntMiner+), the performance across different datasets differs according to the discretization method used. Among the 21 datasets employed in this experiment, the difference in AntMiner+ performance associated with the ten models for each dataset, resulting from the use of different discretization methods, is found to be statistically significant in 12 datasets. In particular, the difference is quite large in three datasets; namely, the following is noted.

1) In the h_h dataset, the relative improvement obtained by kon over fay exceeds 176% (= (80.17 − 28.96)/28.96 × 100%).
2) In the liver disorders dataset, efb10 improves over the default discretization method fay by more than 41% in predictive accuracy.
3) The improvement obtained when using eib5 over the default fay is over 54% for the wpbc dataset.

In the remaining nine datasets, the difference among the ten models for each dataset is not found to be statistically significant. This result is not surprising for the derma dataset, which has only one numeric attribute against 32 nominal attributes. However, for two datasets, namely new_thy and wdbc, all the predictive attributes are numeric.

Among the ten models generated, the best rank is obtained by fay, followed by efb10, the highest number of times (5:21 and 4:21, respectively). If the four eib models were aggregated (eib5, eib10, eib15, and eib20), then eib would score the best rank 9:21 times, followed by efb at 7:21.
The highest number of times a model is selected belongs to models using entropy-based discretization methods. Models based on binning discretization methods require almost double the time.

Fig. 3  Average predictive accuracy and time/s over all datasets for AntMiner+ per discretization method

Fig. 4  Average model size over all datasets for AntMiner+ per discretization method

5.4 Feature subset selection method

A diverse combination of FSS methods is included in the comparison. For each dataset, we compare the AntMiner+ rule induction performance using each of the different feature subset selection methods listed in this section. We also include the AntMiner+ performance when no attribute selection is used in the preprocessing phase. This is done for all datasets in the benchmark except the (arr) dataset, where the number of attributes is relatively high. The Weka Java implementation with default settings for the attribute selection methods is used. A list of the included methods is presented along with the corresponding synonym used in the tables.

The default feature subset selection method in the implementation [58] adopted is rel, with ten as the default number of attributes to retain. Therefore, only datasets having more than ten attributes are included in this experiment (Table 6). The column AntMiner+ version describes the version used (Original or RML), as concluded in Section 5.2. The column discretization method shows the discretization method as per Section 5.3. The non-parametric Friedman test is used to test whether the difference among the predictive accuracy of the selected models is considered statistically significant. The statistical comparison is only done among results associated with the eight FSS methods. The model where no attribute selection is employed (0AS) is not included in the statistical comparisons. The FSS method associated with the best rank is usually chosen. In a few cases, a model other than the best ranked is selected, as justified. Reasoning is usually based on the rule set size and/or the computational time associated with these models.

Table 6  Datasets included in FSS experiment

No. | Dataset | #Attr. | AntMiner+ version | Discretization method
1 | arr | 278 | AM+RML | fay
2 | derma | 33 | AM+Original | efb5
3 | h_c | 13 | AM+RML | eib5
4 | h_h | 13 | AM+RML | kon
5 | h_stat | 13 | AM+Original | efb10
6 | h_swiss | 13 | AM+RML | kon
7 | hep | 19 | AM+RML | efb20
8 | horse | 22 | AM+RML | eib5
9 | hypo | 29 | AM+RML | fay
10 | lymph | 18 | AM+Original | —
11 | park | 22 | AM+Original | eib10
12 | p_tumor | 17 | AM+RML | —
13 | sick | 29 | AM+RML | efb10
14 | wdbc | 32 | AM+Original | efb5
15 | wpbc | 33 | AM+RML | eib5

• Results

The predictive accuracy with the associated standard deviation obtained by AntMiner+, in combination with each of the used FSS methods, is shown in Fig. 5. The FSS method selected for each dataset is shown in Table 9.
In addition, the predictive accuracy, number of rules, number of terms per rule, and computational time per rule set are averaged over all datasets for each FSS method and shown in Figs. 6 and 7. The grand average of all the eight FSS methods over the 15 datasets is also reported. Table 7 shows the (0AS) case, where no attribute selection is employed. Averages are limited to the ten datasets where AntMiner+ produced a non-zero output. The corresponding average for all the FSS methods over the same datasets is also shown.

Table 7  Averages over all datasets for AntMiner+ for FSS vs. all features

FSS? | Acc | Rules | T/R | Time/s
All FSS | 75.97 ± 2.50 | 4.39 ± 1.01 | 3.22 ± 0.29 | 21.89 ± 5.48
0AS | 73.79 ± 2.82 | 4.99 ± 1.04 | 3.89 ± 0.50 | 57.13 ± 9.61
Fig. 6  Average predictive accuracy and time/s over all datasets for AntMiner+ per FSS method

The second confirmed advantage is the acceleration of the search when using FSS methods in general. It is noticed that using FSS methods reduces the computational time. For example, this reduction is up to six times in the derma dataset. In general, the computational time of AntMiner+ without FSS is more than twice as much as that obtained by averaging the computational time of AntMiner+ combined with each of the eight FSS methods. Also, note that the solution size is larger and the accuracy is on average lower.

When comparing results using the different FSS methods, we note that out of the 15 datasets, the difference among AntMiner+ results, when combined with each of the eight FSS methods, is considered statistically significant in 11 datasets. The high imbalance ratio in some (e.g., hypo and sick) seems not to affect the results. Among these, the difference is extremely significant in the h_h, h_swiss, and wpbc datasets. For example, it can be seen that changing the FSS method used with AntMiner+ for the h_h dataset can improve the predictive accuracy from 41.82% when using sym to 80.17% when using rel, thus effectively providing over 91% improvement in accuracy. These three datasets (h_h, h_swiss, and wpbc) exhibit the highest level of class noise, combined with the highest percentage of instances having missing values and a small number of instances (below 300). Most FSS methods proved to be the preferred one for at least one dataset; however, the methods rel, followed by cfs, obtained the largest count of best ranks. When considering grand averages, Fig. 6 shows that the highest overall average is associated with the correlation-based FSS method (cfs). It also features the highest computational time. The follow-up is the ReliefF method (rel). In all models, comparable rule set sizes are found, as depicted in Fig. 7.

5.5 Rule evaluation function

This experiment involves all 25 datasets in the benchmark. AntMiner+ is run on each dataset once for each of the six rule evaluation functions K, M, F, RCM, A+, and SS. The default parameter values recommended in the study by Minnaert et al. [58] for K, M, F, and RCM are used and are shown in Table 8. For each run, the settings concluded for each dataset in the previous sections, regarding the timing of the removal of missing instances, the discretization algorithm, and the attribute selection method, are used. The non-parametric Friedman test is used to test whether the difference among the predictive accuracy of the six models is considered statistically significant. The rule evaluation function associated with the best rank is usually chosen.

Table 8  Default parameters for rule evaluation functions

Rule evaluation function | Parameter | Default value
K | ω | 0.44
M | m | 0.28
F | β | 7
RCM | c | 0.028

• Results

The predictive accuracy with the associated standard deviation obtained by AntMiner+ in combination with each of the used rule evaluation functions is shown in Fig. 8. The figure also displays the rule evaluation function selected for each dataset.
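For concreteness, the K and M functions parameterized in Table 8 can be written as small Python functions. These are standard textbook forms of the Klösgen and m-estimate measures expressed in the TP/FP notation of Section 4.4, not code taken from the AntMiner+ implementation.

```python
def klosgen(tp, fp, d_pos, d_neg, omega=0.44):
    """Klösgen (K): coverage**omega times (precision - class prior)."""
    n = d_pos + d_neg
    if tp + fp == 0:
        return 0.0
    coverage = (tp + fp) / n
    return coverage ** omega * (tp / (tp + fp) - d_pos / n)

def m_estimate(tp, fp, d_pos, d_neg, m=0.28):
    """m-estimate (M): precision smoothed toward the class prior by weight m."""
    n = d_pos + d_neg
    return (tp + m * d_pos / n) / (tp + fp + m)

# A rule covering 40 of 50 positives and 5 of 150 negatives:
print(klosgen(40, 5, 50, 150), m_estimate(40, 5, 50, 150))
```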
Additionally, the predictive accuracy, computational time per rule set, number of rules, and number of terms per rule are averaged over all datasets for each rule evaluation function and are shown in Figs. 9 and 10. The grand average of all six rule evaluation functions over the 25 datasets is also displayed. The policy on ties needs to be clarified: in the derma dataset, a tie is encountered between K and M; however, K is selected, as it produces the higher accuracy average and a smaller model size.

Figure 11 shows the predictive accuracy, along with the standard deviation, for the original AntMiner+ with the default settings versus that of the tuned version. The final model for AntMiner+ after tuning is referred to as AM+Tuned hereafter. Table 9 summarizes those findings.
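The non-parametric testing protocol used throughout Section 5 (Friedman across several models, Wilcoxon for pairs) is available in SciPy. A minimal sketch with made-up per-dataset accuracy vectors:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-dataset accuracies of three competing models.
acc_fay   = np.array([0.79, 0.83, 0.71, 0.88, 0.76, 0.81])
acc_eib10 = np.array([0.77, 0.85, 0.74, 0.86, 0.78, 0.80])
acc_efb10 = np.array([0.80, 0.84, 0.75, 0.87, 0.77, 0.83])

stat, p = friedmanchisquare(acc_fay, acc_eib10, acc_efb10)
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")  # multiple-model comparison

w, p_pair = wilcoxon(acc_fay, acc_efb10)       # pairwise follow-up test
print(f"Wilcoxon W={w}, p={p_pair:.3f}")
```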
…support a decision used by a human; in the medical field, the acceptance of machine learning models presented for the purpose of medical diagnosis or prognosis is highly dependent on their ability to be interpreted and validated [79]. Otherwise, the user will not trust the model presented and, even worse, it might lead to wrong decisions. Artificial neural networks and deep learning techniques are known for their competitive predictive accuracy. However, these techniques generate black-box models, in the form of complex mathematical functions, that are difficult for humans to comprehend. Similarly, statistical learning algorithms, including Bayes classifiers, are also sub-symbolic classification methods that require some background knowledge about the prior probabilities, and thus are not appealing. Support vector machines [80,81] are among the newest and strongest classification techniques. However, the support vectors cannot easily communicate the obtained knowledge to medical experts. Case-based methods and nearest-neighbor techniques merely store training instances and do not create a classification model. Instances are matched to the closest stored instance(s), and classification is based on the closest match(es). Thus, this paradigm is also not suitable for the requirements at hand.

To compare the results of AM+Tuned, a selection of competitive classification algorithms is included. PART [66] is a rule-based classification algorithm. PART extracts rules from decision trees created by the J48 algorithm [28]. In a large comparative study on thirty-three classification algorithms, Lim et al. [82] showed that the C4.5 algorithm had a good speed/accuracy performance. The Weka implementation is used for the non-evolutionary learners with default settings. Since PART and J48 are deterministic algorithms, a single 10-fold cross-validation procedure is used for their evaluation. Evolutionary learners include SLAVE [83] and UCS [84]. SLAVE is a genetic iterative rule learning system based on fuzzy logic theory. UCS is a Michigan-style learning classifier system derived from XCS [85] and specialized for supervised learning tasks. In a comparative study of genetic-based learning classifier systems [86], UCS stands out as a robust classification algorithm. A 10-time, 10-fold cross-validation procedure is used for their evaluation. For UCS and SLAVE, the open-source software tool KEEL [7] is used with default settings.

Figure 12 depicts the predictive accuracy obtained with each of the algorithms when applied to the benchmark datasets. The highest overall average belongs to J48 (77.93%), closely followed by AM+Tuned (77.89%). The lowest is scored by SLAVE (68.05%). The Friedman non-parametric test shows that the difference in predictive accuracy among the five models (AM+Tuned, PART, J48, UCS, and SLAVE) over the 25 datasets is statistically significant (p-value = 0.002). The best rank was scored by AM+Tuned. However, the Holm post-hoc analysis did not detect a difference among the four models AM+Tuned, PART, J48, and UCS. Thus, the AM+Tuned algorithm is considered comparable to state-of-the-art classification algorithms.

Fig. 12  Average predictive accuracy for AM+Tuned, PART, J48, UCS, and SLAVE
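The deterministic-learner protocol above (a single 10-fold cross-validation with default settings) is easy to reproduce outside Weka. In this sketch, scikit-learn's CART-style tree stands in for J48/PART and naive Bayes serves as a statistical baseline; the UCI wdbc dataset is used as the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # the UCI wdbc benchmark dataset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in [("decision tree (CART)", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```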
6 Discussion

Several medical datasets are associated with a considerable percentage of missing values. However, when the intention is to mine knowledge and extract conclusions, one must avoid drawing unwarranted conclusions as much as possible; for example, by flagging imputed values when suitable and evaluating their significance [87]. The safe handling of missing values is not a trivial task. In fact, the selection of the most appropriate missing data handling method is a hard and complex task [39]. This study adopts the removal of missing values in medical datasets and investigates the best timing to apply it. The removal of instances having missing values results in information loss. However, if this step is delayed until after FSS, then the percentage of information loss is considerably decreased, thus allowing more instances for the training and testing process. This conclusion particularly holds for datasets having a larger number of predictor features. The study also finds that, despite the noise normally associated with medical datasets, providing more instances to the learning algorithm improves the classification results. In some cases, the handling of instances with missing values may explain the difference between an algorithm that completely fails to run and another that produces good results.

Discretization enhances model comprehensibility and accelerates the search. The selection of a discretization method depends on the problem tackled and the learner used. It is important to find a balance between the number of intervals generated and the performance obtained, as the search space grows exponentially with the number of intervals. The choice of the discretization method has a profound effect on the model performance. It is not possible to identify a single discretization method as the best over all medical datasets. Among the discretization methods included in this experiment, the use of entropy-based discretization methods has a computational cost advantage in terms of model complexity and computational time. Discretization methods based on binning obtain overall higher averages in predictive accuracy and lower variance than those based on entropy. The best way is to experiment with different discretization methods and select the one producing the best balance of performance versus cost, defined as computational time and model complexity.

Like removing instances having missing values, FSS results in a loss of information, as entire attributes are eliminated. However, the aim is to remove irrelevant, redundant, or noisy features. FSS not only reduces storage and computational complexity but also enhances model comprehensibility and classification accuracy, particularly in small sample size datasets [29]. Finding the optimal subset of m features from all C(n, m) possible feature subsets has been shown to be an NP-hard combinatorial optimization problem [88]. In this study, it is found that for datasets having more than 10 attributes, using FSS methods always proves fruitful in comparison to induction without FSS. No FSS method performs best over all datasets, and this is an expected result. However, when averaged over all datasets, induction using the cfs FSS method scores the highest predictive accuracy with no payoff in rule set size.

When considering different features of the datasets, including the number of instances, the number and type of attributes, imbalance ratio, noise, and missing values, the impact of class noise, the number of instances, and missing values on FSS is found to be high. Missing values especially have a strong effect on induction using FSS. From this experiment, and among the FSS methods considered, the preferred FSS method for each medical dataset has been highlighted.

As for the rule evaluation function, the results obtained in this experiment confirm those found by Minnaert et al. [58]: overall, the K function is the best among the six functions considered for the AntMiner+ algorithm.

The tuning process described here is not meant to be exhaustive, due to time limitations, and thus we do not claim that the declared results are the best settings for each dataset. Even within a single dataset, there may be different requirements related to different attributes. For example, each numeric attribute within the same dataset may have a different optimal discretization method. In this regard, Bacardit et al. [89] include the selection of the best discretization method for each attribute within the genetic algorithm (GA) cycle. However, these steps are intended to improve the initial configuration and tailor it, to some extent, for each dataset in the described benchmark.

When evaluating the performance of AM+Tuned, a closer look at Fig. 11 shows that no significant difference is encountered in 11 out of the 25 datasets. One interesting result is that there is no improvement obtained at all during the tuning process for the new_thy dataset. By investigating the dataset characteristics, we note that it has no missing values, and no attribute selection is needed, as it contains only five attributes. The tuning step concludes that the fay discretization method and the K rule evaluation function are the best suited. These are the same settings as in the original AntMiner+ implementation, and that explains the situation.
In five of the remaining fourteen datasets (h_h, h_swiss, horse, hypo, and sick), the difference is one of success/failure in obtaining an output. The difference is statistically significant in the nine remaining datasets (arr, cmc, derma, echo, haber, liver, park, p_tumor, and wpbc). Thus, in the majority of datasets, there is a significant improvement achieved via the tuning process. The grand average over the 20 datasets in the benchmark for which output is obtained by AM+Original shows an overall significant improvement achieved through the tuning process, as confirmed by the Wilcoxon test applied to compare the two models (AM+Original [72.65 ± 2.89] and AM+Tuned [78.08 ± 1.74]) on the same 20 datasets. The resulting model is robust and comparable to state-of-the-art classification algorithms.

Although this study specifically addresses medical datasets, the recommended preprocessing procedure can be applied to arbitrary datasets with similar features. First, the existence of missing values is addressed. If the dataset contains missing values, then the timing of removing instances with missing values should be examined, whether it is done before or after the FSS step. Next, the discretization process is examined. The discretization procedure applies to datasets with numeric attributes, especially if the selected classification algorithm cannot deal directly with numeric attributes. If this is the case, then a number of discretization methods should be investigated, and the associated classification results for the resulting models compared. Once the best model is chosen according to the measures of concern, such as predictive accuracy or model complexity, the FSS step is considered. This step is particularly recommended for datasets with a small number of instances but a large number of features. Similar to the discretization step, a number of feature subset selection methods are to be explored and the resulting classification models compared.
7 Conclusion

Data preprocessing has a profound effect on the performance of the learner. Each dataset is different, and there is no preprocessing method that is best across all datasets. Deciding the best combination of preprocessing methods for a specific dataset is not possible without trial and comparisons. Technology is advancing rapidly. The advent of various open-source libraries, like Weka and KEEL, hosting an extensive set of off-the-shelf preprocessing methods, combined with the convenience of standard formats like ARFF and advances in computer hardware technology, encourages the integration of automatic tuning for preprocessing operations into the data mining task for each dataset on an individual basis. The idea is suitable for off-line applications.

This work shows the results of tuning the preprocessing stage, applied to AntMiner+ as an illustrative example. For each dataset, the timing of removing instances with missing values was examined. Experiments are done with different feature subset selection and discretization methods for each dataset. The library used includes a variety of common discretization and attribute selection methods from open-source, on-line libraries. Also, the use of several rule evaluation functions inside the AntMiner+ algorithm is investigated. As with any tuning process, the proposed method is time consuming. However, the availability of open-source libraries built around simple standard formats makes a custom tuning of the preprocessing stage convenient. Experiments show that a significant improvement in classification performance, measured by predictive accuracy, is obtained in the majority of datasets in the benchmark through the individualized tuning of the preprocessing operations. Moreover, given a certain classification algorithm, the design of the preprocessing stage can make the difference between complete failure and the achievement of results that are competitive with rival classification algorithms on the same datasets.

Several datasets in the benchmark are associated with a large percentage of missing values. Removing instances with missing values is definitely not the best choice for handling such instances, particularly if performed before attribute selection. Despite the noise associated with medical data, the experiments show that providing more instances for training the rule induction algorithm improves the rule induction experience. Imputation techniques that are based on machine learning methods like ANN and SVM require significant computational time. The mechanism by which data are missing must first be identified, and the method for handling the missing values decided accordingly. The use of a replacement method for missing values prevents the loss of a significant amount of information.

Among the different discretization methods examined in this research, experiments find that entropy-based methods are faster and result in smaller models than those based on binning. However, binning methods achieved a higher predictive accuracy with less variance. As for feature subset selection methods, the impact of class noise, the number of instances, and missing values on feature subset selection is found to be high. Missing values especially have a strong effect. However, these findings are specific to the AntMiner+ algorithm.
The real bounty of this step is that improving the classification potential of a dataset is now a convenient, problem-centered approach to computation.

Acknowledgements  This work was supported by the Research Center of the College of Computer and Information Sciences, King Saud University, Saudi Arabia. The authors are grateful for this support.

References

1. Pham H N A, Triantaphyllou E. An application of a new metaheuristic for optimizing the classification accuracy when analyzing some medical datasets. Expert Systems with Applications, 2009, 36: 9240–9249
2. Almuhaideb S, El-Bachir Menai M. Hybrid metaheuristics for medical data classification. In: Talbi E G, ed. Hybrid Metaheuristics. Springer, 2013, 187–217
3. Peña-Reyes C A, Sipper M. Evolutionary computation in medicine: an overview. Artificial Intelligence in Medicine, 2000, 19(1): 1–23
4. Tanwani A K, Afridi J, Shafiq M Z, Farooq M. Guidelines to select machine learning scheme for classification of biomedical datasets. In: Pizzuti C, Ritchie M D, Giacobini M, eds. Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, 2009, 128–139
5. Almuhaideb S, El-Bachir Menai M. A new hybrid metaheuristic for medical data classification. International Journal of Metaheuristics, 2014, 3(1): 59–80
6. Milne D, Witten I H. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 2013, 194: 222–239
7. Alcalá-Fdez J, Sánchez L, García S, del Jesus M J, Ventura S, Garrell J M, Otero J, Bacardit J, Rivas V M, Fernández J C, Herrera F. KEEL: a software tool to assess evolutionary algorithms to data mining problems. Soft Computing, 2009, 13(3): 307–318
8. Martens D, de Backer M, Haesen R, Vanthienen J, Snoeck M, Baesens B. Classification with ant colony optimization. IEEE Transactions on Evolutionary Computation, 2007, 11(5): 651–665
9. Tanwani A K, Farooq M. Performance evaluation of evolutionary algorithms in classification of biomedical datasets. In: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation: Late Breaking Papers. 2009, 2617–2624
10. Tanwani A K, Farooq M. The role of biomedical datasets in classification. In: Proceedings of the Conference on Artificial Intelligence in Medicine in Europe. 2009
16. … tion problem. Pattern Recognition Letters, 2009, 30(5): 525–534
17. Jourdan L, Dhaenens C, Talbi E G. A genetic algorithm for feature selection in data mining for genetics. In: Proceedings of the 4th Metaheuristics International Conference, Porto. 2010, 29–34
18. Huang J J, Cai Y Z, Xu X M. A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters, 2007, 28(13): 1825–1844
19. Al-Ani A. Feature subset selection using ant colony optimization. International Journal of Computational Intelligence, 2005, 2(1): 53–58
20. Unler A, Murat A. A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research, 2010, 206(3): 528–539
21. Bekkerman R, El-Yaniv R, Tishby N, Winter Y. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 2003, 3: 1183–1208
22. Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(4): 491–502
23. Shin K, Fernandes D, Miyazaki S. Consistency measures for feature selection: a formal definition, relative sensitivity comparison, and a fast algorithm. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2011, 1491–1497
24. Kerber R. ChiMerge: discretization of numeric attributes. In: Proceedings of the 10th National Conference on Artificial Intelligence. 1992, 123–128
25. Liu H, Setiono R. Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 1997, 9(4): 642–645
26. Fayyad U M, Irani K B. Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the International Joint Conference on Artificial Intelligence. 1993, 1022–1029
27. Jin R M, Breitbart Y, Muoh C. Data discretization unification. Knowledge and Information Systems, 2009, 19(1): 1–29
28. Quinlan R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers, 1993
29. Guyon I, Elisseeff A. An introduction to variable and feature selection. The Journal of Machine Learning Research, 2003, 3: 1157–1182
30. Kohavi R, John G H. Wrappers for feature subset selection. Artificial Intelligence, 1997, 97(1–2): 273–324
31. Caruana R, Freitag D. Greedy attribute selection. In: Proceedings of the International Conference on Machine Learning. 1994, 28–36
32. Koza J R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press, 1992
11. Tanwani A K, Farooq M. Classification potential vs. classification by Means of Natural Selection. Cambridge, MA: MIT Press, 1992
accuracy: a comprehensive study of evolutionary algorithms with 33. Breiman L, Friedman J H, Olshen R A, Stone C J. Classification and
biomedical datasets. Learning Classifier System, 2010: 127–144 Regression Trees. New York: Chapman & Hall, 1984
12. Kotsiantis S B. Feature selection for machine learning classification 34. Das S. Filters, wrappers and a boosting-based hybrid for feature selec-
problems: a recent overview. Artificial Intelligence Review, 2011: tion. In: Proceedings of International Conference on Machine Learn-
249–268 ing. 2001, 74–81
13. Whitney A W. A direct method of nonparametric measurement selec- 35. Han J W, Kamber M. Data Mining: Concepts and Techniques. 2nd
tion. IEEE Transactions on Computers, 1971, 20(9): 1100–1103 edition. London: Morgan Kaufmann Publishers, 2006
14. Marill T, Green D. On the effectiveness of receptors in recognition sys- 36. Chlebus B S, Nguyen S H. On finding optimal discretizations for two
tems. IEEE Transactions on Information Theory, 1963, 9(1): 11–17 attributes. In: Polkowski L, Skowron A, eds. Rough Sets and Current
15. Pudil P, Novovičová J, Kittler J. Floating search methods in features Trends in Computing. Springer, 1998, 537–544
election. Pattern Recognition Letters, 1994, 15(10): 1119–1125 37. García S, Luengo J, Sáez J A, López V, Herrera F. A survey of dis-
16. Yusta S C. Different metaheuristic strategies to solve the feature selec- cretization techniques: taxonomy and empirical analysis in supervised
Sarab ALMUHAIDEB et al. Impact of preprocessing on medical data classification 1101
learning. IEEE Transactions on Knowledge and Data Engineering, M, Birattari M, Blum C, Clerc M, Stützle T, Winfield A F T, eds. Ant
2013, 25(4): 734–750 Colony Optimzation and Swarm Intelligence. Springer, 2008, 387–394
38. Wong A K C, Chiu D K Y. Synthesizing statistical knowledge from 57. Cohen W W. Fast effective rule induction. In: Prieditis A, Russell S
incomplete mixed-mode data. IEEE Transactions on Pattern Analysis J, eds. International Conference on Machine Learning. Morgan Kauf-
and Machine Intelligence, 1987, 9(6): 796–805 mann, 1995, 115–123
39. Garcá-Laencina P J, Sancho-Gómez J L, Figueiras-Vidal A R. Pattern 58. Minnaert B, Martens D, de Baker M, Baesens B. To tune or not to tune:
classification with missing data: a review. Neural Computing and Ap- rule evaluation for metaheuristic-based sequential covering algorithms.
plications, 2010, 19(2): 263–282 Data Mining and Knowledge Discovery, 2015, 29(1): 237–272
40. Grzymala-Busse J W, Goodwin L K, Grzymala-Busse W J, Zheng X Q. 59. Almuhaideb S, ElBachir Menai M. A new hybrid metaheuristic for
Handling missing attribute values in preterm birth data sets. In: Slezak medical data classification. International Journal of Metaheuristics,
D, Yao J T, Peters J F, Ziarko W, Hu X H, eds. Rough Sets, Fuzzy Sets, 2014: 1–17
Data Mining, and Granular Computing. Springer, 2005, 342–351 60. Rissanen J. Modeling by shortest data description. Automatica, 1978,
41. Batista G E A P A, Monard M C. An analysis of four missing data treat- 14(5): 465–471
ment methods for supervised learning. Applied Artificial Intelligence, 61. Kononenko I. On biases in estimating multi-valued attributes. In: Pro-
2003, 17(5–6): 519–533 ceedings of International Conference on Artificial Intelligence. 1995,
42. Feng H H, Chen G S, Yin C, Yang B R, Chen Y M. A SVM regression 1034–1040
based approach to filling in missing values. In: Khosla R, Howlett R J, 62. Kira K, Rendell L A. A practical approach to feature selection. In: Pro-
Jain L C, eds. Knowledge-Based Intelligent Information and Engineer- ceedings of the 9th International Workshop on Machine Learning. 1992
ing Systems. Springer, 2005, 581–587 63. Kononenko I. Estimating attributes: analysis and extensions of RE-
43. Gupta A, Lam M S. Estimating missing values using neural networks. LIEF. In: Proceedings of European Conference on Machine Learning.
Journal of the Operational Research Society, 1996, 47(2): 229–238 1994, 171–182
44. Dempster A P, Laird N M, Rubin D B. Maximum likelihood from in- 64. Hall M A. Correlation-based feature selection for machine learning.
complete data via the EM algorithm. Journal of the Royal Statistical Dissertation for the Dotoral Degree. Hamilton, New Zealand: Univer-
Society, Series B (Methodological), 1977, 39(1): 1–38 sity of Waikato, 1999
45. Schneider T. Analysis of incomplete climate data: estimation of mean 65. Liu H, Setiono R. A probabilistic approach to feature selection—a fil-
values and covariance matrices and imputation of missing values. Jour- ter solution. In: Proceedings of International Conference on Machine
nal of Climate, 2001, 14: 853–871 Learning. 1996, 319–327
46. Gourraud P A, Génin E, Cambon-Thomsen A. Handling missing values 66. Frank E, Witten I H. Generating accurate rule sets without global op-
in population data: consequences for maximum likelihood estimation timization. In: Proceedings of the 15th International Conference on
of haplotype frequencies. European Journal of Human Genetics, 2004, Machine Learning. 1998, 144–151
12: 805–812 67. Holte R C. Very simple classification rules perform well on most com-
47. Mcculloch W, Pitts W. A logical calculus of the ideas immanent in ner- monly used datasets. Machine Learning, 1993, 11(1): 63–91
vous activity. Bulletin of Mathematical Biophysics, 1943, 5: 115–133 68. Klösgan W. Problems for knowledge discovery in databases and their
48. Holland J H. Adaptation in Natural and Artificial Systems. Ann Arbor: treatment in the statistics interpreter explora. International Journal of
The University of Michigan Press, 1975 Intelligent Systems, 1992, 7(7): 649–673
49. Dorigo M. Optimization, learning and natural algorithms. Dissertation 69. Janssen F, Fürnkranz J. On the quest for optimal rule learning heuris-
for the Doctoral Degree. Politecnico di Milano, Italy, 1992 tics. Machine Learning, 2010, 78(3): 343–379
50. Kennedy J, Eberhart R. Particle swarm optimization. In: Proceedings 70. Martens D, Baesens B, Fawcett T. Editorial survey: swarm intelligence
of IEEE International Conference on Neural Networks. 1995, 1942– for data mining. Machine Learning, 2010, 82(1): 1–42
1948 71. Hanczara B, Dougherty E R. The reliability of estimated confidence in-
51. Sato T, Hagiwara M. Bee system: finding solution by a concentrated tervals for classification error rates when only a single sample is avail-
search. In: Proceedings of IEEE International Conference on Systems, able. Pattern Recognition, 2013, 64(3): 1067–1077
Man, and Cybernetics. 1997 72. Kohavi R. A study of cross-validation and bootstrap for accuracy esti-
52. Karaboga D. An idea based on honey bee swarm for numerical opti- mation and model selection. In: Proceedings of International Confer-
mization. Technical Report TR06, Erciyes University, 2005 ence on Artificial Intelligence. 1995, 1137–1145
53. Dorigo M, Gambardella L M. Ant colony system: a cooperative learn- 73. García S, Fernández A, Luengo J, Herrera F. A study of statistical tech-
ing approach to the traveling salesman problem. IEEE Transactions on niques and performance measures for genetics-based machine learning:
Evolutionary Computation, 1997, 1(1): 53–66 accuracy and interpretability. Soft Computing, 2009, 13(10): 959–977
54. Parpinelli R S, Lopes H S, Freitas A A. Data mining with an ant colony 74. Wilcoxon F. Individual comparisons by ranking methods. Biometrics
optimization algorithm. IEEE Transactions on Evolutionary Computa- Bulletin, 1945, 1(6): 80–83
tion, 2002, 6(4): 321–332 75. Friedman M. The use of ranks to avoid the assumption of normality
55. Stützle T, Hoos H H. MAX-MIN ant system. Future Generation Com- implicit in the analysis of variance. American Statistical Association,
puter Systems, 2000, 16(8): 889–914 1937, 32(200): 675–701
56. Pellegrini P, Ellero A. The small world of pheromone trails. In: Dorigo 76. Frank A, Asuncion A. UCI machine learning repository. Irvine, CA:
1102 Front. Comput. Sci., 2016, 10(6): 1082–1102
University of California, 2010 for DNA microarrays. Bioinformatics, 2001, 17(6): 520–525
77. Napierala K, Stefanowski J. BRACID: a comprehensive approach to 88. Amaldi E, Kann V. On the approximability of minimizing nonzero vari-
learning rules from imbalanced data. Journal of Intelligent Information ables or unsatisfied relations in linear systems. Theoretical Computer
Systems, 2012, 39(2): 335–373 Science, 1998, 209(1–2): 237–260
78. Orriols-Puig A, Bernadó-Mansilla E. The class imbalance problem in 89. Bacardit J, Butz M. Data mining in learning classifier systems: com-
UCS classifier system: a preliminary study. In: Proceedings of the paring XCS with gassist. In: Proceedings of International Conference
2003–2005 International Conference on Learning Classifier Systems. on Learning Classifier Systems (IWLCS 2003–2005). 2004, 282–290
2007, 161–180
79. Pazzani M J, Mani S, Shankle W R. Acceptance of rules generated by
Sarab Almuhaideb is a PhD student in the Department of Com-
machine learning among medical experts. Methods of Information in
Medicine, 2001, 40(5): 380–385 puter Science, King Saud University, Saudi Arabia. She is a lecturer
80. Vapnik V N. Estimation of Dependences Based on Empirical Data. in the Department of Computer Science, Prince Sultan University,
Springer-Verlag, 1982 Saudi Arabia. Her research interests include issues related to ma-
81. Vapnik V N. The Nature of Statistical Learning Theory. New York: chine learning, evolutionary computation, and hybrid metaheuris-
Springer, 1995
tics.
82. Lim T S, Loh W Y, Shih Y S. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification
algorithms. Machine Learning, 2000, 40(3): 203–228 Mohamed El Bachir Menai received his
83. Gonzalez A, Perez R. Slave: a genetic learning system based on an PhD degree in computer science from Men-
iterative approach. IEEE Transactions on Fuzzy Systems, 1999, 7(2): touri University of Constantine, Algeria,
176–191
and University of Paris VIII, France in
84. Bernadó-Mansilla E, Garrell-Guiu J M. Accuracy based learning clas-
2005. He also received a “Habilitation uni-
sifier systems: models, analysis and applications to classification tasks.
Evolutionary Computation, 2003, 11(3): 209–238 versitaire” in computer science from Men-
85. Wilson S W. Classifier fitness based on accuracy. Evolutionary Com- touri University of Constantine, in 2007 (it
putation, 1995, 3(2): 149–175 is the highest academic qualification in Al-
86. Orriols-Puig A, Casillas J, Bernadó-Mansilla E. A comparative study geria, France and Germany). He is currently a professor in the De-
of several geneticbased supervised learning systems. In: Bull L,
partment of Computer Science at King Saud University, Saudi Ara-
Bernadó-Mansilla E, Holmes J H, eds. Learning Classifier Systems in
Data Mining. Springer, 2008, 205–230 bia. His main interests include evolutionary computing, data min-
87. Troyanskaya O G, Cantor M, Sherlock G, Brown P O, Hastie T, Tib- ing, machine learning, natural language processing, and satisfiabil-
shirani R, Botstein D, Altman R B. Missing value estimation methods ity problems.