Improving Floating Search Feature Selection Using Genetic Algorithm
Abstract. Classification, a process for predicting the class of a given input data,
is one of the most fundamental tasks in data mining. Classification performance
is negatively affected by noisy data and therefore selecting features relevant to
the problem is a critical step in classification, especially when applied to large
datasets. In this article, a novel filter-based floating search technique that selects an optimal set of features for classification is proposed. A genetic algorithm is employed to improve the quality of the features
selected by the floating search method in each iteration. A criterion function is
applied to select relevant and high-quality features that can improve
classification accuracy. The proposed method was evaluated on 20 standard machine learning datasets of various sizes and complexities. The results show that the proposed method is effective in general across different classifiers and performs well in comparison with recently reported techniques. In addition, applying the proposed method with a support vector machine gives the best performance among the classifiers studied and outperforms previously reported results on the majority of the datasets.
1 Introduction
Classification, a process for predicting the class of a given input data, is one of
the most fundamental tasks in data mining. A number of available methods are
commonly used for data classification, such as: decision trees; rule-based,
probabilistic and instance-based methods; support vector machines (SVMs);
and neural networks. Noisy and irrelevant data are major obstacles to data mining. They adversely affect system performance in terms of classification accuracy, model building time, and the size and interpretability of the model obtained [1,2].
These issues can introduce new properties in the problem domain. For example,
noise can lead to the creation of small clusters of examples of a particular class
in areas of the domain corresponding to another class, or it can cause missing
data of examples located in key areas within a specific class [3].
Consequently, Bins and Draper [5] proposed a technique to reduce a large set of
features (1,000) to a much smaller subset without removing any highly
important features or decreasing classification accuracy. There are three steps in
the algorithm: first, irrelevant features are removed using a modified form of the
relief algorithm [6]; second, redundant features are eliminated using K-means
clustering [7]; and, finally, a combinatorial feature selection algorithm, sequential floating backward selection (SFBS), is applied to the remaining feature subset. The basic concept is to filter the feature subset in each step until the smallest possible one is obtained.
2 Background
There are three main types of feature selection methods: filter, wrapper, and hybrid. Filter methods evaluate features using intrinsic properties of the data, independently of any learning algorithm, whereas wrapper methods rely on a classification algorithm to evaluate candidate feature subsets [9]. Maroño et al. [10] proposed a wrapper method that applies ANOVA decomposition and functional networks to create the evaluation function. In general, the wrapper approach gives better performance than the filter approach since the feature selection process is optimized for the specific classification algorithm. Nevertheless, when wrapper methods are applied to high-dimensional datasets, they incur a high computational cost and may become infeasible.
Hybrid methods exploit the positive aspects of both wrapper and filter methods
[4]. They utilize a filter-based technique to select highly representative features
and apply a wrapper-based technique to add candidate features and evaluate the
candidate subsets in order to select the best ones. This not only reduces the
dimensionality of the data but also decreases the computational cost and
improves classification performance. Somol et al. [8] proposed a hybrid SFFS
method by employing an evaluation function to filter some features and using a
wrapper criterion to identify the optimal feature subset. Their experimental
results showed that the method yielded a promising classification accuracy.
SFBS, the counterpart of the forward search, is initialized with the full feature set and sequentially eliminates one feature at a time after execution of SFFS. The SFFS search selects the best unselected feature according to a criterion function to form a new feature subset, while the SFBS search iteratively determines which members of the selected subset should be removed if the remaining set improves performance according to the same criterion function as used in the forward search. The algorithm loops back to the forward search until the stopping condition is reached. Both algorithms have disadvantages: SFFS cannot eliminate redundant features generated during the search process, whereas SFBS cannot re-evaluate the usefulness of a discarded feature together with the other features at a later stage.
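For concreteness, the following is a minimal sketch of the floating forward search with a pluggable criterion function J (our illustration, not the authors' implementation); J can be any subset score, such as mutual information or classifier accuracy, and the target size and tie-breaking details are simplifications.

```python
from typing import Callable, FrozenSet, Set


def sffs(features: Set[int], J: Callable[[FrozenSet[int]], float], target_size: int):
    """Sequential forward floating selection (sketch).

    features     -- indices of all candidate features
    J            -- criterion function scoring a feature subset (higher is better)
    target_size  -- stop once the selected subset reaches this size
    """
    selected: Set[int] = set()
    best_by_size = {}  # best (score, subset) recorded for each subset size

    while len(selected) < target_size:
        # Inclusion: add the unselected feature that maximizes J(selected + {x}).
        x_plus = max(features - selected, key=lambda x: J(frozenset(selected | {x})))
        selected.add(x_plus)
        score = J(frozenset(selected))
        if len(selected) not in best_by_size or score > best_by_size[len(selected)][0]:
            best_by_size[len(selected)] = (score, frozenset(selected))

        # Conditional exclusion: drop features while doing so strictly improves
        # on the best subset previously recorded at the smaller size.
        while len(selected) > 2:
            x_minus = max(selected, key=lambda x: J(frozenset(selected - {x})))
            reduced = frozenset(selected - {x_minus})
            if J(reduced) > best_by_size.get(len(reduced), (float("-inf"), None))[0]:
                selected = set(reduced)
                best_by_size[len(reduced)] = (J(reduced), reduced)
            else:
                break
    return best_by_size
```

The strict inequality in the exclusion test is what prevents the search from immediately removing the feature it has just added.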
Consider a binary random variable X that takes the value 0 with probability p and the value 1 with probability 1 - p. Its entropy is

H(X) = -p log p - (1 - p) log(1 - p) = H(p)                (2)
Note that the entropy function does not depend on the values that the random variable takes (0 and 1 in this case) but only on its probability distribution, p(x).
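As a quick numerical check of Eq. (2) (our illustration), the snippet below computes H(p) for a binary variable.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy H(p) of a binary random variable, Eq. (2); 0 log 0 is taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


print(binary_entropy(0.5))   # 1.0 bit: maximum uncertainty
print(binary_entropy(0.9))   # ~0.469 bits: a biased variable carries less information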
GAs have been successfully applied to feature selection [20]; they avoid an exhaustive search and save computational time by locating promising regions of the search space and selecting high-quality feature subsets. Furthermore, hybrid GAs [21] incorporate local search operators into a simple GA to improve its fine-tuning ability.
Based on the principle of survival of the fittest, the fitness function is the means by which a GA evaluates each individual and steers the search toward an optimal solution after the genetic operators are applied. This process is repeated over many generations until the stopping criterion is satisfied. For feature selection, a feature subset is represented as a binary string: each feature is either included in or excluded from the subset.
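As an illustration of this binary encoding (our sketch, not the authors' code), the snippet below scores a chromosome by summing the mutual information between each selected feature and the class label, computed with scikit-learn's mutual_info_classif; the use of the Iris dataset and the summed-MI approximation of the subset criterion are our assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)                # Iris is one of the 20 datasets used
mi = mutual_info_classif(X, y, random_state=0)   # MI of each feature with the class


def fitness(chromosome: np.ndarray) -> float:
    """Score a binary chromosome (1 = feature selected) by summed mutual information."""
    return float(mi[chromosome.astype(bool)].sum())


chromosome = np.array([1, 0, 1, 1])              # selects features 0, 2 and 3
print(fitness(chromosome))
```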
Our GA-based feature improvement step is included after the exclusion step in each iteration. The objective is to replace the weakest feature: at each sequential step, the algorithm checks whether removing a feature from the currently selected subset and adding a promising one improves the current feature subset. The chromosome structure consists of binary genes
corresponding to individual features. The value of 1 at the ith gene means that
the ith feature is selected; otherwise it is 0.
The initial population is generated from the subset of size k + 1 produced by the exclusion step: the weakest feature is first removed from the best subset, yielding a subset of size k, and each remaining feature is then added to that subset in turn, generating a niched initial population for the GA. The fitness function used in this study is MI. A new population is then created by selection, crossover, and mutation. The process terminates when the current feature set reaches a size of D - 2 features.
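The sketch below outlines how such an improvement step could be wired in after the exclusion step, under our reading of the description above; the tournament selection, steady-state replacement, and default parameter values (which echo the settings reported later in the experiments) are our own choices, not the authors' code.

```python
import random
from typing import Callable, FrozenSet, List, Set


def improvement_step(best_subset: FrozenSet[int], all_features: Set[int],
                     fitness: Callable[[FrozenSet[int]], float],
                     generations: int = 500, p_cross: float = 0.8,
                     p_mut: float = 0.01) -> FrozenSet[int]:
    """GA-based improvement step run after the exclusion step (illustrative sketch).

    best_subset is the size-(k + 1) subset produced by the exclusion step;
    features are assumed to be indexed 0 .. d - 1.
    """
    d = len(all_features)
    # Drop the weakest feature: the one whose removal leaves the best remaining subset.
    weakest = max(best_subset, key=lambda f: fitness(frozenset(best_subset - {f})))
    core = frozenset(best_subset - {weakest})                      # subset of size k
    # Niched initial population: the core plus each feature not currently in it.
    population: List[FrozenSet[int]] = [core | {f} for f in all_features - core]

    def as_bits(subset: FrozenSet[int]) -> List[int]:
        return [1 if i in subset else 0 for i in range(d)]

    def as_subset(bits: List[int]) -> FrozenSet[int]:
        return frozenset(i for i, b in enumerate(bits) if b)

    for _ in range(generations):
        # Tournament selection of two parents, then single-point crossover and mutation.
        pa, pb = (max(random.sample(population, 2), key=fitness) for _ in range(2))
        a, b = as_bits(pa), as_bits(pb)
        if random.random() < p_cross:
            point = random.randrange(1, d)
            a = a[:point] + b[point:]
        child = [bit ^ 1 if random.random() < p_mut else bit for bit in a]
        population.append(as_subset(child))
        population.remove(min(population, key=fitness))            # steady-state replacement

    return max(population, key=fitness)
```

Within the proposed method, the subset returned here would replace the current best subset of the same size whenever its criterion value is higher.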
Inclusion:   x+ = arg max J(Ym + x) over x ∉ Ym;   Ym+1 = Ym + x+;   m = m + 1
Conditional exclusion:   x- = arg max J(Ym - x) over x ∈ Ym;   if J(Ym - x-) improves on the best subset of size m - 1, then Ym-1 = Ym - x-;   m = m - 1

Figure 3 Pseudo code of the proposed algorithm.
Suppose the crossover point randomly falls after the sixth bit; each new child then receives the bits before the crossover point from one parent and the bits after it from the other.
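A minimal sketch of this single-point crossover follows (the eight-bit parent strings below are arbitrary examples, not taken from the paper).

```python
import random


def single_point_crossover(parent_a, parent_b, point=None):
    """Exchange the tails of two equal-length binary chromosomes after `point`."""
    if point is None:
        point = random.randrange(1, len(parent_a))
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


# Crossover point after the sixth bit, as in the example above.
p1 = [1, 0, 1, 1, 0, 1, 0, 0]
p2 = [0, 1, 1, 0, 1, 0, 1, 1]
c1, c2 = single_point_crossover(p1, p2, point=6)
print(c1)  # first six bits of p1, last two of p2 -> [1, 0, 1, 1, 0, 1, 1, 1]
print(c2)  # first six bits of p2, last two of p1 -> [0, 1, 1, 0, 1, 0, 0, 0]
```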
After all child chromosomes have passed through the mutation operator, the resultant chromosomes are evaluated by the fitness function. After this, we can identify the best-performing feature subset, which is {f5, f7, f6, f2}. We assume that J({f5, f7, f6}) = 8.35 and that J({f5, f7, f6, f2}) = 12, which is larger than the prior largest value for four features, J = 11. Thus, the best four-feature subset becomes {f5, f7, f6, f2} with J = 12, whereas the best three-feature subset remains {f1, f4, f5} since J({f1, f4, f5}) = 9.1 > J({f5, f7, f6}) = 8.35.
The improvement step helps discover subsets that cannot be reached by the greedy nature of SFFS. In the above example, the SFFS algorithm is not able to produce this best four-feature subset because it cannot backtrack to the set {f5, f7, f6}, as J({f1, f4, f5}) = 9.1 > J({f5, f7, f6}) = 8.35, and thus it cannot add feature f2 to the subset {f5, f7, f6}. Note that f2 is never selected in any of the first four best feature sets of the SFFS method: {f1}, {f1, f4}, {f1, f4, f5}, and {f1, f5, f6, f7}.
The example above demonstrates the advantage of our proposed algorithm. The algorithm replaces the weak feature (feature f1 in our example) in the feature set {f1, f5, f7, f6} with feature f2, which results in a new set of four features, {f5, f7, f6, f2}, with a larger J value. Therefore, the search strategy of our proposed algorithm is more thorough than that of the SFFS algorithm and thus more effective.
4 Experimental Evaluations
To evaluate the proposed feature selection algorithm, 20 standard datasets of various sizes and complexities from the UCI machine learning repository [20] were used in the experiments. These datasets have been frequently used as benchmarks to compare the performance of classification methods and contain a mixture of numeric, real, and categorical attributes. Numeric features were pre-discretized using the method described in [23], which begins by sorting the values of a feature and selecting only duplicated values as candidate cut points. After this step, the number of discrete values used to represent each bin is determined, and the range associated with an interval is divided into k intervals depending on the number of replicated values, as sketched below. This modification makes the discretization process faster and yields higher performance than would otherwise be possible.
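Since the method of [23] is only summarized briefly here, the following is a loose sketch of one possible reading, not a faithful reimplementation: values that occur more than once in the sorted feature become candidate cut points, and each value is mapped to the interval it falls into; the fallback to equal-width binning is our addition.

```python
import numpy as np


def discretize_feature(values: np.ndarray) -> np.ndarray:
    """Discretize one numeric feature using duplicated values as cut points
    (a rough sketch of one reading of the method in [23])."""
    sorted_vals = np.sort(values)
    # Values occurring more than once become candidate cut points.
    uniques, counts = np.unique(sorted_vals, return_counts=True)
    cut_points = uniques[counts > 1]
    if cut_points.size == 0:                 # fall back to equal-width binning
        cut_points = np.linspace(values.min(), values.max(), num=4)[1:-1]
    # Map each value to the index of the interval it falls into.
    return np.digitize(values, bins=cut_points)


x = np.array([5.1, 4.9, 5.1, 6.3, 6.3, 7.0, 5.8])
print(discretize_feature(x))
```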
Details of the datasets used in the experiments are shown in Table 1. From the experiments, we found the following parameter settings to be suitable: a population size of 4-100 individuals, a bit-flip mutation rate of 0.01, a single-point crossover rate of 0.75-0.85, and 500 generations.
The CART strategy is applied by choosing, at each node, the feature that maximizes the decrease in impurity ∆i(l).
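For reference, the decrease in impurity at a node can be written as ∆i = i(parent) - pL·i(left) - pR·i(right); the snippet below evaluates it with the Gini index for a hypothetical split (our illustration; the paper does not specify the impurity measure used).

```python
import numpy as np


def gini(labels: np.ndarray) -> float:
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def impurity_decrease(parent: np.ndarray, left: np.ndarray, right: np.ndarray) -> float:
    """Delta-i for a candidate split: parent impurity minus weighted child impurities."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([1, 1])   # a candidate split
print(impurity_decrease(parent, left, right))            # 0.25: impurity drops from 0.5
```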
The Naïve Bayes algorithm is a statistical classifier for supervised learning [24] based on the principle of conditional probability. It can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Its performance has been shown to be excellent in some domains but poor in others, e.g. those with correlated features. The classification system is based on Bayes' rule under the assumption that the effect of an attribute on a given class is independent of the other attributes. This assumption, called class conditional independence, keeps the computation simple. The conditional probability model for the classifier is P(Ci|x). Using Bayes' theorem, we can write Eq. (5) as follows:
P(Ci | x) = P(x | Ci) P(Ci) / P(x)                (5)
where Ci is the ith class and x is the input vector. In this case, class variable C is conditioned on several feature variables x = (x1, ..., xn).
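As a usage illustration only (not the authors' experimental protocol), the snippet below applies a Gaussian Naïve Bayes classifier from scikit-learn to the Iris dataset with a hypothetical two-feature subset and reports 10-fold cross-validated accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Illustration only: Naive Bayes on the Iris dataset (one of the 20 UCI datasets
# used); the selected features, folds, and scores here are not those of the paper.
X, y = load_iris(return_X_y=True)
selected = [2, 3]                      # a hypothetical selected feature subset
scores = cross_val_score(GaussianNB(), X[:, selected], y, cv=10)
print(scores.mean())                   # mean 10-fold classification accuracy
```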
SVMs, originally proposed by Cortes and Vapnik [25], have become important in many classification problems for a variety of reasons, such as their flexibility, computational efficiency, and capacity to handle high-dimensional data. They extract information from a dataset, and classification is achieved by a linear or nonlinear separating surface in the input space of the dataset. SVMs have been applied in many areas, such as bioinformatics, face recognition, text categorization, and handwritten digit recognition. An SVM is a binary classifier that assigns a new data point to a class by minimizing the probability of error.
Given training vectors xi with labels yi ∈ {+1, -1}, the SVM solves

min over w, b, ξ:   (1/2) w^T w + C Σi ξi
subject to   yi (w^T φ(xi) + b) ≥ 1 - ξi,   ξi ≥ 0,

whose dual problem is

min over α:   (1/2) α^T Q α - e^T α                (7)
subject to   y^T α = 0,
0 ≤ αi ≤ C,   i = 1, ..., n,

where e is the vector of all ones, C > 0 is the upper bound, Q is an n by n positive semi-definite matrix with Qij = yi yj K(xi, xj), and K(xi, xj) = φ(xi)^T φ(xj) is the kernel. Here, training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.
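As a usage illustration only, the snippet below trains an RBF-kernel SVM (scikit-learn's SVC) on the Breast Cancer (WDBC) dataset; the regularization parameter C corresponds to the upper bound C in the dual above, and the value used here, like the evaluation protocol, is an arbitrary choice rather than the paper's setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustration only: an RBF-kernel SVM on the Breast Cancer (WDBC) dataset, with
# feature scaling and 10-fold cross-validation; C = 1.0 is not the paper's setting.
X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, X, y, cv=10).mean())
```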
Table 1  Details of the datasets used in the experiments (excerpt).

Dataset                  Attribute Types               Instances  Features  Classes
Iris                     Real                          150        4         3
Pima Indian Diabetes     Integer, Real                 768        8         2
Abalone                  Categorical, Integer, Real    4,177      8         3
Dermatology              Categorical, Real             366        34        6
Heart                    Categorical, Real             270        13        2
German (Credit Card)     Categorical, Integer          1,000      20        2
Lung Cancer              Integer                       32         56        3
Soybean                  Integer                       307        35        4
Spambase                 Integer, Real                 4,601      57        2
Glass Identification     Real                          214        10        7
Teaching Assistant       Categorical, Integer          151        5         3
Contact Lens             Categorical                   24         4         3
Sonar                    Real                          208        60        2
Statlog (Australian)     Categorical, Integer, Real    690        14        2
Ionosphere               Integer, Real                 351        34        2
Image Segmentation       Real                          2,310      19        7
The results in Table 2 show that classification accuracy was noticeably enhanced by the proposed algorithm for all classifiers, compared to classification without feature selection. The best performance was achieved with SVM, where 100% accuracy was reached with 13, 22, 2, and 3 selected features for the Wine, Soybean, Contact Lenses, and Iris datasets, respectively. Additionally, high classification accuracy was achieved with small feature subsets for the Ionosphere, Soybean, Breast Cancer (WDBC), and Statlog (Australian) datasets.
It can be seen that the classification accuracies using SVM, CART, and Naïve Bayes improved significantly, by 7% to 15%, after applying the proposed algorithm to the Wine, Breast Cancer, Statlog (Australian), Soybean, and Ionosphere datasets. We also note that Naïve Bayes yielded lower classification accuracy than SVM or CART. SVM achieved 100.00% classification accuracy in four datasets with the proposed method. Regarding the classification methods, SVM yielded the highest classification accuracy in 65% of the datasets, while CART gave the highest accuracy in 35% of the datasets.
[Table 2, partially recovered from the extraction: per-dataset classification accuracies for each classifier with and without the proposed feature selection. Legible entries include 100.00% for the Soybean, Wine, and Contact Lenses datasets and 98.80% for Dermatology; the rows for Heart, Sonar, and Abalone are too garbled to reconstruct.]
Not only did the proposed algorithm reduce the number of features from 13 to 7, 35 to 22, and 34 to 5 for the Wine, Soybean, and Ionosphere datasets, respectively, it also improved classification accuracy by 12.35%, 17.64%, and 7.14% compared with the accuracy obtained on the full datasets. With the Soybean dataset, the proposed algorithm reduced the number of features from 35 to 22 and the classification accuracy using SVM was 100.00%, which is much higher than that of the other methods. Moreover, the proposed algorithm reduced the number of features from 8 to 3 and from 4 to 2 for the Abalone and Contact Lens datasets, respectively, and accuracy was again higher than with the other methods.
5 Conclusions
Feature selection is critical to the performance of classification. In this paper,
we proposed a feature selection algorithm that improves the performance of
SFFS by incorporating a feature improvement step based on a genetic
algorithm. This step helps discover important subsets that are not possible using
SFFS alone. The algorithm employs mutual information as the feature subset
evaluation function. The proposed technique was evaluated using 20 standard
datasets from the UCI repository using three different classification methods.
The results show that the proposed feature selection technique significantly
improved classification accuracy and gave a much smaller feature set, thus
improving efficiency. In addition, it performed very well in comparison with
previously reported methods.
References
[1] Wu, X. & Zhu, X., Mining with Noise Knowledge: Error-aware Data
Mining, IEEE Transactions on Systems, Man, and Cybernetics – Part A:
Systems and Humans, 38(4), pp. 917-932, 2008.
[2] Sáez, J.A., Luengo, J. & Herrera, F., Evaluating the Classifier
Behavior with Noisy Data Considering Performance and Robustness: the
Equalized Loss of Accuracy Measure, Neurocomputing, 176, pp. 26-35,
2016.
[3] Sáez, J.A., Galar, M., Luengo, J. & Herrera, F., Analyzing the
Presence of Noise in Multi-class Problems: Alleviating Its Influence with
the One-vs-One Decomposition, Knowledge and Information
Systems, 38(1), pp. 179-206, 2014.
[4] Dash, M. & Liu, H., Feature Selection for Classification, Intelligent Data
Analysis, 1(1-4), pp. 131–156, Mar. 1997.
[5] Bins, J. & Draper, B., Feature Selection from Huge Feature Sets,
Proceedings on Eighth IEEE International Conference on Computer
Vision, pp. 159-165, 2001.
[6] Kira, K. & Rendell, L., The Feature Selection Problem: Traditional
Methods and a New Algorithm, AAAI-92 Proceedings, pp. 129-134,
1992.
[7] MacQueen, J., Some Methods for Classification and Analysis of
Multivariate Observations, Proceedings of the Fifth Berkley Symposium
on Mathematics, Statistics and Probability, pp. 281-297, 1967.
[8] Somol, P., Pudil, P., Novovicova, J. & Paclik, P., Flexible Hybrid
Sequential Floating Search in Statistical Feature Selection, Structural,
Syntactic, and Statistical Pattern Recognition, Lecture Notes in Computer
Science, Vol. 4109, Fred, A., Caelli, T., Duin, R.P.W., Campilho, A.,
Ridder, D., eds., pp. 632-639, Springer, 2006.
[9] Guyon, I. & Elisseeff, A., An Introduction to Variable and Feature
Selection, Journal of Machine Learning Research, 3, pp. 1157-1182,
2003.
[10] Maroño, N.S., Betanzos, A. & Castillo, E., A New Wrapper Method for
Feature Subset Selection, Proceedings European Symposium on Artificial
Neural Networks, pp. 515-520, 2005.
[11] Karegowda, A.G., Jayaram, M.A. & Manjunath, A.S., Feature Subset
Selection using Cascaded GA & CFS: A Filter Approach in Supervised
Learning, International Journal of Computer Applications, 23(2), pp. 1-
10, Jun. 2011.
[12] Zhang, H. & Sun, G., Feature Selection using Tabu Search Method,
Pattern Recognition, 35, pp. 701-711, 2002.
[13] Pudil, P., Novovicova, J. & Kittler, J., Floating Search Methods in
Feature Selection, Pattern Recognition Letters, 15, pp. 1119-1125, 1994.
[14] Nakariyakul, S. & Casasent, D.P., An Improvement on Floating Search
Algorithms for Feature Subset Selection, Pattern Recognition, 41(9), pp.
1932-1940, Sep. 2009.
[15] Chaiyakarn, J. & Sornil, O., A Two-Stage Automatic Feature Selection
for Classification, International Journal of Advancements in Computing
Technology, 5(14), pp. 168-179, Oct. 2013.
[16] De Maesschalck, R., Jouan-Rimbaud, D. & Massart, D.L., The
Mahalanobis Distance, Chemometrics and Intelligent Laboratory
Systems, 50(1), pp. 1-18, 2000.
[17] Bruzzone, L. & Serpico, S.B., A Technique for Feature Selection in
Multiclass Problems, International Journal of Remote Sensing, 21(3), pp.
549-563, 2000.
[18] Battiti, R., Using Mutual Information for Selecting Features in
Supervised Neural Net Learning, IEEE Transactions on Neural Networks,
5(4), pp. 537-550, 1994.