
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 2, No.1, January 2011

A Study on Feature Selection Techniques in Bio-Informatics

S. Nirmala Devi
Department of Master of Computer Applications
Guru Nanak College, Chennai, India
[email protected]

Dr. S. P. Rajagopalan
Department of Master of Computer Applications
Dr. M.G.R Educational and Research Institute, Chennai, India
[email protected]

Abstract— The availability of massive amounts of experimental data based on genome-wide studies has given impetus in recent years to a large effort in developing mathematical, statistical and computational techniques to infer biological models from data. In many bioinformatics problems the number of features is significantly larger than the number of samples (high feature-to-sample ratio datasets), and feature selection techniques have become an apparent need in many bioinformatics applications. This article makes the reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques and discussing their uses in common and upcoming bioinformatics applications.

Keywords- Bio-Informatics; Feature Selection; Text Mining; Literature Mining; Wrapper; Filter; Embedded Methods.

I. INTRODUCTION

During the last ten years, the desire and determination for applying feature selection techniques in bioinformatics has shifted from being an illustrative example to becoming a real prerequisite for model building. The high-dimensional nature of the modeling tasks in bioinformatics, going from sequence analysis over microarray analysis to spectral analyses and literature mining, has given rise to a wealth of feature selection techniques being presented in the field.

This article focuses on the application of feature selection techniques. In contrast with other dimensionality reduction techniques such as projection and compression, feature selection techniques do not alter the original representation of the variables, but merely select a subset of them; thus, they preserve the original semantics of the variables. Feature selection is also known as variable selection, feature reduction, attribute selection or variable subset selection.

Feature selection helps to acquire a better understanding of the data by indicating which features are important and how they are related to each other, and it can be applied to both supervised and unsupervised learning. Feature selection for unsupervised learning (clustering) is a more complex issue, and research in this field is only recently getting more attention in several communities; the focus here is on supervised learning, where the class labels are known in advance.

The main aim of this study is to raise awareness of the necessity and benefits of applying feature selection techniques. It provides an overview of the different feature selection techniques for classification by reviewing the most important application fields in the bioinformatics domain, and the efforts made by the bioinformatics community in developing such procedures are highlighted. Finally, this study points to some useful data mining and bioinformatics software packages that can be used for feature selection.

II. FEATURE SELECTION TECHNIQUES

Feature selection is the process of removing features from the data set that are irrelevant with respect to the task that is to be performed. Feature selection can be extremely useful in reducing the dimensionality of the data to be processed by the classifier, reducing execution time and improving predictive accuracy (inclusion of irrelevant features can introduce noise into the data, thus obscuring relevant features). It is worth noting that even though some machine learning algorithms perform some degree of feature selection themselves (such as classification trees), feature space reduction can be useful even for these algorithms. Reducing the dimensionality of the data reduces the size of the hypothesis space and thus results in faster execution time.

As many pattern recognition techniques were originally not designed to cope with large amounts of irrelevant features, combining them with FS techniques has become a necessity in many applications. The objectives of feature selection are:
(a) to avoid overfitting and improve model performance, i.e. prediction performance in the case of supervised classification and better cluster detection in the case of clustering;
(b) to provide faster and more cost-effective models;
(c) to gain a deeper insight into the underlying processes that generated the data.

Instead of just optimizing the parameters of the model for the full feature set, we now need to find the optimal model parameters for the optimal feature subset [1], as there is no guarantee that the optimal parameters for the full feature set are equally optimal for the optimal feature subset.
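The filter idea underlying these objectives can be made concrete in a few lines. The following Python sketch (the function names and the synthetic data are our own illustration, not taken from any package cited later in this article) ranks each feature independently by a simple signal-to-noise score and keeps the top-k, leaving the retained variables' original representation untouched:

```python
import random
from statistics import mean, stdev

def snr_score(values, labels):
    """Signal-to-noise score: |difference of class means| / pooled spread."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    pooled = (stdev(a) + stdev(b)) or 1e-12
    return abs(mean(a) - mean(b)) / pooled

def filter_select(X, y, k):
    """Score each feature column independently; return indices of the top-k."""
    n_features = len(X[0])
    scores = [snr_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Synthetic data: feature 0 tracks the class label, features 1-9 are noise.
random.seed(0)
y = [i % 2 for i in range(40)]
X = [[yi * 2.0 + random.gauss(0, 0.3)] + [random.gauss(0, 1) for _ in range(9)]
     for yi in y]

selected = filter_select(X, y, k=3)
print(selected)  # the informative feature 0 ranks first
```

Because each feature is scored in isolation, this is the univariate filter setting discussed below; feature dependencies are ignored by construction.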

http://ijacsa.thesai.org/

There are three types of feature subset selection approaches, depending on how they combine the feature selection search with the construction of the classification model: filters, wrappers and embedded methods, the last of which perform the feature selection process as an integral part of a machine learning (ML) algorithm. Wrappers use a search algorithm to search through the space of possible feature subsets and evaluate each subset by running a model on it; they can be computationally expensive and have a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter criterion is evaluated. Embedded techniques are embedded in, and specific to, a model.

A. Filter Methods

These methods do not require the use of a classifier to select the best subset of features. They use general characteristics of the data to evaluate features: filter techniques assess the relevance of features from the intrinsic properties of the data. In many cases a feature relevance score is calculated, low-scoring features are removed, and the remaining subset is given as input to the classification algorithm.

They are pre-processing methods. They attempt to assess the merits of features from the data, ignoring the effects of the selected feature subset on the performance of the learning algorithm. Examples are methods that select variables by ranking them through compression techniques or by computing the correlation with the output.

Advantages of filter techniques are that they are independent of the classification algorithm, computationally simple and fast, and that they easily scale to very high-dimensional datasets. Feature selection needs to be performed only once, and then different classifiers can be evaluated.

A disadvantage of filter methods is that they ignore the interaction with the classifier, i.e. the search in the feature subset space is separated from the search in the hypothesis space. Moreover, each feature is considered separately, thereby ignoring feature dependencies, which compared to other types of feature selection techniques can lead to worse classification performance. A number of multivariate filter techniques were introduced in order to overcome this problem.

B. Wrapper methods

These methods assess subsets of variables according to their usefulness to a given predictor: they conduct a search for a good subset using the learning algorithm itself as part of the evaluation function, so the problem boils down to a stochastic state space search. Examples are the stepwise methods proposed in linear regression analysis. Wrapper methods embed the model hypothesis search within the feature subset search. A search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated; the evaluation of a specific subset of features is obtained by training and testing a specific classification model. A search algorithm is then 'wrapped' around the classification model to search the space of all feature subsets. These search methods can be divided into two classes: deterministic and randomized search algorithms.

Advantages of wrapper methods include the interaction between the feature subset search and model selection, and the ability to take feature dependencies into account. A disadvantage is that they have a higher risk of overfitting than filter techniques.

III. APPLICATIONS IN BIOINFORMATICS

A. Feature Selection for Sequence Analysis

Sequence analysis is a multistage process that includes the determination of a sequence (protein, carbohydrate, etc.), its fragmentation and analysis, and the interpretation of the resulting sequence information. This information is useful in that it: (a) reveals the similarities of homologous genes, thereby providing insight into the possible regulation and functions of these genes; and (b) leads to a better understanding of disease states related to genetic variation. New sequencing methodologies, fully automated instrumentation, and improvements in sequencing-related computational resources contribute to the potential for genome-size sequencing projects.

In the context of feature selection, two types of problems can be distinguished: signal and content analysis. Signal analysis focuses on identifying the important motifs in the sequence, such as gene regulatory elements or structural elements. Content analysis, on the other hand, focuses on the broad characteristics of a sequence, such as its tendency to code for proteins or its fulfillment of a certain biological function; feature selection techniques are then applied to focus on the subset of relevant variables.

1) Content Analysis

In the early days of bioinformatics, the focus was on the prediction of subsequences that code for proteins. Many versions of Markov models were developed, because many features are extracted from a sequence and most dependencies occur between adjacent positions. The interpolated Markov model was introduced to deal with the limited amount of samples [2] and the high number of possible features; this method used a filter method to select only relevant features, and interpolation between different orders of the Markov model to deal with small sample sizes. Later, the interpolated Markov model was extended to deal with non-adjacent feature dependencies, resulting in the interpolated context model (ICM), which crosses a Bayesian decision tree with a filter method (λ2) to assess feature relevance. The recognition of promoter regions and the prediction of microRNA targets [3] are further uses of FS techniques in the domain of sequence analysis.

2) Signal Analysis

Many sequence analysis methods are used for the recognition of short, more or less conserved signals in the sequence, representing the binding sites for various proteins or protein complexes. The regression approach is the common approach to find regulatory motifs: motifs are related to gene expression levels, and feature selection is used to search for the motifs that maximize the fit to the regression model [4]. In 2003,


a classification approach was chosen to find discriminative motifs. This method uses the threshold number of misclassification (TNoM) to score genes for relevance to tissue classification. From the TNoM score, a P-value is calculated to represent the significance of each motif, and the motifs are then sorted according to their P-values.

Another line of research is performed in the context of the gene prediction setting, where structural elements such as the translation initiation site (TIS) and splice sites are modeled as specific classification problems. In future research, FS techniques can be expected to be useful for a number of challenging prediction tasks, such as identifying relevant features related to alternative TIS and alternative splice sites.

B. Feature Selection for Microarray Analysis

The human genome contains approximately 20,000 genes. At any given moment, each of our cells has some combination of these genes turned on, and others turned off. Scientists can determine which genes are active in any cell sample or tissue by gene expression profiling, using a technique called microarray analysis. Microarray analysis involves breaking open a cell, isolating its genetic contents, identifying all the genes that are turned on in that particular cell, and generating a list of those genes.

During the last decade, the introduction of microarray datasets stimulated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques because of their small sample sizes and their large dimensionality. Furthermore, additional experimental complications like noise and variability render the analysis of microarray data an exciting domain. Dimension reduction techniques were employed in order to deal with these particular characteristics of microarray data, and soon their application became a de facto standard in the field. Whereas in 2001 the field of microarray analysis was still claimed to be in its infancy, a considerable and valuable effort has since been made to contribute new FS methodologies and to adapt known ones.

1) The Univariate Filter Paradigm

Because of the high dimensionality of most microarray analyses, fast and efficient FS techniques such as univariate filter methods, which are simple yet efficient, have attracted the most attention. The prevalence of these techniques has dominated the field, and comparative evaluations of different FS techniques and classification over DNA microarray datasets have focused on the univariate case. This dominance of the univariate approach can be explained by a number of reasons:
(a) the output of univariate feature rankings is intuitive and easy to understand;
(b) the output of the gene ranking fulfills the objectives and expectations of bio-domain experts who want to subsequently validate the results by laboratory techniques or explore literature searches; such experts may not feel the need for selection techniques that take gene interactions into account;
(c) multivariate gene selection techniques need extra computation time;
(d) there is a possible unawareness among subgroups of gene expression domain experts about the existence of data analysis techniques for selecting genes in a multivariate way.

The detection of the threshold point in each gene that reduces the number of training sample misclassifications, and the setting of a threshold on the observed fold-change differences in gene expression between the states under study, are some of the simplest heuristic rules for the identification of differentially expressed genes. A wide range of new univariate feature ranking techniques has since been developed. These techniques can be divided into two classes: parametric and model-free methods.

Parametric methods assume a given distribution from which the observations (samples) have been generated. The two-sample t-test and ANOVA are among the most widely used techniques in microarray studies, although the usage of their basic form, possibly without justification of their main assumptions, is not advisable [5]. Modifications of the standard t-test that deal with the small sample size and inherent noise of gene expression datasets include a number of t- or t-test-like statistics (differing primarily in the way the variance is estimated) and a number of Bayesian frameworks. Regression modeling approaches and Gamma distribution models are the other types of parametric approaches found in the literature.

Due to the uncertainty about the true underlying distribution of many gene expression scenarios, and the difficulties in validating distributional assumptions because of small sample sizes, non-parametric or model-free methods have been widely proposed as an attractive alternative making less stringent distributional assumptions. The Wilcoxon rank-sum test [6], the between-within classes sum of squares (BSS/WSS) [7] and the rank products method [8] are model-free metrics from the field of statistics that have demonstrated their usefulness in many gene expression studies.

These model-free methods use random permutations of the data to estimate the reference distribution of the statistics, allowing the computation of a model-free version of the associated parametric tests. These techniques deal with the specificities of DNA microarray data and do not depend on strong parametric assumptions. Their permutation principle partly alleviates the problem of small sample sizes in microarray studies, enhancing the robustness against outliers.

2) The multivariate paradigm for filter, wrapper and embedded techniques

Univariate selection methods have certain restrictions and may lead to less accurate classifiers by, e.g., not taking into account gene–gene interactions. Thus, researchers have proposed techniques that try to capture these correlations between genes; correlation-based feature selection (CFS) [9] and several variants of the Markov blanket filter method are examples. The application of multivariate filter methods ranges from


simple bivariate interactions towards more advanced solutions exploring higher-order interactions. Two other solid multivariate filter procedures are the Minimum Redundancy-Maximum Relevance (MRMR) [10] and Uncorrelated Shrunken Centroid (USC) [11] algorithms, highlighting the advantage of using multivariate methods over univariate procedures in the gene expression domain.

Wrapper-based feature selection offers an alternative way to perform a multivariate gene subset selection, incorporating the classifier's bias into the search and thus offering an opportunity to construct more accurate classifiers. Another characteristic of any wrapper procedure is the scoring function used to evaluate each gene subset found. As the 0–1 accuracy measure allows for comparison with previous works, the vast majority of papers use this measure; however, recent proposals advocate the use of methods for the approximation of the area under the ROC curve [12], or the optimization of the LASSO (Least Absolute Shrinkage and Selection Operator) model [13]. ROC curves certainly provide an interesting evaluation measure for screening different types of errors in many biomedical scenarios.

The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. A random forest (a classifier that combines many single decision trees) is one example used to calculate the importance of each gene. Embedded FS techniques use the weights of each feature in linear classifiers, such as SVMs and logistic regression; these weights reflect the relevance of each gene in a multivariate way and thus allow for the removal of genes with very small weights.

Due to the higher computational complexity of wrapper approaches and, to a lesser degree, of embedded approaches, these techniques have not received as much interest as filter proposals. However, it is an advisable practice to pre-reduce the search space with a univariate filter method, and only then apply wrapper or embedded methods, hence fitting the computation time to the available resources.

C. Mass Spectra Analysis

Mass spectrometry (MS) technology is the emerging new and attractive framework for disease diagnosis and protein-based biomarker profiling. A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with its corresponding signal intensity value on the y-axis. A typical MALDI-TOF low-resolution proteomic profile can contain up to 15,500 data points in the spectrum between 500 and 20,000 m/z, and the number of points grows even further using higher-resolution instruments.

For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity. The data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets. Although the amount of publications on mass spectrometry based data mining is not comparable to the level of maturity reached in the microarray analysis domain, an interesting collection of methods has been presented in the last 4–5 years.

Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples, the following crucial step is to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a predictive feature, thus applying FS techniques over initial huge pools of about 15,000 variables, up to around 100,000 variables. A great deal of current studies perform aggressive feature extraction procedures using elaborate peak detection and alignment techniques; these procedures tend to reduce the dimensionality from which supervised FS techniques will start their work to less than 500 variables. This feature extraction step is thus advisable in these MS scenarios in order to set the computational costs of many FS techniques to a feasible size. Similar to the domain of microarray analysis, univariate filter techniques seem to be the most commonly used, even though the use of embedded techniques is certainly emerging as an alternative. Although the t-test maintains a high level of popularity, other parametric measures such as the F-test, as well as a notable variety of non-parametric scores, have also been used in several MS studies. Multivariate filter techniques, on the other hand, are still somewhat underrepresented.

In MS studies, wrapper approaches have demonstrated their usefulness through a group of influential works. In the major part of these papers, different types of population-based randomized heuristics are used as search engines: genetic algorithms [14], particle swarm optimization (Ressom et al., 2005) and ant colony procedures [15]. An increasing number of papers use the embedded capacity of several classifiers to discard input features. Variations of the popular method originally proposed for gene expression domains (using the weights of the variables in the SVM formulation to discard features with small weights) have been broadly and successfully applied in the MS domain. Based on a similar framework, the weights of the input masses in a neural network classifier have been used to rank features. An alternative embedded FS strategy is based on the embedded capacity of random forests and other types of decision tree-based algorithms.

IV. DEALING WITH SMALL SAMPLE DOMAINS

Small sample sizes, with their inherent risk of overfitting, pose a great challenge for many modeling problems in bioinformatics. Two initiatives have emerged in the context of feature selection in response to this novel experimental situation: the use of adequate evaluation criteria, and the use of stable and robust feature selection models.

A. Adequate evaluation criteria

Several papers have warned about the substantial number of applications not performing an independent and honest validation of the reported accuracy percentages. In such cases, a discriminative subset of features is often selected by the


users using the whole dataset. This subset is then used to estimate the accuracy of the final classification model, thus testing the discrimination rule on samples that were already used to propose the final subset of features. The need for an external feature selection process, in which the classification rule is trained at each stage of the accuracy estimation procedure, is gaining ground in bioinformatics community practice. Furthermore, novel predictive accuracy estimation methods with promising characteristics, such as bolstered error estimation, have emerged to deal with the specificities of small sample domains.

B. Ensemble feature selection approaches

An ensemble system is composed of a set of multiple classifiers and performs classification by selecting from the predictions made by each of the classifiers. Since wide research has shown that ensemble systems are often more accurate than any of the individual classifiers of the system alone, it is only natural that ensemble systems and feature selection would be combined at some point.

Instead of choosing one particular FS method, different FS methods can be combined using ensemble FS approaches, accepting their outcome as the final subset. Based on the evidence that there is often not a single universally optimal feature selection technique, and due to the possible existence of more than one subset of features that discriminates the data equally well [11], model combination approaches such as boosting have been adapted to improve the robustness and stability of final, discriminative methods [16]. Methods based on a collection of decision trees (e.g. random forests) can be used to assess the relevance of each feature in an ensemble FS. Although the use of ensemble approaches requires additional computational resources, they offer an advisable framework to deal with small sample domains, provided the extra computational resources are affordable.

V. FEATURE SELECTION IN UPCOMING DOMAINS

A. Single nucleotide polymorphism analysis

A single-nucleotide polymorphism (SNP, pronounced snip) is a DNA sequence variation occurring when a single nucleotide — A, T, C, or G — in the genome (or other shared sequence) differs between members of a species or paired chromosomes in an individual. SNPs are mutations at a single nucleotide position that occurred during evolution and were passed on through heredity, accounting for most of the genetic variation among different individuals. The number of SNPs is estimated at about 7 million in the human genome, and they are at the forefront of many disease-gene association studies. An important step towards disease-gene association is selecting a subset of SNPs that is sufficiently informative but still small enough to reduce the genotyping overhead. Typically, the number of SNPs considered is not higher than tens of thousands, with sample sizes of about 100.

In the past few years several computational methods for htSNP selection (haplotype tagging SNPs; a set of SNPs located on one chromosome) have been proposed. One approach is based on the hypothesis that the human genome can be viewed as a set of discrete blocks that only share a very small set of common haplotypes. The aim of this approach is to identify a subset of SNPs that can either explain a certain percentage of haplotypes or at least distinguish all the common haplotypes. Another common htSNP selection approach is based on pairwise associations of SNPs, and tries to select a set of htSNPs such that each of the SNPs on a haplotype is highly associated with one of the htSNPs [17]. A third approach considers htSNPs as a subset of all SNPs, from which the remaining SNPs can be reconstructed; the idea is to select htSNPs based on how well they predict the remaining set of unselected SNPs.

B. Text and literature mining

Text and literature mining is emerging as a promising area for data mining in biology. Text mining, also referred to as text data mining or text analytics, is the process of deriving high-quality information from text. It usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. The Bag-of-Words (BOW) representation, in which each variable represents one word in the text, is one important representation of text and documents; this representation may lead to very high-dimensional datasets, pointing out the need for feature selection techniques.

In the field of text classification the application of feature selection techniques is common, whereas in the biomedical domain it is still in its infancy. A large number of feature selection techniques have already been developed in the text mining community for tasks such as biomedical document clustering and classification, and they will be of practical use for researchers in biomedical literature mining.

VI. FS SOFTWARE PACKAGES

In order to provide the interested reader with some pointers to existing software packages implementing a variety of feature selection methods, Table I shows an overview of existing software. The software is organized into four sections: general purpose FS techniques, techniques tailored to the domain of microarray analysis, techniques specific to the domain of mass spectra analysis, and techniques to handle SNP selection. All software packages mentioned are free for academic use. For each software package, the main reference, implementation language and website are shown.


TABLE I. SOFTWARE FOR FEATURE SELECTION

General purpose FS software
  WEKA                           Java       Witten and Frank (2005)      http://www.cs.waikato.ac.nz/ml/weka
  Fast Correlation Based Filter  Java       Yu and Liu (2004)            http://www.public.asu.edu/~huanliu/FCBF/FCBFsoftware.html
  MLC++                          C++        Kohavi et al. (1996)         http://www.sgi.com/tech/mlc
  Feature Selection Book         ANSI C     Liu and Motoda (1998)        http://public.asu.edu/~huanliu/FSbook

Microarray analysis FS software
  SAM                            R, Excel   Tusher et al. (2001)         http://www-stat.stanford.edu/~tibs/SAM/
  PCP                            C, C++     Buturovic (2005)             http://pcp.sourceforge.net
  GALGO                          R          Trevino and Falciani (2006)  http://www.bip.bham.ac.uk/bioinf/galgo.html
  GA-KNN                         C          Li et al. (2001)             http://dir.niehs.nih.gov/microarray/datamining/
  Nudge (Bioconductor)           R          Dean and Raftery (2005)      http://www.bioconductor.org/
  Qvalue (Bioconductor)          R          Storey (2002)                http://www.bioconductor.org/
  DEDS (Bioconductor)            R          Yang et al. (2005)           http://www.bioconductor.org/

Mass spectra analysis FS software
  GA-KNN                         C          Li et al. (2004)             http://dir.niehs.nih.gov/microarray/datamining/
  R-SVM                          R, C, C++  Zhang et al. (2006)          http://www.hsph.harvard.edu/bioinfocore/RSVMhome/R-SVM.html

SNP analysis FS software
  CHOISS                         C++, Perl  Lee and Kang (2004)          http://biochem.kaist.ac.kr/choiss.htm
  WCLUSTAG                       Java       Sham et al. (2007)           http://bioinfo.hku.hk/wclustag
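To give a flavour of what the general-purpose packages in Table I implement, the sketch below reconstructs, in simplified form, the core idea behind the Fast Correlation Based Filter of Yu and Liu (2004): rank discrete features by their symmetrical uncertainty with the class, then drop any feature that is more strongly correlated with an already-kept feature than with the class. This is an illustrative reconstruction under our own assumptions, not the packaged FCBF code:

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(a, b):
    """SU(a, b) = 2 * I(a; b) / (H(a) + H(b)), in [0, 1]."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        return 0.0
    mutual_info = ha + hb - entropy(list(zip(a, b)))
    return 2.0 * mutual_info / (ha + hb)

def fcbf(X, y, delta=0.1):
    """Keep features whose SU with the class exceeds delta, then remove any
    feature that is more redundant with a stronger feature than it is
    relevant to the class (the FCBF redundancy criterion)."""
    cols = list(range(len(X[0])))
    su_class = {j: symmetrical_uncertainty([r[j] for r in X], y) for j in cols}
    ranked = sorted((j for j in cols if su_class[j] > delta),
                    key=lambda j: su_class[j], reverse=True)
    selected = []
    for j in ranked:
        fj = [r[j] for r in X]
        if all(symmetrical_uncertainty(fj, [r[i] for r in X]) < su_class[j]
               for i in selected):
            selected.append(j)
    return selected

# Toy discrete data: feature 0 equals the class, feature 1 copies
# feature 0 (redundant), feature 2 is unrelated to the class.
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[yi, yi, i % 2] for i, yi in enumerate(y)]
print(fcbf(X, y))  # → [0]: feature 1 dropped as redundant, feature 2 as irrelevant
```

The multivariate redundancy check is what distinguishes this from a plain univariate ranking: feature 1 scores just as highly as feature 0 against the class, yet is discarded.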
VII. CONCLUSIONS AND FUTURE PERSPECTIVES

In this article, the main contributions of feature selection research in a set of well-known bioinformatics applications were reviewed. The large input dimensionality and the small sample sizes emerge as the two problems common across the bioinformatics domain, and researchers in bioinformatics, machine learning and data mining have designed FS techniques to deal with them.

During the last years a large and fruitful effort has been made in the adaptation and proposal of univariate filter FS techniques. Nevertheless, many researchers in the field still assume that filter FS approaches are restricted to univariate methods; the proposal of multivariate selection algorithms can therefore be considered one of the most promising future lines of work for the bioinformatics community.

A second line of future research is the development of specially fitted ensemble FS approaches that enhance the robustness of the finally selected feature subsets. To alleviate the small sample sizes typical of bioinformatics applications, the further development of such techniques, combined with appropriate evaluation criteria, constitutes an interesting direction for future FS research.

SNPs, text and literature mining, and the combination of heterogeneous data sources are other interesting opportunities, and the extension of FS research towards these upcoming bioinformatics domains is a third future line of work. While in these domains the FS component is not yet as central as, e.g., in the gene expression or MS areas, I believe that its application will become essential in dealing with the high-dimensional character of these applications.

ACKNOWLEDGMENT

I would like to thank the anonymous reviewers for their constructive comments, which significantly improved the quality of this review.

REFERENCES

[1] Daelemans, W., et al. (2003) Combined optimization of feature selection and algorithm parameter interaction in machine learning of language. In Proceedings of the 14th European Conference on Machine Learning (ECML 2003), pp. 84-95.
[2] Salzberg, S., et al. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res., 26, 544–548.
[3] Saeys, Y., et al. (2007) In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi, and protists. Bioinformatics, 23, 414–420.
[4] Keles, S., et al. (2002) Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167–1175.
[5] Jafari, P. and Azuaje, F. (2006) An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Med. Inform. Decis. Mak., 6, 27.
[6] Thomas, J., et al. (2001) An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res., 11, 1227–1236.
[7] Dudoit, S., et al. (2002) Comparison of discriminant methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
[8] Breitling, R., et al. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett., 573, 83–92.
[9] Wang, Y., et al. (2006) Tumor classification based on DNA copy number aberrations determined using SNP arrays. Oncol. Rep., 5, 1057–1059.
[10] Ding, C. and Peng, H. (2003) Minimum redundancy feature selection from microarray gene expression data. In Proceedings of the IEEE Conference on Computational Systems Bioinformatics, pp. 523–528.
[11] Yeung, K. and Bumgarner, R. (2003) Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol., 4, R83.
[12] Ma, S. and Huang, J. (2005) Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics, 21, 4356–4362.
[13] Ghosh, D. and Chinnaiyan, M. (2005) Classification and selection of biomarkers in genomic data using LASSO. J. Biomed. Biotechnol., 2005, 147–154.
[14] Li, T., et al. (2004) A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20, 2429–2437.
[15] Ressom, H., et al. (2007) Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics, 23, 619–626.
[16] Ben-Dor, A., et al. (2000) Tissue classification with gene expression profiles. J. Comput. Biol., 7, 559–584.
[17] Carlson, C., et al. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74, 106–120.
[18] Dunham, M.H. and Sridhar, S. (2008) Data Mining: Introductory and Advanced Topics.
[19] Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151–1160.
[20] Kohavi, R., et al. (1996) Data mining using MLC++: a machine learning library in C++. In Tools with Artificial Intelligence, IEEE Computer Society Press, Washington, DC, pp. 234–245.
[21] Inza, I., et al. (2000) Feature subset selection by Bayesian networks based optimization. Artif. Intell., 123, 157–184.
[22] Van Landeghem, S. (2008) Extracting protein-protein interactions from text using rich feature vectors and feature selection. 77–84.
[23] Gutkin, M. (2009) A method for feature selection in gene expression-based disease classification.
[24] Varshavsky, R., et al. (2006) Novel unsupervised feature filtering of biological data. Bioinformatics, 22, e507–e513.
[25] Chen, J.Y. and Lonardi, S. (2009) Biological Data Mining.
[26] Smialowski, P. (2010) Pitfalls of supervised feature selection. Bioinformatics, 26, 440–443.
