
Proc. of the 1993 IEEE Int'l Conf. on Tools with AI,
Boston, Massachusetts, Nov. 1993
Robust Feature Selection Algorithms
Haleh Vafaie and Kenneth De Jong
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030

Abstract

Selecting a set of features which is optimal for a given task is a problem which plays an important role in a wide variety of contexts including pattern recognition, adaptive control, and machine learning. Our experience with traditional feature selection algorithms in the domain of machine learning led to an appreciation for their computational efficiency and a concern for their brittleness. This paper describes an alternate approach to feature selection which uses genetic algorithms as the primary search component. Results are presented which suggest that genetic algorithms can be used to increase the robustness of feature selection algorithms without a significant decrease in computational efficiency.

1. Introduction

Finding an optimal set of features from a large set of candidate features is a problem which occurs in many contexts. In modeling one would like to identify and exploit minimal models of the phenomena under study. In control theory, minimizing the number of control parameters is an important part of the design process. In image understanding and machine learning, finding a minimal set of features necessary for recognition and classification is a key part of designing efficient and implementable systems.

In recent years there has been a significant increase in the size and complexity of the problems undertaken in these and other areas. Along with this has come an increase in the size and complexity of the corresponding feature selection problems, resulting in renewed efforts to automate the process of feature selection, and in some cases, the additional requirement that feature selection occur "on line" because of dynamically changing environments.

The work presented here was motivated by our experiences in using conventional feature selection algorithms for difficult machine learning problems involving texture recognition. In our case there can easily be several hundred candidate features and complex interactions among the features. On such problems conventional feature selection algorithms run fast, but produce results which vary dramatically in quality from quite good to poor. As a consequence, we have explored the use of genetic algorithms (GAs) as a means for improving the robustness of such algorithms without sacrificing too much in speed. We present our initial results in the following sections.

2. Feature Selection Techniques

Since each feature used can increase the cost and running time of a system, there is strong motivation to design and implement systems with small feature sets. At the same time there is a potentially opposing need to include a sufficient set of features to achieve acceptably high performance. This has led to the development of a variety of search techniques for finding an "optimal" subset of features from a larger set of possible features. Exhaustively trying all the subsets is computationally prohibitive when there are a large number of features.

There are two main approaches to avoiding the combinatorial explosion as the number of candidate features grows. The first involves developing problem-specific strategies (heuristics) which use domain knowledge to prune the feature space to a manageable size [5]. The second approach is to use generic heuristics (primarily hill-climbing algorithms) when domain knowledge is costly to exploit or unavailable [8].

In the case of texture recognition, we found ourselves in the second camp: lots of possible features, but little in the way of domain knowledge to assist the search process. We adopted a feature selection algorithm from the literature which involved the basic components illustrated in Figure 1. The search procedure is a classic "greedy" hill-climbing algorithm, sequential backward selection (SBS), which removes one feature at a time until no improvement in the criterion function is obtained. The criterion function is problem specific and can vary from simple performance measures to complex multistage evaluation procedures.

Figure 1: Block diagram of the feature selection process (a search technique generates candidate feature subsets, a criterion function measures the goodness of each feature subset, and the best feature subset found is output)
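For concreteness, the two components of Figure 1 can be written down as plain functions. The short Python sketch below is illustrative only (the names are our own, not those of the original system); the exhaustive baseline it includes also makes clear why brute force is infeasible, since it must score all 2^N subsets.

    from itertools import combinations
    from typing import Callable, FrozenSet, Tuple

    # A criterion function maps a candidate feature subset to a "goodness"
    # score (higher is better).  It is the only problem-specific component.
    Criterion = Callable[[FrozenSet[int]], float]


    def exhaustive_search(n_features: int,
                          criterion: Criterion) -> Tuple[FrozenSet[int], float]:
        """Brute-force baseline: score every one of the 2^n_features subsets.
        Only feasible for very small n_features."""
        best_subset, best_score = frozenset(), float("-inf")
        for size in range(n_features + 1):
            for subset in combinations(range(n_features), size):
                score = criterion(frozenset(subset))
                if score > best_score:
                    best_subset, best_score = frozenset(subset), score
        return best_subset, best_score

A practical search technique such as SBS (Section 3) or a GA (Section 4) replaces the nested loops above while keeping the same criterion-function interface.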

Our experience with the SBS approach suggests that it is very fast, but also quite brittle in the sense that the quality of the results varies widely across data sets. In the following sections we explore in more detail the cause of the brittleness, and develop a more robust approach by replacing the given search procedure with a genetic algorithm.

3. Sources of Brittleness

The SBS algorithm is a search procedure that starts with the complete set of features, and discards one feature at a time until the desired number of features has been reached [4]. At a particular stage, there are m features remaining. To determine which (if any) feature to remove next, each of the m features is evaluated as a candidate for removal by temporarily removing it and computing the effects via the criterion function. The feature whose removal leads to the largest improvement (lowest decrease in the criterion function) is removed permanently. The process is then repeated for the remaining m-1 features until no improvements are obtained.
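As an illustration, the following Python sketch captures the SBS procedure just described. The code is illustrative rather than the original implementation; variable names and tie handling are our own, and the stopping rule is the no-improvement criterion stated above.

    from typing import Callable, FrozenSet, Set


    def sbs(all_features: Set[int],
            criterion: Callable[[FrozenSet[int]], float]) -> FrozenSet[int]:
        """Sequential backward selection: a greedy hill-climber over feature
        subsets.  At each stage every remaining feature is tentatively removed,
        the criterion is recomputed, and the removal giving the highest score
        is made permanent, but only if it improves on the current score."""
        current = frozenset(all_features)
        current_score = criterion(current)
        while len(current) > 1:
            # Evaluate removing each remaining feature in turn.
            candidates = [(criterion(current - {f}), current - {f}) for f in current]
            best_score, best_subset = max(candidates, key=lambda c: c[0])
            if best_score <= current_score:
                break  # no single-feature removal helps: terminate
            current, current_score = best_subset, best_score
        return current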
An insightful way to visualize this process is to represent the entire feature set as a binary string of N boolean variables, each representing the presence or absence of the ith feature. As illustrated in Figure 2, the SBS algorithm starts with the string of all ones, and moves up the lattice one level at a time by selecting the node at the next level which results in maximal improvement.

Notice that for problems with N > 2, committing to a particular node generally makes other higher level nodes in the lattice inaccessible. For example, committing to 011 makes 100 inaccessible at the next level. This is both the source of the power and the brittleness of the algorithm. Its running time is clearly no worse than N^2 in terms of the number of different feature sets evaluated before termination. It achieves this, however, by ignoring any possible interactions among the features. Hence, it is possible for no single feature to yield an improvement (thus terminating the search) while the simultaneous removal of 2 or 3 features might result in significant improvements. Similarly, a member of the optimal feature set can be removed early in the search process to obtain some initial improvement, but then be unavailable for later more significant improvements. In addition, if the criterion function contains any imprecision or noise, such greedy algorithms become even less reliable.

Clearly, any attempt at improving the situation by looking at all second order and possibly third order feature interactions quickly becomes computationally prohibitive. An alternative which seemed quite hopeful was the use of genetic algorithms, which are best known for their ability to efficiently search large spaces about which little is known, and which are relatively insensitive to noise. In the next section this approach is described in more detail.

4. Feature Selection Using Genetic Algorithms

Genetic algorithms (GAs), a form of inductive learning strategy, are adaptive search techniques initially introduced by Holland [7]. Genetic algorithms derive their name from the fact that their operations are similar to the mechanics of genetic models of natural systems.


Figure 2: Feature set lattices. (a) Two feature lattice; (b) Three feature lattice

Genetic algorithms typically maintain a constant-sized population of individuals which represent samples of the space to be searched. Each individual is evaluated on the basis of its overall fitness with respect to the given application domain. New individuals (samples of the search space) are produced by selecting high performing individuals to produce "offspring" which retain many of the features of their "parents". This eventually leads to a population that has improved fitness with respect to the given goal.

New individuals (offspring) for the next generation are formed by using two main genetic operators, crossover and mutation. Crossover operates by randomly selecting a point in the two selected parents' gene structures and exchanging the remaining segments of the parents to create new offspring. Therefore, crossover combines the features of two individuals to create two similar offspring. Mutation operates by randomly changing one or more components of a selected individual. It acts as a population perturbation operator and is a means for inserting new information into the population. This operator prevents any stagnation that might occur during the search process.

Genetic algorithms have demonstrated substantial improvement over a variety of random and local search methods [2]. This is accomplished by their ability to exploit accumulating information about an initially unknown search space in order to bias subsequent search into promising subspaces. Since GAs are basically a domain independent search technique, they are ideal for applications where domain knowledge and theory is difficult or impossible to provide [3].

The main issues in applying GAs to any problem are selecting an appropriate representation and an adequate evaluation function. Since GAs require the same kind of evaluation function as the SBS algorithm, it can be used without modification. The natural representation for the feature selection problem is precisely the one described earlier, namely a binary string of length N representing the presence or absence of each of the N possible features. The advantage of this representation is that the classical GA operators described before (binary mutation and crossover) can easily be applied to it without any modification. This eliminates the need for designing new genetic operators, or making any other changes to the standard form of genetic algorithms.
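As an illustration of how directly this maps onto code, the Python sketch below shows a simple generational GA over binary feature-inclusion strings, with fitness-proportional selection, one-point crossover, and bit-flip mutation. The parameter defaults mirror the settings reported in Section 5, but the implementation details are illustrative and are not those of GENESIS.

    import random
    from typing import Callable, List

    Chromosome = List[int]  # bit i == 1 means feature i is included


    def ga_feature_selection(n_features: int,
                             fitness: Callable[[Chromosome], float],
                             pop_size: int = 50,
                             p_mutation: float = 0.001,
                             p_crossover: float = 0.6,
                             generations: int = 100) -> Chromosome:
        """Simple generational GA over binary feature-inclusion strings."""
        population = [[random.randint(0, 1) for _ in range(n_features)]
                      for _ in range(pop_size)]

        def select(pop: List[Chromosome], scores: List[float]) -> Chromosome:
            # Fitness-proportional (roulette-wheel) selection; scores are
            # shifted so the weights stay positive even for negative fitness.
            low = min(scores)
            weights = [s - low + 1e-9 for s in scores]
            return random.choices(pop, weights=weights, k=1)[0]

        best = max(population, key=fitness)
        for _ in range(generations):
            scores = [fitness(ind) for ind in population]
            offspring: List[Chromosome] = []
            while len(offspring) < pop_size:
                p1, p2 = select(population, scores), select(population, scores)
                c1, c2 = p1[:], p2[:]
                if random.random() < p_crossover:          # one-point crossover
                    point = random.randint(1, n_features - 1)
                    c1, c2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
                for child in (c1, c2):                     # bit-flip mutation
                    for i in range(n_features):
                        if random.random() < p_mutation:
                            child[i] = 1 - child[i]
                    offspring.append(child)
            population = offspring[:pop_size]
            best = max(population + [best], key=fitness)
        return best

A chromosome returned by this sketch is interpreted exactly as in the SBS lattice of Figure 2: bit i set to 1 means feature i is retained.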

5. Experimental Results

In performing the experiments reported here, the SBS algorithm was used as described above. For the GA approach, GENESIS [6], a general purpose genetic algorithm program, was used with the standard parameter settings recommended in [2]: a population size = 50, a mutation rate = 0.001, and a crossover rate = 0.6. The results presented here show the average performance of ten runs.

5.1 Artificial Problem Sets

The first set of experiments involved constructing a family of artificial problems to test our hypotheses about the sources of brittleness of the greedy feature selection algorithm and to compare its performance with the GA approach. The basic idea involved constructing families of functions in which the degree of interaction among features could be controlled so as to make the problem either easier or harder for the greedy algorithm.

As indicated earlier in Figure 2, three features (bits) are the minimal size problem in which non-linear interactions can cause problems. We can construct a function with the desired properties by thinking of it as a function of 3 boolean variables (one for each feature) and writing out all the first, second and third order terms:

    f(x0,x1,x2) = a*x0 + b*x1 + c*x2
                + d*x0*x1 + e*x0*x2 + f*x1*x2     (1)
                + g*x0*x1*x2

In general, not all the terms are needed to produce sufficient interaction to cause problems. In this example, we seek a function which (referring back to Figure 2) assigns the optimal value to 100, but causes the SBS algorithm to select 011 on the first round. This can be easily achieved in a variety of ways. For example, we can let b=c and d=e=0. Then (1) simplifies to:

    f(x0,x1,x2) = a*x0 + b*x1 + b*x2
                + f*x1*x2 + g*x0*x1*x2            (2)

This results in the following assignments of "value" to all possible subsets of the 3 features:

    x0,x1,x2    f(x0,x1,x2)
    000         0
    100         a            (global optimum)
    001         b
    010         b
    011         2b+f         (local optimum)
    101         a+b
    110         a+b
    111         a+2b+f+g

This can be further simplified by letting m = 2b+f. Our goal can then be achieved by requiring:

    a > 0, a > b, a > m          (global optimum at 100)
    m > a+b, m > a+m+g           (local optimum at 011)

The constraints a > m and m > a+b imply that b < 0. If we arbitrarily let b = -4, then m = 2b+f = f-8. For simplicity we can set f = 9, resulting in m = 1, and assign g = -5, yielding:

    x0,x1,x2    f(x0,x1,x2)
    000         0
    100         a            (global optimum)
    001         -4
    010         -4
    011         1            (local optimum)
    101         a-4
    110         a-4
    111         a-4

By varying the value of a > 0, we can test our hypothesis about the source of brittleness of the SBS algorithm. While 0 < a < 1, the feature set 011 is the global optimum and should be easily found. When 1 < a < 5, the feature set 011 is the local optimum and the global optimum 100 should never be found. When a >= 5, the feature set 111 is a local optimum, and the global optimum should never be found.

In the first experiment, we added to this minimal 3-feature dependency problem an additional 27 features which had no effect on the criterion function, but served to increase the size of the search space significantly (2^30). Figure 3 presents the results for varying a between 0 and 10. The behavior of the SBS algorithm was as expected, finding the global optimum only when a < 1. For this simple case, the GA approach consistently found the global optimum.

To make things more difficult, we changed the problem slightly by replicating the 3-feature dependency 3 times and then included 21 additional bits as before to create a total search space of 2^30. Figure 4 shows the results of applying both algorithms to this problem with a again ranging from 0 to 10. In this case the GA did not always find the global optimum (it's a heuristic as well), but again significantly outperformed SBS when a > 1.
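The resulting test problem is easy to reproduce. The Python sketch below encodes equation (2) with b = -4, f = 9, g = -5 (so it reproduces the second table above) and pads the string with 27 extraneous bits, as in the first experiment; the code itself is illustrative rather than the original implementation.

    from typing import Sequence


    def three_feature_value(x0: int, x1: int, x2: int, a: float) -> float:
        """Value of the minimal 3-feature dependency, i.e. equation (2) with
        b=-4, f=9, g=-5, which reproduces the table above:
        f(100)=a, f(011)=1, f(111)=a-4, and so on."""
        return a * x0 - 4 * x1 - 4 * x2 + 9 * x1 * x2 - 5 * x0 * x1 * x2


    def padded_criterion(bits: Sequence[int], a: float) -> float:
        """Criterion over a 30-bit string: only the first three bits matter;
        the remaining 27 bits are extraneous and do not affect the score."""
        assert len(bits) == 30
        return three_feature_value(bits[0], bits[1], bits[2], a)

With a = 4, for example, starting from 111 (value 0) the single best removal is x0, giving 011 (value 1); from 011 every further removal drops the value to -4, so SBS halts at the local optimum and never reaches 100 (value 4).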

Figure 3: The comparison results of varying a (SBS vs. GA; x-axis: coefficient a)

Figure 4: The comparison results of varying a (SBS vs. GA; x-axis: coefficient a)

To see more clearly the effect that increasing the number of 3-feature dependencies would have on the two algorithms, we fixed a = 4 and varied the number of 3-feature dependencies from 1 to 10, in each case filling in as before with additional extraneous bits to keep the search space at 2^30. The results are shown in Figure 5 and indicate quite clearly that the advantage of a GA approach increases with the number of such dependencies.

Figure 5: Varying the number of repetitions of 3-feature dependencies (x-axis: repetitions)

As a final experiment in this artificial domain, we constructed m-feature dependencies by overlapping 3-feature problems (i.e., shared features). Figure 6 presents the results of this experiment. Again, as expected, the GA approach performs significantly better as the number of dependencies grows.

Figure 6: Varying the number of interdependencies (x-axis: dependencies)
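The replicated problems behind Figures 4 and 5 can be sketched in the same style. The text states only that the 3-feature dependency was replicated and padded with extraneous bits; the sketch below assumes the copies contribute additively, which is an assumption on our part.

    from typing import Sequence


    def replicated_criterion(bits: Sequence[int], a: float, copies: int = 3) -> float:
        """Sum of `copies` independent 3-feature dependencies laid out over the
        first 3*copies bits; the rest of the 30-bit string is extraneous.
        Summing the copies is our assumption; only the replication itself is
        described in the text."""
        assert len(bits) == 30 and 3 * copies <= 30
        total = 0.0
        for c in range(copies):
            x0, x1, x2 = bits[3 * c], bits[3 * c + 1], bits[3 * c + 2]
            total += a * x0 - 4 * x1 - 4 * x2 + 9 * x1 * x2 - 5 * x0 * x1 * x2
        return total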

These initial results served to clarify our understanding of the two algorithms and provided insight into the kinds of behavior seen in earlier experiments using real data, namely, that the high variance in the performance of the greedy algorithm was due to feature interdependencies, and that a GA approach would be more robust in general, but less efficient when there were few or no such interactions. In the following section we describe two classes of experiments that support this view.

5.2 Realistic Problem Sets

The first example is based on texture images that were randomly selected from the Brodatz album of textures [1]. These images are depicted in Figure 7. One hundred feature vectors, each containing 18 features, were then randomly extracted from an arbitrarily selected area of 30 by 30 pixels from each of the chosen textures. The goal of this experiment was to find an optimal set of features to be used by AQ15 in order to induce texture classification rules (for a detailed description, see [9]). In order to obtain precise measurements of the classification accuracy associated with a particular feature set, AQ15 had to be run to produce the rules, and then the rules had to be tested for accuracy. Since this can be a computationally expensive process for large data sets, a heuristic evaluation function taken from the feature selection literature was used instead. The heuristic involves selecting feature subsets which maximally separate classes using a Euclidean distance measure [4]. More specifically, the function to be maximized is:

    J(e) = 1/2 * sum(i=1..c) sum(j=1..c) P_i * P_j * (1/(n_i*n_j))
           * sum(k=1..n_i) sum(l=1..n_j) d(x_k^i, x_l^j)

where
    c               denotes the number of classes
    x_k^i, x_l^j    are testing examples from class i and j respectively
    d(.,.)          represents the Euclidean distance between two elements
    n_i, n_j        denote the number of training examples for the class i and j respectively
    P_i, P_j        are the probabilities for the class i and j respectively

Figure 8 shows the results of a typical experiment on this data. J(e) induced relatively little interaction among the features, so both algorithms find the optimal subset, but SBS requires fewer trials (executions of the given criterion function).

Unfortunately, the heuristic evaluation function is not all that good, and so the improvement of the actual recognition rate of the AQ15-produced rules was not strongly correlated with improvements in J(e).
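A criterion of this kind is straightforward to compute. The Python sketch below assumes the standard probability-weighted average of pairwise Euclidean distances between class examples, the usual form of this measure in [4]; the exact normalization of J(e) may differ, so the code is illustrative only.

    import math
    from typing import Dict, List, Sequence


    def interclass_distance(data: Dict[str, List[Sequence[float]]],
                            selected: Sequence[int]) -> float:
        """Average between-class Euclidean distance over the selected features.
        `data` maps each class label to its list of example feature vectors.
        Assumed form: J = 0.5 * sum_i sum_j P_i * P_j * mean pairwise distance
        between examples of class i and class j."""
        labels = list(data)
        total_examples = sum(len(v) for v in data.values())
        priors = {c: len(data[c]) / total_examples for c in labels}

        def dist(u: Sequence[float], v: Sequence[float]) -> float:
            return math.sqrt(sum((u[k] - v[k]) ** 2 for k in selected))

        j = 0.0
        for ci in labels:
            for cj in labels:
                pairs = [dist(u, v) for u in data[ci] for v in data[cj]]
                j += 0.5 * priors[ci] * priors[cj] * sum(pairs) / len(pairs)
        return j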

Figure 7: The texture images used in this set of experiments

Figure 8: The comparison results of performance over time (SBS vs. GA; x-axis: trials)

In our second example we used a smaller data set (breast cancer data) and used the more computationally expensive way to evaluate feature subsets: by running AQ15 for each feature set to be evaluated and measuring the classification accuracy of the rules produced on the test data. As one might expect, the combination of the rule induction process and the classification evaluation function produced rather strong interactions among the features. Figure 9 shows the results of a typical experiment in this context. The robustness of the GA approach is quite evident here.

Figure 9: The comparison results of performance over time (SBS vs. GA; x-axis: trials)
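This more expensive evaluation is a classic wrapper: for each candidate subset, induce a classifier from the training data restricted to those features and score it on held-out test data. The Python sketch below uses hypothetical `induce_rules` and `classify` stand-ins for the AQ15 rule learner and its rule matcher, which are not shown here.

    from typing import Callable, List, Sequence, Tuple

    Example = Tuple[Sequence[float], str]  # (feature vector, class label)


    def wrapper_accuracy(selected: Sequence[int],
                         train: List[Example],
                         test: List[Example],
                         induce_rules: Callable[[List[Example]], object],
                         classify: Callable[[object, Sequence[float]], str]) -> float:
        """Evaluate a feature subset by training on the projected training set
        and measuring classification accuracy on the projected test set.
        `induce_rules` and `classify` are hypothetical stand-ins for the rule
        learner used in the experiments."""
        def project(x: Sequence[float]) -> List[float]:
            return [x[i] for i in selected]

        model = induce_rules([(project(x), y) for x, y in train])
        correct = sum(classify(model, project(x)) == y for x, y in test)
        return correct / len(test)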
6. Summary and Conclusions

The goal of the research reported here was to understand better the observed brittleness of traditional feature selection algorithms and to use that knowledge to develop more robust approaches. The results suggest that the source of the brittleness is a tendency to get trapped on local peaks caused by interdependencies among features. Extending these algorithms directly to avoid such local minima was viewed as computationally prohibitive. Rather, a fairly straightforward implementation of genetic algorithms proved quite effective in improving the robustness of feature selection over a range of problems without significant increases in computational complexity.

At the same time, it should be noted that the traditional approach is more efficient when the number of interacting features is small. An interesting open question is whether a multistrategy approach could be developed which could combine the two approaches since, in general, information about the degree of interactions is not available a priori.

Acknowledgments

This research was conducted in the Center for Artificial Intelligence at George Mason University. The Center's research is supported in part by the Advanced Research Projects Agency under grant No. N00014-91-J-1854, administered by the Office of Naval Research, and under grant No. F49620-92-J-0549, administered by the Air Force Office of Scientific Research, and in part by the Office of Naval Research under grant No. N00014-91-J-1351, and in part by the National Science Foundation under grant No. IRI-9020266.

References

[1] Brodatz, P. "A Photographic Album for Arts and Design," Dover Publishing Co., Toronto, Canada, 1966.

[2] De Jong, K. "Analysis of the behavior of a class of genetic adaptive systems," Ph.D. Thesis, Department of Computer and Communications Sciences, University of Michigan, Ann Arbor, MI, 1975.

[3] De Jong, K. "Learning with Genetic Algorithms: An overview," Machine Learning, Vol. 3, Kluwer Academic Publishers, 1988.

[4] Devijver, P., and Kittler, J. "Pattern Recognition: A Statistical Approach," Prentice Hall, 1982.

[5] Dom, B., Niblack, W., and Sheinvald, J. "Feature selection with stochastic complexity," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Rosemont, IL, 1989.

[6] Grefenstette, J. J. Technical Report CS-83-11, Computer Science Dept., Vanderbilt Univ., 1984.

[7] Holland, J. H. "Adaptation in Natural and Artificial Systems," University of Michigan Press, Ann Arbor, MI, 1975.

[8] Kittler, J. "Feature set search algorithms," in Pattern Recognition and Signal Processing, C. H. Chen, Ed., Sijthoff and Noordhoff, The Netherlands, 1978.

[9] Vafaie, H., and De Jong, K. A. "Improving the Performance of a Rule Induction System Using Genetic Algorithms," Proceedings of the First International Workshop on Multistrategy Learning, Harpers Ferry, West Virginia, USA, 1991.

