Int'l Conf. on Tools with AI,
Boston, Massachusetts, Nov. 1993
Robust Feature Selection Algorithms
Haleh Vafaie and Kenneth De Jong
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030
... renewed efforts to automate the process of feature selection, and in some cases, the additional requirement that feature selection occur "on line" because of dynamically changing environments.

The work presented here was motivated by our experiences in using conventional feature selection algorithms for difficult machine learning problems involving texture recognition. In our case there can easily ...

... the literature which involved the basic components illustrated in Figure 1. The search procedure is a classic "greedy" hill-climbing algorithm, sequential backward selection (SBS), which removes one feature at a time until no improvement in the criterion function is obtained. The criterion function is problem specific and can vary from simple performance measures to complex multistage evaluation procedures.
[Figure 1: Feature selection components: a search technique proposes candidate feature subsets, a criterion function scores the goodness of each subset, and the best feature subset is output.]
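Read as a design, Figure 1 describes a wrapper in which any search technique can be paired with any criterion function. The short Python sketch below only illustrates that interface; the type aliases and the name select_features are assumptions for exposition, not from the paper.

    from typing import Callable, FrozenSet

    # A feature subset is a set of feature indices; the criterion function maps
    # a subset to a "goodness" score; a search technique explores subsets using
    # the criterion and returns the best subset it finds.
    FeatureSubset = FrozenSet[int]
    Criterion = Callable[[FeatureSubset], float]
    SearchTechnique = Callable[[int, Criterion], FeatureSubset]

    def select_features(n_features: int,
                        search: SearchTechnique,
                        criterion: Criterion) -> FeatureSubset:
        # The wrapper simply delegates: the search technique and the criterion
        # function are interchangeable components, as in Figure 1.
        return search(n_features, criterion)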
Our experience with the SBS approach suggests that it is very fast, but also quite brittle in the sense that the quality of the results varies widely across data sets. In the following sections we explore in more detail the cause of the brittleness, and develop a more robust approach by replacing the given search procedure with a genetic algorithm.

3. Sources of Brittleness

The SBS algorithm is a search procedure that starts with the complete set of features, and discards one feature at a time until the desired number of features has been reached [4]. At a particular stage, there are m features remaining. To determine which (if any) feature to remove next, each of the m features is evaluated as a candidate for removal by temporarily removing it and computing the effects via the criterion function. The feature whose removal leads to the largest improvement (lowest decrease in the criterion function) is removed permanently. The process is then repeated for the remaining m-1 features until no improvements are obtained.
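As a concrete illustration of the procedure just described, the following Python sketch implements a greedy backward pass that stops when no single removal improves the criterion. It assumes a criterion function scoring subsets of feature indices (higher is better) and is not the authors' code.

    def sbs(n_features, criterion):
        # Sequential backward selection: start from the full feature set and
        # repeatedly drop the one feature whose removal most improves the
        # criterion; stop as soon as no single removal yields an improvement.
        current = frozenset(range(n_features))
        best_score = criterion(current)
        while len(current) > 1:
            # Score every candidate obtained by removing a single feature.
            candidates = [(criterion(current - {f}), current - {f})
                          for f in current]
            score, subset = max(candidates, key=lambda c: c[0])
            if score <= best_score:
                break                      # no improvement: terminate
            current, best_score = subset, score
        return current, best_score

Each level scores at most N candidate removals, which is where the bound of roughly N^2 criterion evaluations discussed below comes from.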
An insightful way to visualize this process is to represent the entire feature set as a binary string of N boolean variables, each representing the presence or absence of the ith feature. As illustrated in Figure 2, the SBS algorithm starts with the string of all ones, and moves up the lattice one level at a time by selecting the node at the next level which results in maximal improvement.

Notice that for problems with N > 2, committing to a particular node generally makes other higher level nodes in the lattice inaccessible. For example, committing to 011 makes 100 inaccessible at the next level. This is both the source of the power and the brittleness of the algorithm. Its running time is clearly no worse than N^2 in terms of the number of different feature sets evaluated before termination. It achieves this, however, by ignoring any possible interactions among the features. Hence, it is possible for no single feature to yield an improvement (thus terminating the search) while the simultaneous removal of 2 or 3 features might result in significant improvements. Similarly, a member of the optimal feature set can be removed early in the search process to obtain some initial improvement, but then be unavailable for later more significant improvements. In addition, if the criterion function contains any imprecision or noise, such greedy algorithms become even less reliable.

Clearly, any attempt at improving the situation by looking at all second order and possibly third order feature interactions quickly becomes computationally prohibitive. An alternative which seemed quite promising was the use of genetic algorithms, which are best known for their ability to efficiently search large spaces about which little is known, and which are relatively insensitive to noise. In the next section this approach is described in more detail.

4. Feature Selection Using Genetic Algorithms

Genetic algorithms (GAs), a form of inductive learning strategy, are adaptive search techniques initially introduced by Holland [7]. Genetic algorithms derive their name from the fact that their operations are similar to the mechanics of genetic models of natural systems.
Genetic algorithms typically maintain a constant-sized population of individuals which represent samples of the space to be searched. Each individual is evaluated on the basis of its overall fitness with respect to the given application domain. New individuals (samples of the search space) are produced by selecting high-performing individuals to produce "offspring" which retain many of the features of their "parents". This eventually leads to a population that has improved fitness with respect to the given goal.

New individuals (offspring) for the next generation are formed by using two main genetic operators, crossover and mutation. Crossover operates by randomly selecting a point in the two selected parents' gene structures and exchanging the remaining segments of the parents to create new offspring. Therefore, crossover combines the features of two individuals to create two similar offspring. Mutation operates by randomly changing one or more components of a selected individual. It acts as a population perturbation operator and is a means for inserting new information into the population. This operator prevents any stagnation that might occur during the search process.

Genetic algorithms have demonstrated substantial improvement over a variety of random and local search methods [2]. This is accomplished by their ability to exploit accumulating information about an initially unknown search space in order to bias subsequent search into promising subspaces. Since GAs are basically a domain-independent search technique, they are ideal for applications where domain knowledge and theory are difficult or impossible to provide [3].

The main issues in applying GAs to any problem are selecting an appropriate representation and an adequate evaluation function. Since GAs require the same kind of evaluation function as the SBS algorithm, it can be used without modification. The natural representation for the feature selection problem is precisely the one described earlier, namely a binary string of length N representing the presence or absence of each of the N possible features. The advantage of this representation is that the classical GA operators described before (binary mutation and crossover) can easily be applied to it without any modification. This eliminates the need for designing new genetic operators, or making any other changes to the standard form of genetic algorithms.
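To make the preceding discussion concrete, here is a sketch of a bit-string GA for feature selection. It is a simplified stand-in, not the GENESIS system used in the experiments; the default parameter values simply mirror the settings reported in Section 5 (population size 50, mutation rate 0.001, crossover rate 0.6), while the selection scheme and the generation count are assumptions.

    import random

    def ga_feature_selection(n_features, criterion, pop_size=50,
                             mutation_rate=0.001, crossover_rate=0.6,
                             generations=100, seed=None):
        # Each individual is a length-N binary string; bit i = 1 means feature i
        # is included. Fitness is the criterion value of the encoded subset.
        rng = random.Random(seed)

        def fitness(bits):
            return criterion(frozenset(i for i, b in enumerate(bits) if b))

        pop = [[rng.randint(0, 1) for _ in range(n_features)]
               for _ in range(pop_size)]
        best = max(pop, key=fitness)
        for _ in range(generations):
            # Fitness-proportional selection of parents (scores are shifted so
            # that all weights stay positive even for negative criteria).
            scores = [fitness(ind) for ind in pop]
            floor = min(scores)
            weights = [s - floor + 1e-9 for s in scores]
            parents = rng.choices(pop, weights=weights, k=pop_size)
            nxt = []
            for i in range(0, pop_size, 2):
                p1, p2 = parents[i][:], parents[(i + 1) % pop_size][:]
                if rng.random() < crossover_rate:
                    # One-point crossover: swap the tails of the two parents.
                    pt = rng.randrange(1, n_features)
                    p1[pt:], p2[pt:] = p2[pt:], p1[pt:]
                for child in (p1, p2):
                    # Bit-flip mutation perturbs individual components.
                    for j in range(n_features):
                        if rng.random() < mutation_rate:
                            child[j] = 1 - child[j]
                    nxt.append(child)
            pop = nxt[:pop_size]
            best = max(pop + [best], key=fitness)
        return frozenset(i for i, b in enumerate(best) if b), fitness(best)

Because the representation is the same binary string used by SBS, the same criterion function can be passed to either search procedure unchanged.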
5. Experimental Results
In performing the experiments reported here, the SBS algorithm was used as described above. For the GA approach, GENESIS [6], a general purpose genetic algorithm program, was used with the standard parameter settings recommended in [2]: a population size of 50, a mutation rate of 0.001, and a crossover rate of 0.6. The results presented here show the average performance of ten runs.

5.1 Artificial Problem Sets

The first set of experiments involved constructing a family of artificial problems to test our hypotheses about the sources of brittleness of the greedy feature selection algorithm and to compare its performance with the GA approach. The basic idea involved constructing families of functions in which the degree of interaction among features could be controlled so as to make the problem either easier or harder for the greedy algorithm.

As indicated earlier in Figure 2, three features (bits) are the minimal size problem in which non-linear interactions can cause problems. We can construct a function with the desired properties by thinking of it as a function of 3 boolean variables (one for each feature) and writing out all the first, second and third order terms:

    f(x0,x1,x2) = a*x0 + b*x1 + c*x2
                + d*x0*x1 + e*x0*x2 + f*x1*x2          (1)
                + g*x0*x1*x2

In general, not all the terms are needed to produce sufficient interaction to cause problems. In this example, we seek a function which (referring back to Figure 2) assigns the optimal value to 100, but causes the SBS algorithm to select 011 on the first round. This can be easily achieved in a variety of ways. For example, we can let b=c and d=e=0. Then (1) simplifies to:

    f(x0,x1,x2) = a*x0 + b*x1 + b*x2
                + f*x1*x2 + g*x0*x1*x2                 (2)

This results in the following assignments of "value" to all possible subsets of the 3 features:

    x0,x1,x2    f(x0,x1,x2)
    000         0
    100         a            (global optimum)
    001         b
    010         b
    011         2b+f         (local optimum)
    101         a+b
    110         a+b
    111         a+2b+f+g

This can be further simplified by letting m=f+2b. Our goal can then be achieved by requiring:

    a > 0, a > b, a > m          (global optimum at 100)
    m > a+b, m > a+m+g           (local optimum at 011)

The constraints a > m and m > a+b imply that b < 0. If we arbitrarily let b=-4, then m = 2b+f = f-8. For simplicity we can set f=9, resulting in m=1, and assign g=-5, yielding:

    x0,x1,x2    f(x0,x1,x2)
    000         0
    100         a            (global optimum)
    001         -4
    010         -4
    011         1            (local optimum)
    101         a-4
    110         a-4
    111         a-4

By varying the value of a > 0, we can test our hypothesis about the source of brittleness of the SBS algorithm. While 0 < a < 1, the feature set 011 is the global optimum and should be easily found. When 1 < a < 5, the feature set 011 is the local optimum and the global optimum 100 should never be found. When a >= 5, the feature set 111 is a local optimum, and the global optimum should never be found.

In the first experiment, we added to this minimal 3-feature dependency problem an additional 27 features which had no effect on the criterion function, but served to increase the size of the search space significantly (2^30). Figure 3 presents the results for varying a between 0 and 10. The behavior of the SBS algorithm was as expected, finding the global optimum only when a < 1. For this simple case, the GA approach consistently found the global optimum.

To make things more difficult, we changed the problem slightly by replicating the 3-feature dependency 3 times and then included 21 additional bits as before to create a total search space of 2^30. Figure 4 shows the results of applying both algorithms to this problem with a again ranging from 0 to 10. In this case the GA did not always find the global optimum (it is a heuristic as well), but again significantly outperformed SBS when a > 1.
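For readers who want to reproduce the flavor of this experiment, the sketch below (illustrative code, not the authors') builds the criterion of equation (2) with b=-4, f=9, g=-5 plus inert padding features, and can be fed to the SBS and GA sketches given earlier; with 1 < a < 5 the greedy pass removes x0 first and then stalls at the 011 pattern, while the global optimum is the 100 pattern.

    def make_test_criterion(a, n_padding=27):
        # Criterion over subsets of 3 + n_padding feature indices implementing
        # f = a*x0 + b*x1 + b*x2 + f*x1*x2 + g*x0*x1*x2 with b=-4, f=9, g=-5.
        # Padding features (indices 3 and above) do not affect the value; they
        # only enlarge the search space, as in the 2^30 experiment.
        b, f_coef, g = -4, 9, -5

        def criterion(subset):
            x0, x1, x2 = (1 if i in subset else 0 for i in range(3))
            return (a * x0 + b * x1 + b * x2
                    + f_coef * x1 * x2 + g * x0 * x1 * x2)

        return criterion

    # Hypothetical usage with the earlier sketches (names are illustrative):
    # crit = make_test_criterion(a=3)          # 1 < a < 5
    # print(sbs(30, crit))                      # stalls with x0 removed (011 pattern)
    # print(ga_feature_selection(30, crit))     # usually keeps only x0 among the
    #                                           # first three bits (100 pattern)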
[Figure 3: The comparison results of varying a (x-axis: Coefficient a)]

[Figure 4: (x-axis: Coefficient a)]

[Figure 5: Varying the number of repetitions of 3-feature dependencies (x-axis: repetition)]
These initial results served to clarify our understanding of the two algorithms and provided insight into the kinds of behavior seen in earlier experiments using real data, namely, that the high variance in the performance of the greedy algorithm was due to feature interdependencies, and that a GA approach would be more robust in general, but less efficient when there were few or no such interactions. In the following section we describe two classes of experiments that support this view.

5.2 Realistic Problem Sets

The first example is based on texture images that were randomly selected from the Brodatz album of textures [1]. These images are depicted in Figure 7. One hundred feature vectors, each containing 18 features, were then randomly extracted from an arbitrarily selected area of 30 by 30 pixels from each of the chosen textures. The goal of this experiment was to find an optimal set of features to be used by AQ15 in order to induce texture classification rules (for a detailed description, see [9]). In order to obtain precise measurements of the classification accuracy associated with a particular feature set, AQ15 had to be run to produce the rules, and then the rules had to be tested for accuracy. Since this can be a computationally expensive process for large data sets, a heuristic evaluation function taken from the feature selection literature was used instead. The heuristic involves selecting feature subsets which maximally separate classes using a Euclidean distance measure [4]. More specifically, the function to be maximized, J(e), is defined in terms of:

    - the number of classes,
    - the testing examples from classes i and j respectively,
    - the Euclidean distance between two elements,
    - the number of training examples for classes i and j respectively, and
    - the probabilities for classes i and j respectively.

Figure 8 shows the results of a typical experiment on this data. J(e) induced relatively little interaction among the features, so both algorithms find the optimal subset, but SBS requires fewer trials (executions of the given criterion function).

Unfortunately, the heuristic evaluation function is not all that good, and so the improvement of the actual recognition rate of the AQ15-produced rules was not strongly correlated with improvements in J(e).
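The precise formula is not reproduced above, but criteria of this family are typically an average pairwise Euclidean distance between examples of different classes, weighted by the class probabilities. The sketch below is a hedged reconstruction in that spirit, not the exact J(e) used in the paper; the function name, the equal-priors default, and the data layout are assumptions.

    import math

    def interclass_distance(data_by_class, feature_subset, class_probs=None):
        # Average Euclidean distance, measured over the selected features only,
        # between examples drawn from different classes, weighted by the class
        # probabilities. `data_by_class` maps a class label to a list of
        # feature vectors (e.g. the 18-feature texture vectors).
        labels = list(data_by_class)
        if class_probs is None:
            # Assume equal class priors when none are supplied.
            class_probs = {c: 1.0 / len(labels) for c in labels}
        idx = sorted(feature_subset)

        def dist(u, v):
            return math.sqrt(sum((u[k] - v[k]) ** 2 for k in idx))

        total = 0.0
        for i, ci in enumerate(labels):
            for cj in labels[i + 1:]:
                xs, ys = data_by_class[ci], data_by_class[cj]
                avg = sum(dist(u, v) for u in xs for v in ys) / (len(xs) * len(ys))
                total += class_probs[ci] * class_probs[cj] * avg
        return total

A feature subset that pushes the classes apart scores higher, so a criterion of this kind can be maximized directly by either the SBS or the GA sketches given earlier.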
6. Summary and Conclusions
[6] Grefenstette, John J. Technical Report CS-83-11, Computer Science Dept., Vanderbilt Univ., 1984.