Navot PHD

On the Role of Feature Selection in
Machine Learning
Thesis submitted in partial fulllment of the degree of

Doctor of Philosophy
by
Amir Navot
Submitted to the Senate of the Hebrew University
December 2006
ii
iii
This work was carried out under the supervision of Prof. Naftali Tishby.
iv
Acknowledgments
Many people helped me in many ways over the course of my Ph.D. studies and I would
like to take this opportunity to thank them all. A certain number of people deserve special
thanks and I would like to express my gratitude to them with a few words here. The rst
is my supervisor Naftali (Tali) Tishby who taught me a great deal and always supported
me, even when I put forward extremely wild and ill-thought out ideas. Tali played a ma-
jor part in shaping my scientic point of view. The second person is Ran Gilad-Bachrach,
who was both like a second supervisor and a good friend. He guided my rst steps as a
Ph.D. student, and was always ready to share ideas and help with any issue. Our many
discussions on everything were the best part of my Ph.D. studies. My other room mates,
Amir Globerson and Gal Chechik deserve thanks for all the help and the inspiring discus-
sions. Lavi Shpigelman, a good friend and a great classmate, was always ready to help on
everything from a worthy scientic question to a very technical problem with my computer.
Aharon Bar-Hillel challenged me with tough questions but also helped me nd the answers.
My brother, Yiftah Navot, has been my longstanding mentor and is always there to help
me with any mathematical problem. There is no way I can express how much I value his
assistance. My mother always made it very clear that studies are the most important thing.
Of course my wife, Noa, without whose endless support I never would have nished (or
probably have never even started) my Ph.D. studies. Finally I would also like to thank
Esther Singer for her eorts to improve my English, and all the administrative sta of both
the ICNC and CS who were always kind to me and were always willing to lend a hand. I
also thank the Horowitz foundation for the generous funding they have provided me.
v
vi
Abstract
This thesis discusses dierent aspects of feature selection in machine learning, and more
specically for supervised learning. In machine learning the learner (the machine) uses
a training set of examples in order to build a model of the world that enables reliable
predictions. In supervised learning each training example is an (instance, label) pair, and
the learner's goal is to be able to predict the label of a new unseen instance with only a small
chance of erring. Many algorithms have been suggested for this task, and they work fairly
well on many problems; however, their degree of success depends on the way the instances
are represented. Most learning algorithms assume that each instance is represented by a
vector of real numbers. Typically, each number is a result of a measurement on the instance
(e.g. the gray level of a given pixel of an image). Each measurement is a feature. Thus a
key question in machine learning is how to represent the instances by a vector of numbers
(features) in a way that enables good learning performance. One of the requirements of a
good representation is conciseness, since a representation that uses too many features raises
major computational diculties and may lead to poor prediction accuracy. However, in
many supervised learning tasks the input is originally represented by a very large number
of features. In this scenario it might be possible to nd a smaller subset of features that can
lead to much better performance.
Feature selection is the task of choosing a small subset of features that is sucient to
predict the target labels well. Feature selection reduces the computational complexity of
learning and prediction algorithms and saves on the cost of measuring non selected features.
In many situations, feature selection can also enhance the prediction accuracy by improving
the signal to noise ratio. Another benet of feature selection is that the identity of the
vii
viii
selected features can provide insights into the nature of the problem at hand. Therefore
feature selection is an important step in ecient learning of large multi-featured data sets.
On a more general level the feature selection research eld clearly enters into research on
the fundamental issue of data representation.
In chapter 2 we discuss the necessity of feature selection. We raise the question of whether
a separate stage of feature selection is indeed still needed, or whether modern classiers can
overcome the presence of huge number of features. In order to answer this question we
present a new analysis of the simple two-Gaussians classication problem. We rst consider
the maximum likelihood estimation as the underlying classication rule. We analyze its error
as a function of the number of features and number of training instances, and show that
while the error may be as poor as chance when using too many features, it approaches the
optimal error if we chose the number of features wisely. We also explicitly nd the optimal
number of features as a function of the training set size for a few specic examples. Then,
we test SVM [14] empirically in this setting and show that its performance matches the
predictions in the analysis. This suggests that feature selection is still a crucial component
in designing an accurate classier, even when modern discriminative classiers are used, and
even if computational constraints or measuring costs are not an issue.
In chapter 3 we suggest new methods of feature selection for classication which are based
on the maximum margin principle. A margin [14, 100] is a geometric measure for evaluating
the condence of a classier with respect to its decision. Margins already play a crucial role
in current machine learning research. For instance, SVM [14] is a prominent large margin
algorithm. The novelty of the results presented in this chapter lies in the use of large margin
principles for feature selection. The use of margins allows us to devise new feature selection
algorithms as well as to prove a PAC (Probably Approximately Correct) style generalization
bound. The bound is on the generalization accuracy of 1-NN on a selected set of features,
and guarantees good performance for any feature selection scheme which selects a small set
of features while keeping the margin large. On the algorithmic side, we use a margin based
criterion to measure the quality of sets of features. We present two new feature selection
algorithms, G-ip and Simba, based on this criterion. The merits of these algorithms are
demonstrated on various datasets.

ix
In chapter 4 we discuss feature selection for regression (aka function estimation). Once
again we use the Nearest Neighbor algorithm and an evaluation function which is similar in
its nature to the one used for classication in chapter 3. This way we develop a non-linear,
simple, yet eective feature subset selection method for regression and use it in analyzing
cortical neural activity. This algorithm is able to capture complex dependency of the target
function upon its input and makes use of the leave-one-out error as a natural regularization.
We explain the characteristics of our algorithm with synthetic problems and use it in the
context of predicting hand velocity from spikes recorded in motor cortex of a behaving
monkey. By applying feature selection we are able to improve prediction quality and we
suggest a novel way of exploring neural data.
Finally, chapter 5 extends the standard framework of feature selection to consider gen-
eralization in the features axis. The goal of standard feature selection is to select a subset
of features from a given set of features. Here, instead of trying to directly determine which
features are better, we attempt to learn the properties of good features. For this purpose we
assume that each feature is represented by a set of properties, referred to as meta-features.

This approach has three main advantages. First, the selection problem can be considered
as a standard learning problem in itself. This novel viewpoint enables derivation of better
generalization bounds for the joint learning problem of selection and classication. Second,
it allows us to devise selection algorithms that can eciently explore for new good features in
the presence of a huge number of features. Finally, it contributes to a better understanding
of the problem. We also show how this concept can be applied in the context of inductive
transfer. We show that transferring the properties of good features between tasks might be
better than transferring the good features themselves. We illustrate the use of meta-features
in the dierent applications on a handwritten digit recognition problem.

Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
1.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Machine Learning, Articial Intelligence and Statistics . . . . . . . . 7
1.1.4 Machine Learning and Neural Computation . . . . . . . . . . . . . . . 8
1.1.5 Other Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Common paradigms for feature selection . . . . . . . . . . . . . . . . . 16
1.2.2 The Biological/Neuroscience Rationale . . . . . . . . . . . . . . . . . 21
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Is Feature Selection Still Necessary? 24

2.1 Problem Setting and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Observations on The Optimal Number of Features . . . . . . . . . . . 28
2.3 Specic Choices of µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 SVM Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
x
CONTENTS xi
3 Margin Based Feature Selection 38

3.1 Nearest Neighbor classiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Margins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Margins for 1-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Margin Based Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.1 Greedy Feature Flip Algorithm (G-ip) . . . . . . . . . . . . . . . . . 46
3.5.2 Iterative Search Margin Based Algorithm (Simba) . . . . . . . . . . . 47
3.5.3 Comparison to Relief . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.4 Comparison to R2W2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 Empirical Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.1 The Xor Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.2 Face Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7.3 Reuters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.7.4 Face Images with Support Vector Machines . . . . . . . . . . . . . . . 59
3.7.5 The NIPS-03 Feature Selection Challenge . . . . . . . . . . . . . . . . 62
3.8 Relation to Learning Vector Quantization . . . . . . . . . . . . . . . . . . . . 63
3.9 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A Complementary Proofs for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . 66
4 Feature Selection For Regression 69

4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 The Feature Selection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Testing on synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Hand Movement Reconstruction from Neural Activity . . . . . . . . . . . . . 76
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5 Learning to Select Features 81

5.1 Formal Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Predicting the Quality of Features . . . . . . . . . . . . . . . . . . . . . . . . 84

xii CONTENTS
5.3 Guided Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.1 Meta-features Based Search . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 Illustration on Digit Recognition Task . . . . . . . . . . . . . . . . . . 88
5.4 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Generalization Bounds for Mufasa Algorithm . . . . . . . . . . . . . . 92
5.4.2 VC-dimension of Joint Feature Selection and Classication . . . . . . 96
5.5 Inductive Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.1 Demonstration on Handwritten Digit Recognition . . . . . . . . . . . 100
5.6 Choosing Good Meta-features . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.7 Improving Selection of Training Features . . . . . . . . . . . . . . . . . . . . . 104
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A Notation Table for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6 Epilog 108
List of Publications 111
Bibliography 113
Summary in Hebrew I
Chapter 1
Introduction
This introductory chapter provides the context and background for the results discussed in
the following chapters and denes some crucial notation. The chapter begins with a brief
review of machine learning, which is the general context for the work described in this thesis.
I explain the goals of machine learning, present the main learning models currently used in
the eld, and discuss its relationships to other related scientic elds. Then, in section 1.2, I
review the eld of feature selection, which is the subeld of machine learning that constitutes
the core of this thesis. I outline the rationale for feature selection, the dierent paradigms
that are used and survey some of the most important known algorithms. I show the ways
in which research in feature selection is related to biology in section 1.2.2.
1.1 Machine Learning

Machine learning deals with the theoretic, algorithmic and applicative aspects of l earning
from examples. In a nutshell, learning from examples means that we try to build a machine
(i.e. a computer program) that can learn to perform a task by observing examples. Typically,
the program uses the training examples to build a model of the world that enables reliable
predictions. This contrasts with a program that can make predictions using a set of pre-
dened rules (the classical Articial Intelligence (AI) approach). Thus in machine learning
the machine must learn from its own experience, and in that sense it adheres to the very
1
2 Chapter 1. Introduction
old - but still wise - proverb of our Sages of Blessed Memory (Hazal):
oeiqp lrak mkg oi`
which translates to: experience is the best teacher. This proverb implies that a learner
needs to acquire experience on his or her own in order to achieve a good level of performance
and not be satised with the explanations given by the teacher (pre-dened rules in our case).
I elaborate on the advantages of this approach in section 1.1.3.
1.1.1 Supervised Learning

While there are a few dierent learning frameworks in machine learning (see section 1.1.5),
we focus here on supervised learning. In supervised learning we get a labeled sample as input
and use it to predict the label of a new unseen instance. More formally, we have a training
( )
set S m = {xi , y i }m
i=1 , x ∈ R
i N
and y i = c xi where c is an unknown, but xed function.
The task is to nd a mapping h from RN to the label set with a small chance of erring on
a new unseen instance, x ∈ RN , that was drawn according to the same probability function
as the training instances. c is referred to as the target concept and the N coordinates are
called features. An important special case is when c has only a nite number of categorical
values. In this case we have a classication problem . In most of this dissertation we focus
on classication (except in chapter 4).
The conventional assessment of the quality of a learning process is the generalization
ability of the learned rule, i.e. how well it can predict the value of the target concept
on a new unseen instance. We focus on inductive batch 1

learning , and specically on the
Probably Approximately Correct (PAC) learning model [118, 13], where it is assumed that
the training instances are drawn iid according to a xed (but unknown) distribution D.
Thus the generalization ability of a classication rule h is measured by its generalization

error, which is the chance to err on a new unseen instance that was drawn according to the
same distribution D. Formally it is dened as follows:
eD (h) = P rx∼D (h (x) 6= c (x))
However, we usually cannot measure this quantity directly. Thus we look at the training
1 see section 1.1.5 for a short review of other common learning models
1.1. Machine Learning 3
error, where the training error of a classier h with respect to a training set S of size m is
the percentage of the training instances that h err on; formally dened as follows:
1 ∑( )
eS (h) = 1 − δh(xi ),c(xi )
m i
where xi is the i's training instance and δ is Kronecker's delta.
From the theoretic point of view, machine learning attempts to characterize what is
learnable, i.e. under which conditions a small training error guarantees a small general-
ization error. Such a characterization can give us insights into the theoretic limitations
of learning that any learner must obey. It is clear that if h is allowed to be any possible
function, there is no way to bound the gap between generalization error and training error.
Thus we have to introduce some restrictive conditions. The common assumption is that we
choose our hypothesis from a given class of hypotheses H. The classical result here is the
characterization of the learnable hypotheses classes in the PAC model [118, 13]. Loosely
speaking, a hypotheses class is PAC-learnable if there is an algorithm that ensures that the
gap between the generalization error and the training error is arbitrarily small when the
number of training instances is large enough. The characterization theorem states that a
class is learnable if and only if it has a nite VC-dimension. The VC-dimension is a com-
binatorial property which measures the complexity of the class. The bound is tighter when
the VC-dimension is smaller, i.e. when the hypotheses class is simpler. Thus, this result is
a demonstration of principle of Occam's razor: lex parsimoniae (law of succinctness):
entia non sunt multiplicanda praeter necessitatem,
which translates to:
entities should not be multiplied beyond necessity.
Which often rephrased as: The simplest explanation is the best one .
Another kind of bounds on the generalization error of a classier are data dependent
bounds [103, 9]. Whereas standard bounds depends only on the size of the training set,
data dependent bounds take advantage of the fact that some training sets are better than
others. That is, the better the training set, the better bounds it gives on the generalization
error. This way, the data give bounds which are tighter than the standard bounds. A
main component in data dependent bounds is the concept of margins. Generally speaking,
a margin [14, 100] is a geometric measure for evaluating the condence of a classier with
respect to its decision. We elaborate on the denition and usage of margins in chapter 3.
We make extensive use of data dependent bounds in our theoretical results in chapter 3 and
chapter 5.
Algorithmically speaking, machine learning tries to develop algorithms that can nd a
good rule (approximation of the target concept) using the given training examples. Ideally,
it is preferable to nd algorithms for which it is possible to prove certain guarantees on the
running time and the accuracy of the result rule. However, heuristic learning algorithms that
just work are also abundant in the eld, and sometimes in practice they work better than
the alternative provable algorithms. One possible algorithmic approach to (classication)
supervised learning is to build a probabilistic model for each class (using standard statistic
tools) and then classify a new instance into the class with highest likelihood. However,
since many dierent probabilistic models imply the same decision boundary, it can be easier
to learn the decision boundary directly (the discriminative approach). Two such classic
algorithms are the Perceptron [98] and the One Nearest Neighbor (1-NN) [32] that were
introduced half a century ago and are still popular. The Perceptron directly nds a hyper-
plane that correctly separates the training instances by going over the instances iteratively
and updating the hyper-plane direction whenever the current plane errs. 1-NN on the other
hand simply stores the training instances and then assigns a new instance with the label of
the closest training instance.
The most important algorithm for supervised learning in current machine learning is the
Support Vector Machine (SVM) [14, 116], which nds the hyper-plane that separates the
training instances with the largest possible margin, i.e., the separating plane which is as
far from the training instances as possible. By an implicit mapping of the input vectors
to a high dimension Hilbert space (using a kernel function) SVM provides a systematic
tool for nding non-linear separations using linear tools. Another prominent algorithm for
supervised classication is the AdaBoost [33], which uses a weak-learner to build a strong
classier. It builds a set of classiers by re-running the weak classier on the same training
set, but each time putting more weight on instances where the previous classiers erred.
Many other algorithms for supervised learning exist and I only mention some of the most
important ones.
In terms of applications machine learning tries to adapt algorithms to a specic task.
Learning algorithms are widely used for many tasks both in industry and in academia, e.g.,
face recognition, text classication, medical diagnosis and credit card fraud detection, just
to name a few. Despite the fact that many learning algorithms are general and can be
applied to any domain, out-of-the-box learning algorithms do not generally work very well
on given real-world problems. Aside from the required tuning, each domain has its own
specic diculties and further eort may be required to nd the right representation of
instances (see section 1.1.2).
1.1.2 Representation
A key question in machine learning is how to represent the instances. In most learning
models it is assumed that the instances are given as a vector in RN (where N is any nite
dimension), and the analysis starts from there. There is a general consensus that once
we have a good representation, most reasonable learning methods will perform well after a
reasonable tuning eort. On the other hand, if we choose a poor representation achieving
a good level of performance is hopeless. But how do we choose the best way to represent
an abstract object (e.g. image) by a vector of numbers? A good representation should
be compact and meaningful at the same time. Is there a general method to nd such a
representation? Choosing a representation means choosing a set of features to be measured
on each instance. In real life, this set of features is usually chosen by a human expert in
the relevant domain who has a good intuition of what might work. The question is whether
it is possible to nd algorithms that use the training sample (and possibly other external
knowledge) in order to nd a good representation automatically.
If the instances are physical entities (e.g. a human patients), choosing the features means
choosing which physical measurements to perform on them. In other cases the instances are
given as a vectors of numbers in the rst place (e.g. the gray level of pixels of digital image)
and then the task of nding good representation (i.e. good set of features) is the task of
nding a transformation that convert the original representation into a better one. This can
be done without using the labels, in unsuprevised manner (see section 1.1.5) or using the
labels. If the instances originally described by a large set of raw features, one way to tackle
this is by using dimensionality reduction. In dimensionality reduction we look for a small
set of functions (of the raw features) that capture the relevant information. In the context
of supervised learning, dimensionality reduction algorithms try to nd a small number of
functions that preserve the information on the labels.
Feature selection is a special form of dimensionality reduction, where we limit ourselves
to choosing only a subset out of the given set of raw features. While this might seems to
be strong limitation, feature selection and general dimensionality reduction are not that
dierent, considering that we can always rst generate many possible functions of the raw
features (e.g. many kinds of lters and local descriptors of an image) and then use feature
selection to choose only some of them. This process of generating complex features by
applying functions on the raw features is called feature extraction. Thus, in other words,
using feature extraction followed by feature selection we can get a general dimensionality
reduction. Nevertheless, in feature extraction we have to select which features to generate out
of the huge (or even innite) number of possible functions of the raw features. We tackle this
issue in chapter 5. As will be explained in more detail in the section 1.2, the main advantages
of feature selection over other dimensionality reduction methods are interpretabilty and
economy (as it saves the cost of measuring the non- selected features).
The holy grail is to nd a representation which is concise and allow classication by
a simple rule at the same time, since a simple classier over low dimension generalize well.
However, this is not always possible, and therefore the is a tradeo between conciseness the
representation and complexity of the classier. Dierent methods may choose dierent work
point of this tradeo. While dimensionality reduction methods focus on conciseness, SVM
2
(and other kernel machines) convert the data into a very sparse representation in order to
allow very simple (namely linear) classication rule and control overtting by maximizing
the margin. However, In chapter 2 we suggest that the ability of SVM to avoid overtting on
high dimensional data is limited to scenarios where the data is laying on a low dimensional
manifold.
2 See section 1.1.1 for a short description of SVM

Another related issue in data representation is the tradeo between the conciseness of the
representation and its potential prediction accuracy. One principled way for quantifying this
tradeo, known as the Information Bottleneck [111], is to measure both the complexity of the
model and its prediction accuracy by using Shannon's mutual information, measuring both
complexity and accuracy by bits. During my PhD studies I also took major part in research
on the relation between the Information Bottleneck framework and some classical problems
in Information Theory (IT) such as a version of source coding with side information presented
by Wyner, Ahlswede and Korner (WAK) [121, 3], Rate-Distortion and Cost-Capacity [102].
In this research we took advantage of the similarities to obtain new results and insights both
on the IB and the classical IT problems. These results are not included in this thesis due to
space limitations. They where published in [39] and are under review process for publication
in IEEE-IT.
1.1.3 Machine Learning, Articial Intelligence and Statistics
Machine learning can also be considered a broad sub-eld of Articial Intelligence (AI). AI
refers to the ability of an articial entity (usually a computer) to exhibit intelligence. While
AI in general tends to prompt profound philosophical issues, as a research eld in computer
science it is conned to dealing with the ability of a computer to perform a task that is
usually done by humans and is considered as a task that requires intelligence. A classical
AI approach tackles such a task by using a set of logical rules (which were designed by a
human expert) that constitute a ow chart. Such systems are usually referred to as Expert
Systems, Rule-Based Systems or Knowledge-Based Systems [51]. The rst such systems were
developed during the 1960s and 1970s and became very popular and applied commercially
during the 1980s [76]. On the other hand, as already mentioned, in machine learning the
computer program tries to derive the rules by itself using a set of input-output pairs (aka
training set ). Thus machine learning focuses on the learning process.

For example, take the task of text classication. Here the task is to classify free-language
texts into one or more classes from a predened set of classes. Each class can represent
a topic, an author or any other property. A rule-based system for such a task usually
consists of pre-dened rules that query the existence of some words or phrases in the text
to be classied. The rules are extracted by humans, i.e., the programmer together with
an expert in the relevant eld who knows how to predict which phrases characterize each
class. There are several main drawbacks to this approach. First, the task of dening the
set of rules requires a great deal of expert human work and becomes virtually impossible
for large systems. Moreover, even if we already have such a system that actually works,
it is very hard to maintain it. Imagine that we want to add a new class to a system with
a thousand rules. It is very hard to predict the eect of changing even one rule. This is
particularly crucial because the environment changes over time, e.g., new classes appear and
old classes disappear, the language changes (who used the word tsunami before December
26, 2004?), and it is very awkward to adapt such a system to such changes. The machine
learning approach overcomes the above drawbacks since the system is built automatically
from a labeled training sample and the rule can be adapted automatically with any feedback
(i.e., an additional labeled instance). However, machine learning raises other problems such
as the interpretability of the decision and the need for a large amount of labeled instances.
These issues will be discussed in more detail in the following.
Machine learning is also closely related to statistics. Both elds try to estimate an un-
known function from a nite sample; thus there is a large overlap between them. Indeed
many statisticians claim that some of the results that are considered new in machine learn-
ing are well known in the statistics community. However, while the formal framework might
be similar, the applicative point of view is very dierent. While classical statistics focuses
on hypothesis testing and density estimation, machine learning focuses on how to create
computer programs that can learn to perform an intelligent task. One additional dier-
ence is that machine learning, as sub-eld of computer science, is also concerned with the
algorithmic complexity of computational implementations.
1.1.4 Machine Learning and Neural Computation
The brain of any animal and especially the human brain is the best known learner. Over
the course of a lifetime the brain has to learn how to perform a huge number of new, com-
plicated tasks and has to solve computational problems continuously and instantaneously.
For example the brain has to resolve the visual input that comes through the visual system
and produce the correct scene. This involves such problems as segmentation (which pixel
belongs to which object), detection of objects and estimation of objects' movements. Cor-
rect resolving is crucial for the creature's survival. There are several dierent approaches
to study the brain as a computational machine. One is to look inside the brain and try to
see how it works (Physiology); another is to try to build an abstract model of the brain
(regardless of the specic hardware) that ts the observations and can predict the brain's
behavior (Cognitive Psychology). The common sense behind using machine learning for
brain research is the belief that if we are able build a machine that can deal with the same
computational problems the brain has to solve, it will teach us something about how the
brain works. As regards theory, machine learning can provide the limitations on learning
process that any learner must obey , including the brain.
1.1.5 Other Learning Models

In life, dierent learning tasks can be very dierent from each other in many respects,
including the kind of input/feedback we get (instances alone or instances with some hints)
, the way we are tested (during the learning process or only at the end, number of allowed
mistakes) and the level of control we have over the learning process (can we aect the world or
just observe it?). Thus, many dierent models have been suggested for formalizing learning.
As already explained, this work focuses on passive-inductive-batch-supervised learning that
was presented in section 1.1.1. However, we refer to some other models as well. For this
reason and in order to give a more complete picture of the machine learning eld I briey
overview the other main models and point out the dierences between them.
Supervised vs. Unsupervised Learning
As already mentioned, in supervised learning the task is to learn an unknown concept c
from examples, where each instance x comes together with the value of the target function
for this instance, c (x). Thus the task is to nd an approximation of c, and performance
is measured by the quality of the approximation for points that did not appear in the
training set. On the other hand, in unsupervised learning the training sample includes
only instances, without any labels, and the task is to nd interesting structures in the
data. Typical approaches to unsupervised learning include clustering, building probabilistic

generative models and nding meaningful transformations of the data. Clustering algorithms
try to cluster the instances into a few clusters such that instances inside the same cluster
are similar and instances in two dierent clusters are dierent. A classical clustering
algorithm is the k -means [77] which clusters instances according to the Euclidean distance.
It starts with random locations of k cluster centers and then iteratively assigns instances
to the cluster of the closest center, and updates the centers' locations to the center of
gravity of the assigned instances. Clustering is also one of the more obvious applications
of the Information-Bottleneck [111], which was already mentioned in section 1.1.2. The
Information-Bottleneck is an information theoretical approach for extracting the relevant
information in one variable with respect to another variable, and thus can be used for
nding a clustering of one variable that preserves the maximum information on the other
variable. Generative models methods assume that the data were drawn from a distribution
of a certain form (e.g. mixture of Gaussians) and looks for the parameters that maximize
the likelihood of the data. The prominent algorithm here is the Expectation Maximization
(EM) [26] family of algorithms. Principal Component Analysis (PCA) [56] is a classic
example of an algorithm that looks for an interesting transformation of the data. PCA
nds the linear transformation into a given dimension that preserves the maximal possible
variance. Other prominent algorithms of this type are Multidimensional Scaling (MDS) [67],
Projection Pursuit [35, 50, 36, 57], Independent Component Analysis (ICA) [11] and the
newer Local Linear Embedding (LLE) [99].
Inductive vs transductive
Supervised learning can be divided into inductive learning and transductive learning [117].
In inductive learning we want our classier to perform well on any instance that was drawn
from the same distribution as the training set. On the other hand, in transductive learning it
is assumed that the test instances were known at training time (only the instances, not their
labels of course), and thus we want to nd a classier that performs well on these predened
test instances alone. These two models comply with dierent real life tasks. The dierence
can be illustrated on the task of text classication. Assume that we want to build a system
that will classify emails that will arrive in the future using a training set of classied emails
we received in the past. This ts the inductive learning model, since we do not have the
future emails at the training stage. On the other hand if we have an archive of a million
documents, where only one thousand of them are classied and we want to use this subset
to build a classier for classifying the rest of the archive, this ts the transductive model.
In other words, transductive learning ts situations where we have the test questions while
we are studying for the exam.
Batch vs. Online Learning

We can also divide supervised learning into batch learning and online learning [72]. In batch
learning it is assumed that the learning phase is separate from the testing phase, i.e., we
rst get a batch of labeled instances, we use them to learn a model (e.g. classier) and then
we use this model for making predictions on new instances. When analyzing batch models
it is usually assumed that the instances were drawn iid from an underlying distribution and
that the test instances will be drawn from the same distribution. The accuracy of a model
is dened as the expected accuracy with respect to this distribution. However, this model
does not t many real life problems, where we have to learn on the go and do not have
a sterile learning phase, i.e., we must be able to make predictions from the very beginning
and we pay for errors in the learning phase as well. Online learning assumes that we get
the instances one by one. We rst get the instance without a label and have to make a
prediction. Then we get the real label, and suer a loss if our prediction was wrong. Now
we have the chance to update our model before getting another instance and so on. In
analyzing online algorithms, performance is measured by the cumulative loss (i.e., the sum
of losses we suered so far), and the target is to achieve a loss which is not much more
than the loss made by the best model (out of the class of models we work with). Thus, in
online analysis, the performance measure is relative, and it is not necessary to assume any
assumptions on the way that the instances were produced. See [25] for a discussion on the
dierences, analogies and conversions between batch and online analysis.
Passive vs. Active Learning

In passive learning, the learner can only observe the instances given by the teacher. On the
other hand, in active learning models the learner has the opportunity to guide the learning
process in some way. Dierent active learning models dier in the way the learner can
aect the learning process (i.e. the instances s/he will get). For example in membership
query [6] model the learner can generate instance by him or herself and ask the teacher
for their label. This may be problematic in some domains, as the learner can generate
meaningless instances [68]. Another example is the selective sampling model [21]. Here the
learner observes the instances given by the teacher and decides which of them s/he wants
to get the label for. This model is most useful when it is cheap to get instances, but it
is expensive to label them. It can be shown that under some conditions active learning
can reduce exponentially the number of labels required to ensure a given level of accuracy
(compared to passive learning) ([34]). See [38] for an overview on active learning.
Reinforcement Learning
In many real life situations things are not so simple. The game is not just to predict some
label. Sometimes you are required to choose from a set of many actions, and you may be
rewarded or be penalized for your action and the action may aect the environment. The
Reinforcement learning (see e.g. [59]) model tries to capture the above observations and
to present a more complete model. It is assumed that the world has a state. At each
step the learner (aka agent in this context) takes an action which is chosen out of a set of
possible actions. The action may aect the world state. In each step the agent gets a reward
which depends on the world state and the chosen action. The agent's goal is to maximize
the cumulative reward or discounted reward, where discounted reward put more emphasis
on the reward that will be given in the near future. The world's current state may be
known to the agent (the fully observed model) or hidden (the partially observed model).
When the world's state is hidden, the agent can only observe some noisy measurement of
it. Under some Markovian assumptions it is possible to nd a good strategy eciently.
The reinforcement learning model is popular in the context of robotics, where people try to
create a robot that can behave reasonably in a changing environment.
Inductive transfer
Another observation about human learning is that a human does not learn isolated tasks,
but rather learns many tasks in parallel or sequentially. Inductive transfer [10, 110, 16]
is a learning model that captures this kind of characteristic of human learning. Thus, in
inductive transfer (a.k.a learning to learn, transfer learning or task2task ) one tries to use
knowledge gained in previous tasks in order to enhance performance on the current task.
In this context, the main question is the kind of knowledge we can transfer between tasks.
This is not trivial, as it is not clear how we can use, for an example, images which are
labeled cat or not cat in order to learn to classify images to table or not table. One
option, which is very popular, is to share the knowledge about the representation, i.e. to
assume that the same kind of features or distance measure is good for all the classes. For
example, [110] uses the previous tasks to learn a distance measure between instances. This
is done by constructing a training set of pairs of instances, where a pair is labeled 1 if and
only if we can be sure that they have the same label (i.e. they are from the same task and
have a positive label for the class of interest in this task). This make sense as the goal is
to nd a distance measure for which instances of the same class are close and instances of
dierent classes are well apart. They use Neural Network to learn the distance measure
from the pairs' training set. Another option, which interests us, is to share the knowledge
on which features are useful. We elaborate on this in chapter 5, where we show how our
new approach to feature selection can be applied to inductive transfer. Several authors have
noted that sometimes transferring knowledge can hurt performance on the target problem.
Two problems are considered as related if the transfer improves the performance and non-
related otherwise. However, it is clear that this notion is not well dened as the behavior
depends on the kind of information we choose to transfer, or more generally on the learning
algorithm [16].
1.2 Feature Selection

In many supervised learning tasks the input is represented by a very large number of features,
many of which are not needed for predicting the labels. Feature selection (variously known
as subset selection, attribute selection or variable selection ) is the task of choosing a small
subset of features that is sucient to predict the target labels well. The four main reasons
to use feature selection are:
1. Reduced computational complexity. Feature selection reduces the computational
complexity of learning and prediction algorithms . Many popular learning algorithms
become computationally intractable in the presence of huge numbers of features, both
in the training step and in the prediction step. A preceding step of feature selection
can solve the problem.
2. Economy. Feature selection saves on the cost of measuring non selected features .
Once we have found a small set of features that allows good prediction of the labels,
we do not have to measure the rest of the features any more. Thus, in the prediction
stage we only have to measure a few features for each instance. Imagine that we want
to predict whether a patient has a specic disease using the results of medical checks.
There are a huge number of possible medical checks that might be predictive; let's
say that there are 1000 potential checks and that each of them costs ten dollars to
perform. If we can nd a subset of only 10 features that allows good performance, it
saves a lot of money, and may turn the whole thing from an infeasible into a feasible
procedure.
3. Improved accuracy. In many situations, feature selection can also enhance predic-
tion accuracy by improving the signal to noise ratio. Even state-of-the-art learning
algorithms cannot overcome the presence of a huge number of irrelevant or weakly
relevant features. On the other hand once a small set of good features has been found,
even very simple learning algorithms may yield good performance. Thus, in such
situations, an initial step of feature selection may improve accuracy dramatically.
4. Problem understanding. Another benet of feature selection is that the identity of

1.2. Feature Selection 15
the selected features can provide insights into the nature of the problem at hand. This
is signicant since in many cases the ability to point out the most informative features
is more important than the ability to make a good prediction in itself. Imagine we are
trying to predict whether a person has a specic type of cancer using gene expression
data. While we can know whether the individual is sick or not in other ways, the
identity of the genes which are informative for prediction may give us a clue to the
disease mechanism involved, and help in developing drugs.
Thus feature selection is an important step in ecient learning of large multi-featured data
sets. Regarding reasons 2 and 4 above, feature selection has an advantage over other general
dimensionality reduction methods. Computing general functions of all the input features
means that we must always measure all the features rst, even if in the end we calculate
only a few functions of them. In many problems, dierent features vary in terms of their
nature and their units (e.g. body temperature in Celsius, yearly income in dollars). In these
cases feature selection is the natural formulation and it enables a better interpretation of
the results, as the meaning of the combination of the features is not very clear.
Many works dene the task of feature selection is detecting which features are relevant
and which are irrelevant. In this context we need to dene relevancy. This is not straight-
forward, because the eect of a feature depends on which other features we have selected.
Almuallim and Dietterich [4] provide a denition of relevance for the binary noise-free case.
They dene a feature to be relevant if it appears in any logical formula that describes the
target concept. Gennari et al. [37] suggest a probabilistic denition of relevancy, which de-
nes a feature to be relevant if it aects the conditional distribution of the labels. John et
al [55, 62] dene the notion of strong and weak relevance of a feature. Roughly speaking,
a strongly relevant feature is a useful feature that cannot be replaced by any other feature
(or set of features) whereas a weakly relevant feature is a feature which is useful, but can be
replaced by another feature (or set of features). They also show that, in general, relevance
does not imply optimality and that optimality does not imply relevance. Moreover, as we
show in chapter 2, even if a feature is relevant and useful, it may detract in the presence
of other features. Thus, when the concern is prediction accuracy, the best denition of the
task of feature selection as choosing a small subset of features that allows for good predic-
tion of the target labels. When the goal is problem understanding (reason 4 above), the
relevancy of each feature alone might be important. However, in real life problems features
are rarely completely irrelevant and thus it might be better to inquire about the importance
of each feature. Cohen et al. [20] suggested the Shapley value of a feature as a measure of
its importance. The Shapley value is a quantity taken from game theory; where it is used
to measure the importance of a player in a cooperative game.
Below I review the main feature selection paradigms and some of the immense number
of selection algorithms that have been presented in the past. However, a complete review of
feature selection methods is beyond the scope of this thesis. See [45] or [43] for a comprehen-
sive overview of feature selection methodologies. For a review of (linear) feature selection
for regression see [82].
1.2.1 Common paradigms for feature selection
Many dierent algorithms for the task of feature selection have been suggested over the last
few decades both in the Statistics and in the Learning community. Dierent algorithms
present dierent conceptual frameworks. However, in the most common selection paradigm
an evaluation function is used to assign scores to subsets of features and a search algo-
rithm is used to search for a subset with a high score. See gure 1.1 for an illustration of
this paradigm. Dierent selection methods can be dierent both in the choice of evaluation
function and the search method. The evaluation function can be based on the performance
of a specic predictor ( wrapper model, [62]) or on some general (typically cheaper to com-
pute) relevance measure of the features to the prediction ( lter model, [62]). The evaluation
function is not necessarily a black box and in many cases the search method can use
information on the evaluation function in order to perform an ecient search.
In most common wrappers, the quality of a given set of features is evaluated by testing
the predictor performance on a validation set. The main advantage of a wrapper is that
you optimize what really interest you - the predictor accuracy. The main drawback of
such methods is their computational deciency that limits the number of sets that can be
Initial Selected
feature set features
Search
method
score feature
subset
Evaluation
function
Figure 1.1: The most common selection paradigm: An evaluation function is used to
evaluate the quality of subsets of features and a search engine is used for nding a subset
with high score.
evaluated. The computational complexity of wrappers is high since we have to re-train
the predictor in each step. Common lters use quantities like conditional Variance (of
the features given the labels), Correlation Coecients or Mutual Information as a general
measure. In any case (wrapper or lter), an exhaustive search over all feature sets is generally
intractable due to the exponentially large number of possible sets. Therefore, search methods
are employed which apply a variety of heuristics. Two classic search methods are [78]:
1. Forward selection: start with an empty set of features and greedily add features one
at a time. In each step, the feature that produces the larger increase of the evaluation
function (with respect to the value of the current set) is added.
2. Backward Elimination: Start with a set of features that contains all the features
and greedily remove features one at a time. In each step the feature whose removal
results in the larger increase (or smaller decrease) in the evaluation function value is
removed.
Backward elimination has the advantage that when it evaluates the contribution of a feature
it takes into consideration all the other potential features. On the other hand, in forward
selection, a feature that was added at one point can become useless later on and vice versa.
However, since evaluating small sets of features is usually faster than evaluating large sets,
forward selection is much faster when we are looking for small number of features to select.
Moreover, if the initial number of features is very large, backward elimination become infea-
sible. A combination of the two is also possible of course, and has been used in many works.
Many other search methods such as stochastic hill climbing, random permutation [87] and
genetic algorithms [48] can be used as search methods as well.
I turn now to review a few examples of some of the more famous feature selection
algorithms. Almuallim and Dietterich [4] developed the FOCUS family of algorithms that
performs an exhaustive search in the situation where both the features and labels are binary
(or at least have small number of possible values). All feature subsets of increasing size are
evaluated, until a sucient set is found. A set is sucient if there are no conicts, i.e. if
there is no pair of instances with the same feature values but dierent labels. They also
presented heuristic versions of FOCUS that look only at promising subsets, and thus make
it feasible when the number of features is larger. Kira and Rendell [61] presented the RELIEF
algorithm, a distance based lter that uses a 1-Nearest Neighbor (1-NN) based evaluation
function (see chapter 3 for more details). Koller and Shamai [64] proposed a selection
algorithm that uses a Markov blanket [89]. Their algorithm uses backward elimination,
where a feature is removed if it is possible to nd a Markov blanket for it, or in other words, if
it possible to nd a set of features such that the feature is independent of the labels given this
set. Pfahringer [91] presented a selection method which is based on the Minimum Description
Length principle [97]. Vafaie and De Jong [115], used the average Euclidean distance between
instances in dierent classes as an evaluation function and genetic algorithms [48] as a search
method.
The above algorithms are all lters. The following are examples of wrappers. Aha and
Bankert [1] used a wrapper approach for instance-based learning algorithms (instance-based
algorithms [2] are extensions of 1-Nearest Neighbor [32, 22]) combined with a beam search
strategy, using a kind of backward elimination. Instead of starting with an empty or a full
set, they select a xed number of subsets (with a bias toward small sets) and start with the
best one among them. Then they apply backward elimination while maintaining a queue
of features which is ordered according to the contribution of a feature the last time it was
evaluated. Skalak [106] used a wrapper for the same kind of classiers, but with random
permutation instead of deterministic search. Caruana and Freitag [17] developed a wrapper
for decision trees [93, 94] with a search which is based on adding and removing features
randomly, but also removing in each step all the features which were not included in the
induced tree. Bala et al. [8] and Cherkauer and Shavlik [19] used a wrapper for decision
trees together with genetic algorithms as search strategies. Weston et al. [120] developed a
sophisticated wrapper for SVMs. Their algorithm minimizes a bound on the error of SVM
and uses gradient descent to fasten the search, but it still has to solve the SVM quadratic
optimization problem in each step, and in this sense it is a wrapper.
Individual feature ranking
Other feature selection methods simply rank individual features by assigning a score to each
feature independently. These methods are usually very fast, but inevitably fail in situations
where only a combined set of features is predictive of the target function. Two of the most
common rankers are:
1. Infogain [93], which assigns to each feature the Mutual Information between the feature
value and the label, considering both the feature and the label as random variables and
using any method to estimate the joint probability. Formally, let P be an estimation
of the joint probability for a feature, i.e., Pij represents the probability that both f
equal its i's possible value and the label is j, then we get the following formula for the
infogain of the feature f:
∑ Pij
IG (f ) = Pij log (∑ ∑ )
i,j i Pij j Pij
The most common way to estimate P is by simply counting the percent of the in-
stances that present each value pair (the empiric distribution) with some kind of
zero correction (e.g., adding a small constant to each value, and re-normalizing).
2. Correlation Coecients (corrcoef ), which assign each feature the Pearson correlation
between the vector of values the feature got in the training set (v ) and the vector of
the labels (y ) [92] i.e.,
E (v − E (y) (y − E (y)))
CC (f ) = √
V ar (v)V ar (y)
where the expectation and the variance are estimated empirically of course.
Corrcoef has the advantage (over infogain) that it can be used with continuous values (of
features or the label) without any need of quantization. On the other hand, Infogain has
the advantage that it does not assume any geometric meaning of the values, and thus can
work for any kind of features and label, even if they are not numerical values.
Many other individual feature rankers have been proposed. Among them are methods
which use direct calculations on the true-positive, true-negative, false-negative and false-
positive which are obtained using the given features. Other methods use dierent kinds
of distance measures between distributions to calculate the distance between the empirical
joint distribution and the expected one (assuming independency between the feature and
the label). Infogain is an example of such a method, with Kullback & Leibler divergence
(DKL ) as the distance measure. Other methods use statistical tests including the chi-square
([73, 74]), t-test ([107]) or the ratio of between-class variance to within-class variance, known
as the Fisher criterion ([29]). See [45], chapter 3 for a detailed overview of individual feature
ranking methods.
Embedded approach
The term Embedded methods is usually used to describe selection which is done automatically
by the learning algorithm. Note that almost any wrapper can be considered as an embedded
method, because selection and learning are interleaved. Decision tree [93, 94] learning can
also be considered to be an embedded method, as the construction of the tree and the
selection of the features are interleaved, but the selection of the feature in each step is
usually done by a simple lter or ranker. Other authors use the term embedded method
to refer only to methods where the selection is done by the learning algorithm implicitly. A
prominent example of such a method is the L1-SVM [86].

1.2.2 The Biological/Neuroscience Rationale
Feature selection is carried out by many biological systems. A sensory system is exposed to
an innite number of possible features, but has to select only some of them. For example,
the number of potential details in a visual scene is innite and only some (albeit a large
number) of them are perceived, even in the low level of the reticulum and V1. Moreover,
many sensory systems perform feature selection as part of the processing of the data. For
example, higher levels in the visual path have to choose from among the huge number of
features that the receptive elds in V1 produce. Ullman et al. [114] state that the problem
of feature selection is a fundamental question in the study of visual processing. They show
that intermediate complexity (IC) features are the optimal choice for classication. Serre et
al. [101] show that selecting object class-specic features improves object recognition ability
in a model of biological vision. This suggests that feature selection in biological systems is
not something that done only once, but rather is a dynamic process which is a crucial part
of the never-ending learning process of a living creature.
Another example is selecting relevant spectral cues in the auditory system. When a sound
is perceived by the ear, the level of modulation of each frequency depends on the direction
the sound came from. These changes are called spectral cues and the auditory system of
a human uses them to determine the elevation of the sound source [47]. Feature selection
occurs in the olfactory pathway as well. As described by Mori et al. [83], the mammalian
olfactory sensory neurons detect many dierent odor molecules and send information (using
their axons) to the olfactory bulb (OB). Each glomerulus in the OB is connected to sensory
neurons of the same type and can be considered as representing a feature. Lateral inhibition
among glomerular modules is a kind of feature selection mechanism. Perera et al. [90]
developed a feature selection method which is inspired by the olfactory system. This is
an example of something that happens frequently, where researchers adopt ideas from the
biological system when developing computerized algorithms.

Feature Selection in service of biological research
Nowadays biological experiments produce huge amounts of data that cannot be analyzed
without the aid of computerized tools. Feature selection is a useful tool for data analysis.
For example, assume that we want to measure the activity of many neurons during a given
task and then produce a set of features that represents many dierent kinds of measurements
of the neurons' activity. Now we can use feature selection to nd which kind of neuron and
which kind of activity characteristic are most informative for predicting the behavior. We
demonstrate such use of feature selection in chapter 4, where we use feature selection on
data recorded from the motor cortex of a behaving monkey.
1.3 Notation
Although specic notation is introduced in the chapters as needed, the following general
notation is used throughout this thesis. Vectors in RN are denoted by boldface small letters
(e.g. x, w). Scalars are denoted by small letters (e.g. x, y ). The i'th element of a vector x is
denoted by xi . log is the base 2 logarithm and ln is the natural logarithm. m is the number
of training instances. N denotes the number of all potential features while n denotes the
number of selected features. Subsets of features are denoted by F and a specic feature is
usually denoted by f. A labeled training set of size m is denoted by Sm (or by S when m is
clear from the context). The table below summarizes the main notation we keep along the
thesis for a quick reference.

1.3. Notation 23
Notation short description
m number of training instances
N total number of features
n number of selected features
F set of features
f a feature
w a weight vector over the features
Sm training set of size m
c the target classication rule
h a hypothesis
D a distribution over instances (and labels) space
ˆ γS
er (h) γ- sensitive training error of hypothesis h
erD (h) generalization error of hypothesis h w.r.t. D

Chapter 2
Is Feature Selection Still
Necessary?1
As someone who has researched feature selection for the last several years, I have often heard
people doubt its relevancy with the following argument: a good learning algorithm should
know to handle irrelevant or redundant features correctly and thus an articial split of the
learning process into two stages imposes excessive limits and can only impede performance.
There are several good rebuttals to this claim. First, as explained in section 1.2, improved
classication accuracy is not the only rationale for feature selection. Feature selection is also
motivated by improved computational complexity, economy and problem understanding.
Second, research on feature selection is a part of the broader issue of data representation.
However, to provide a complete answer, in this chapter we directly discuss the issue of
whether there is a need to do feature selection solely for improving classication accuracy
(ignoring other considerations), even when using current state-of-art learning algorithms.
Obviously, if the true statistical model is known, or if the sample is unlimited (and the
number of features is nite), any additional feature can only improve accuracy. However,
when the training set is nite, additional features can degrade the performance of many
1 The results presented in this chapter were rst presented as a chapter in the book Subspace, Latent
Structure and Feature Selection, edited by Saunders, C. and Grobelnik, M. and Gunn, S. and Shawe-Taylor,
J. [84]
24
25
classiers, even when all the features are statistically independent and carry information
on the label. This phenomenon is sometimes called the peaking phenomenon and was
already demonstrated more than three decades ago by [54, 113, 96] and in other works
(see references there) on the classication problem of two Gaussian-distributed classes with
equal covariance matrices (LDA). Recently, [49] analyzed this phenomenon for the case where
the covariance matrices are dierent (QDA); however, this analysis is limited to the case
where all the features have equal contributions. On the other hand [53] showed that, in the
Bayesian setting, the optimal Bayes classier can only benet from using additional features.
However, using the optimal Bayes classier is usually not practical due to its computational
cost and the fact that the true prior over the classiers is not known. In their discussion, [53]
raised the problem of designing classication algorithms which are computationally ecient
and robust with respect to the feature space. Now, three decades later, it is worth inquiring
whether today's state-of-the-art classiers, such as Support Vector Machine (SVM) [14],
achieve this goal.
Here we re-visit the two-Gaussian classication problem, and concentrate on a simple
setting of two spherical Gaussians. We present a new, simple analysis of the optimal number
of features as a function of the training set size. We consider the maximum likelihood
estimation as the underlying classication rule. We analyze its error as function of the
number of features and number of training instances, and show that while the error may
be as bad as chance when using too many features, it approaches the optimal error if we
chose the number of features wisely. We also explicitly nd the optimal number of features
as a function of the training set size for a few specic examples. We test SVM empirically
in this setting and show that its performance matches the predictions of the analysis. This
suggests that feature selection is still a crucial component in designing an accurate classier,
even when modern discriminative classiers are used, and even if computational constraints
or measuring costs are not an issue.

26 Chapter 2. Is Feature Selection Still Necessary?
2.1 Problem Setting and Notation

First, let us introduce some notation. Vectors in RN are denoted by bold face lower case
letter (e.g. x, µ) and the j 'th coordinate of a vector x is denoted by xj . We denote the
restriction of a vector x to the rst n coordinates by xn .
Assume that we have two classes in RN , labeled +1 and −1. The distribution of the
points in the positive class is N ormal (µ, Σ = I) and the distribution of the points in the
negative class is N ormal (−µ, Σ = I), where µ ∈ RN , and I is the N ×N unit matrix.
To simplify notation we assume, without loss of generality, that the coordinates of µ are
ordered in descending order of their absolute value and that µ1 6= 0; thus if we choose to use
only n<N features, the best choice would be the rst n coordinates. The optimal classier,
i.e., the one that achieves the maximal accuracy, is h (x) = sign (µ · x). If we are restricted
to using only the rst n features, the optimal classier is h (x, n) = sign (µn · xn ).
We assume that the model is known to the learner (i.e. that there two antipode spherical
Gaussian classes) , but the model parameter, µ, is not known. Thus, in order to analyze
this setting we have to consider a specic way to estimate µ from a training sample Sm =
{ }
i m
xi , y i=1
, where xi ∈ RN and y i ∈ {+1, −1} is the label of xi . We consider the maximum
likelihood estimator of µ:
1 ∑ i i
m
µ̂ = µ̂ (S m ) = yx
m i=1
Thus the estimated classier is ĥ (x) = sign (µ̂ · x). For a given µ̂ and number of features
n, we look on the generalization error of this classier:
error (µ̂, n) = P (sign (µ̂n · xn ) 6= y)
where y is the true label of x. This error depends on the training set. Thus, for a given
training set size m, we are interested in the average error over all the possible choices of a
sample of size m:
error (m, n) = ES m error (µ̂ (Sm ) , n) (2.1)
We look for the optimal number of features, i.e. the value of n that minimizes this error:
nopt = arg min error (m, n)

n
2.2. Analysis 27
2.2 Analysis
For a given µ̂, and n the dot product µ̂n · xn is a Normal random variable on its own and
therefore the generalization error can be explicitly written as (using the symmetry of this
setting):
(
)
Ex (µ̂n · xn )
error (µ̂, n) = P ( µ̂n · xn < 0| + 1) = Φ − √ (2.2)
Vx (µ̂n · xn )
1
∫ a − 1 z2
here Φ is the Gaussian cumulative density function: Φ (a) = √ e 2 dz . We denote
2π −∞
by Ex and Vx expectation and variance with respect to the true distribution of x.
For a given number of features n, and a given sample S, we have
∑
n
Ex (µ̂n · xn ) = µ̂n · µn = µ̂j µj
j=1
and
∑
n ∑
n
Vx (µ̂n · xn ) = µ̂2j Vx (xj ) = µ̂2j
j=1 j=1
substituting in equation (2.2) we get:

 ∑ 
n
µ̂ j µ j
error (µ̂, n) = Φ − √∑ 
j=1
(2.3)
n 2
j=1 j µ̂
Now, for a given training set size m, We want to nd n that minimizes the average error
term ES m (error (µ̂, n)), but instead we look for n that minimizes an approximation of the
average error:
 )
(∑
n
 ES m µ̂j µj  j=1
nopt = arg min Φ 
 − √ (

) (2.4)
n ∑n 2
ES m j=1 µ̂j
We rst have to justify why the above term approximates the average error. We look at
the variance of the relevant terms (the numerator and the term in the square root in the
denominator). µ̂j is a Normal random variable with expectation µj and variance 1/m, thus
 
∑ n ∑n
1 ∑ 2
n
VS m  µ̂j µj  = µ2j VS m (µ̂j ) = µ −−−−→ 0 (2.5)
j=1 j=1
m j=1 j m→∞
 
∑
n
2n 4 ∑ 2
n
VS m  µ̂2j  = + µ −−−−→ 0 (2.6)
j=1
m2 m α=1 j m→∞
( ) ( )
where in the last equality we used the fact that if Z ∼ N ormal µ, σ 2 , then V Z2 =
2σ 4 + 4σ 2 µ2 . We also note that:

 
∑
n n (
∑ )
ES m  µ̂2j  =
2
VS m (µ̂j ) + E (µ̂j )
j=1 j=1
∑n ( ) ∑n
1
= 2
µj + −−−−→ µ2j > 0
j=1
m m→∞
j=1
therefore the denominator is not zero for any value of m (including the limit m → ∞).
Combining this with (2.5) and (2.6) and recalling that the derivative of Φ is bounded by
1, we conclude that at least for a large enough m, it is a good approximation to move the
expectation inside the error term. However, we need to be careful with taking m to ∞, as
we know that the problem becomes trivial when the model is known, i.e., when n is nite
and m → ∞. Additionally, in our analysis we also want to take the limit n → ∞, and thus
we should be careful with the order of taking the limits. In the next section we see that the
∑n
case of bounded j=1 µ2j is of special interest. In this situation, for any ² > 0, there are
nite m0 and n0 such that the gap between the true error and the approximation is less than
² with a high probability for any (m, n) that satises m > m0 and n > n0 (including the
limit n → ∞). Thus, for any m > m0 , the analysis of the optimal number of features nopt
(under the constraint n > n0 ) as a function of m using the approximation is valid. When we
are interested in showing that too many features can hurt the performance, the constraint
n > n0 is not problematic. Moreover, gure 2.1 shows numerically that for various choices
∑n
of µ (including choices such that j=1 µ2j −−−−→ ∞), moving the expectation inside the
n→∞
error term is indeed justied, even for small value of m.
Now we turn to nding the n that minimizes (2.4), as function of the training set size
m. This is equivalent to nding n that maximizes

(∑ )
n ∑n ∑n
ES m j=1 µ̂j µj
2
µ2j
j=1 µj j=1
f (n, m) = √ (∑ ) = √∑n ( 2 ) = √n ∑n (2.7)
1
ES m
n 2 j=1 µj + + µ2j
j=1 µ̂j m m j=1
2.2.1 Observations on The Optimal Number of Features

√∑
n
First, we can see that, for any nite n, when m → ∞, f (n, m) reduces to j=1 µ2j
and thus, as expected, using all the features maximizes it. It is also clear that adding a
2.2. Analysis 29
(a) (b) (c)

0.5 0.4 0.2
0.4 0.15
0.3
0.3 0.1
0.2
0.2 0.05
0.1 0.1 0
0 50 100 0 50 100 0 200 400 600
0.2 0.2 0.35
0.15
true error 0.15 0.3
E inside
0.1 0.1 0.25
0.05 0.05 0.2
0 0 0.15
0 10 20 0 10 20 0 20 40 60
(d) (e) (f)
Figure 2.1: Numerical justication for using equation 2.4. The true and
approximate error (E-inside) as function of the number of features used for dierent
√ √
choices of µ: (a) µj = 1/ 2j , (b) µj = 1/j , (c) µj = 1/ j , (d) µj = 1, (e) µj = rand
(sorted) and (f ) µj = rand/j (sorted). The training set size here is m = 16. The true error
was estimated by averaging over 200 repeats. The approximation with the expectation
inside error term is very close to the actual error term (2.1), even for small training set
(m = 16).
completely non-informative feature µj (µj = 0) will decrease f (n, m). We can also formulate
a sucient condition for the situation where using too many features is harmful, and thus
feature selection can improve the accuracy dramatically:
∑n
Statement 1 For the above setting, if the partial sum series sn = j=1 µ2j < ∞ then for
any nite m the error of the ML classier approaches to 1/2 when n → ∞ and there is
n0 = n0 (m) < ∞ such that selecting the rst n0 features is superior to selecting k features
for any k > n0 .
Proof. Denote limn→∞ sn = s < ∞, then the numerator of (2.7) approaches s, while the
denominator approaches ∞, thus f (n, m) −−−−→ 0 and the error Φ (−f (n, m)) → 1/2. On
n→∞
the other hand f (n, m) > 0 for any nite n, thus there exists n0 such that f (n0 , m) >
f (k, m) for any k > n0 .

Note that it is not possible to replace the condition in the above statement by µj −−−→ 0
j→∞
(see example 2.4). The following consistency statement gives a sucient condition on the
number of features that ensure asymptotic optimal error:
Statement 2 For the above setting, if we use a number of features n = n (m) that satises
−−−→ ∞ and (2)
(1) n −m→∞ n
m −−−−→ 0
m→∞
then the error of the ML estimator approaches to
the optimal possible error (i.e. the error when µ is known and we use all the features)
∑n
when m → ∞. Additionally, if j=1 µj −−−−→ ∞,
n→∞
condition (2) above can be replaced with
m −
n
−−−→ c, where c is any nite constant.
m→∞
( √∑ )
∞
Proof. Recalling that the optimal possible error is given by Φ − j=1 µ j , the statement
follows directly from equation 2.7.
Corollary 2.1 Using the optimal number of features ensure consistency (in the sense of the
above statement).
Note as well that the eect of adding a feature depends not only on its value, but also
on the current value of the numerator and the denominator. In other words, the decision
whether to add a feature depends on the properties of the features we have added so far.
This may be surprising, as the features here are statistically independent. This apparent
dependency comes intuitively from the signal-to-noise ratio of the new feature to the existing
ones.
Another observation is that if all the features are equal, i.e. µj = c where c is a constant,
c2 √
f (n, m) = √ n
1
m + c2
and thus using all the features is always optimal. In this respect our situation is dierent
from the one analyzed by [54]. They considered Anderson's W classication statistic [5]
for the setting of two Gaussians with same covariance matrix, but both the mean and the
covariance were not known. For this setting they show that when all the features have
equal contributions, it is optimal to use m−1 features (where m is the number of training
examples).
2.3. Specic Choices of µ 31
2.3 Specic Choices of µ

Now we nd the optimal number of features for a few specic choices of µ.
( )
Example 2.1 Let µj = √1
2j
, i.e., µ = √1 , √1 , √1 , . . . , √1
2 4 8 2N
, thus kµk −n→∞
−−−→ 1. An
illustration of the density functions for this case is given in gure 2.2(a). Substituting this
µ in (2.7), we obtain:
1 − 21 2−n
f (n, m) = √
n
m + 1 − 21 2−n
Taking the derivative with respect to n and equating to zero we get2 :

( )
1 −2n −n n 1 1
− 2 ln 2 + 2 ln 2 1 + + − =0
4 m 2m m
Assuming n is large, we ignore the rst term and get:
1 n 1
m= 2 −n−
ln 2 2 ln 2
Ignoring the last two lower order terms, we have:
1 ln 2
√ ∼= √ n = (ln 2) µn
m 2
√
This makes sense as 1/ m is the standard deviation of µj , so the above equation says that
we only want to take features with a mean larger than the standard deviation. However, we
should note that this is true only in order of magnitude. No matter how small the rst
feature is, it is worth to take it. Thus, we have no hope to be able to nd optimal criterion
√
of the form: take feature j only if µj > f ( m).
Example 2.2 Let µj = 1/j . An illustration of the density functions for this case is given
∑ ( )2
in gure 2.2(b). Since ∞ = π6 ,
2
1
j=1 j
n ( )2
∑ ∫ ∞
1 π2 1 π2 1
> − 2
dx = −
j=1
j 6 n x 6 n
n ( )2
∑ ∫ ∞
1 π2 1 π2 1
< − dx = −
j=1
j 6 n+1 x2 6 n+1
2 The variable n is an integer, but we can consider f (n, m) to be dened for any real value.
thus the lower bound π2

6 −n
1
approximates the nite sum up to n1 − n+1
1
= 1
n(n+1) . Substituting
this lower bound in (2.7), we obtain:
π2
− 1
f (n, m) ∼
=√ 6 n
n
m + π2
6 − 1
n
taking the derivative with respect to n and equating to zero yields m ∼

= n3 π 2 −18n2
nπ 2 −6 and thus
√
for a large n we obtain the power law m ∼
= n2 , or n ∼
= m = m 2 , and again we have
1
√1 ∼
= 1
= µn .
m n
( )
Example 2.3 Let µj = √1
j
, i.e µ = 1, √12 , . . . √1N . An illustration of the separation
∑n
between classes for a few choices of n is given in gure 2.2(c). Substituting j=1 µ2j ∼
= log n
in (2.7) we get:
log n
f (n, m) = √ n (2.8)
m + log n
taking the derivative with respect to n and equating to zero, we obtain:
n (log n − 2)
m∼
=
log n
thus for a large n we have m ∼
= n, and once again we have √1
m
∼
= √1
n
= µn .
This example was already analyzed by Trunk ([113]) who showed that for any nite
number of training examples, the error approaches one half when the number of features
approaches ∞. Here we get this results easily from equation (2.8), as for any nite m,
f (n, m) −−−−→ 0, thus the error term Φ (−f (n, m)) approaches 1/2. on the other hand,
n→∞
when µ is known the error approaches zero when n increases, since kµk −n→∞
−−−→ ∞ while the
variance is xed. Thus, from corollary 2.1 we know that by using m = n the error approaches
to zero when m grows, and our experiments show that it drops very fast (for m ∼ 20 it is
already below 5% and for m ∼ 300 below 1%).
Example 2.4 In this example we show that the property µj −j→∞

−−→ 0 does not guarantee that
∑n
feature selection can improve classication accuracy. Dene sn = j=1 µ2j . Let µj be such
that sn = nα with 1
2 < α < 1. Since α < 1, it follows that indeed µj → 0. On the other
hand, by substituting in (2.7) we get:
nα nα/2
f (n, m) = √ n α
= √
m +n n1−α
+1 m
2.4. SVM Performance 33
n=1 n=3 n→ ∞
(a)
n=1 n=7 n→∞
(b)
n=1 n=30 n=106
(c)
Figure 2.2: Illustration of the separation between classes for dierent number of features
(n), for dierent choices of µ. The projection on the prex of µ of the density function of
each class and the combined density (in gray) are shown. (a) for example 2.1, (b) for
example 2.2 and (c) for example 2.3.
thus for n > mα−1 ,

√
nα/2 m
f (n, m) ≥ √ = nα−1/2
n1−α
2 m 2
and since α > 1/2, we have that f (n, m) −n→∞

−−−→ ∞ and thus using all the features is optimal.
We can see that the intuitive relation between the value of the smallest feature we want to
take and √1
m
that raised in the previous examples does not hold here. this demonstrate that
thinking on this value as proportional to √1
m
may be misleading.
2.4 SVM Performance

So far we have seen that the naive maximum likelihood classier is impeded by using too
many weak features, even when all the features are relevant and independent. However
it is worth testing whether a modern and sophisticated classier such as SVM that was
designed to work in very high dimensional spaces can overcome the peaking phenomenon.
For this purpose we tested SVM on the above two Gaussian setting in the following way.
m = 16 m = 100 m = 225
0.5
0.25 0.24
0.4 0.22
0.3 0.2 0.2
0.2 0.18
0.16
0.1 0.15
0 50 100 0 50 100 0 50 100
0.5 0.16
0.18
0.4
0.16 0.14
0.3
0.14
0.12
0.2 0.12
0.1 0.1 0.1

0 50 100 0 50 100 0 50 100
0.4
0.15 0.15 SVM error
0.3 ML error
0.1 0.1 optimal error
0.2
SVM+scaling
0.1 0.05 0.05
0 0 0
0 500 1000 0 500 1000 0 500 1000
Figure 2.3: The error of SVM as a function of the number

√ of features
√ used. The
top, middle and bottom rows correspond to µj equals 1/ 2j , 1/j and 1/ j respectively.
The columns correspond to training set sizes of 16, 100 and 225. The SVM error was
estimated by averaging over 200 repeats. The graphs show that SVM produces the same
qualitative behavior as the ML classier we used in the analysis and that pre-scaling of the
features strengthen the eect of too many features on SVM.
We generated a training set with 1000 features, and trained linear

3 SVM on this training
set 1000 times, each time using a dierent number of features. Then we calculated the
generalization error of each returned classier analytically. The performance associated with
a given number of features n is the generalization error achieved using n features, averaged
over 200 repeats. We used the SVM tool-box by Gavin Cawley [18]. The parameter C was
tuned manually to be C = 0.0001, the value which is favorable to SVM when all (1000)
features are used. The results for the examples described in section 2.3 for three dierent
choices of training set size are presented in gure 2.3.
We can see that in this setting SVM suers from using too many features just like the
3 The best classier here is linear, thus linear kernel is expected to give best results.
2.4. SVM Performance 35
4
maximum likelihood classier . On the other hand, it is clear that in other situations SVM
does handle huge number of features well, otherwise it could not be used together with
kernels. Therefore, in order to understand why SVM fails here, we need to determine in
what way our high dimensional scenario is dierent from the one caused by using kernels.
The assumption which underlies the usage of the large margin principle, namely that the
density around the true separator is low, is violated in our rst example (example 2.1), but
not for the other two examples (see gure 2.2). Moreover, if we multiply the µ of example
2.1 by 2, then the assumption holds and the qualitative behavior of SVM does not change.
Hence this cannot be a major factor.
One might suggest that a simple pre-scaling of the features, such as dividing each features
5
by an approximation of its standard deviation , might help SVM. Such normalization is
useful in many problems, especially when dierent features are measured in dierent scales.
However, in our specic setting it is not likely that such normalization can improve the
accuracy, as it just suppress the more useful features. Indeed, our experiments shows (see
gure 2.3, dotted lines) that the pre-scaling strengthen the eect of too many features on
SVM.
One signicant dierence is that in our setting the features are statistically independent,
whereas when the dimension is high due to the usage of kernels, the features are highly
correlated. In other words, the use of kernels is equivalent to deterministic mapping of low
dimensional space to a high dimensional space. Thus there are many features, but the actual
dimension of the embedded manifold is low whereas in our setting the dimension is indeed
high. We ran one initial experiment which supported the assumption that this dierence
was signicant in causing the SVM behavior. We used SVM with a polynomial kernel of
increasing degrees in the above setting with µ = (1, 0), i.e. one relevant feature and one
irrelevant feature. The results are shown in gure 2.4. The accuracy declines as the degree
increases as expected (since the best model here is linear). However, the eect of the kernel
degree on the dierence in accuracy using one or both features is not signicant, despite the
4 This strengthen the observation in [46] (page 384) that additional noise features hurts SVM performance.
5 Given the class, the standard deviation is the same for all the features (equals to 1), but the overall
standard deviation is larger for the rst features, as in the rst coordinates the means of the two classes a
more well apart.
0.4
using first feature

generalization error
0.3 using both features
0.2
0.1
0
1 2 3 4 5 6 7 8 9 10
kernel degree
Figure 2.4: The eect of the degree of a polynomial kernel on SVM error
µ = (1, 0) and the training set size is m = 20. The results were averaged over 200 repeats.
The eect of the kernel degree on the dierence in accuracy using one or both features is
not signicant.
fact that the number of irrelevant features grows exponentially with the kernel degree .
6
2.5 Summary
We started this chapter by asking the question whether feature selection is still needed even
when we only interested in classication accuracy, or maybe modern classiers can handle
the presence of a huge number of features well and only lose from the split of the learning
process into two stages (feature selection and learning of a classier). Using a simple setting
of two spherical Gaussians it was shown that in some situations using too many features
results in error as bad as chance while a wise choice of the number of features results in
almost optimal error. We showed that the most prominent modern classier, SVM, does not
handle large numbers of weakly relevant features correctly and achieves suboptimal accuracy,
much like a naive classier. We suggest that the ability of SVM to work well in the presence
of huge number of features may be restricted to cases where the underlying distribution is
concentrated around a low dimensional manifold, which is the case when kernels are used.
However, this issue should be further investigated. Thus we conclude that feature selection
is indeed needed, even if nothing but classication accuracy interests us.
This chapter focused on feature selection, but the fundamental question addressed here
6 When a polynomial kernel of degree k is used, only k features are relevant whereas 2k − k are irrelevant.
2.5. Summary 37
is relevant to the broader question of general dimensionality reduction. In this setting one
looks for any conversion of the data to a low dimensional subspace that preserves the relevant
properties. Thus one should ask in what way the optimal dimension depends on the number
of training instances. We expect that a similar trade-o can be found.

Chapter 3
Margin Based Feature Selection1
1 The results presented in this chapter were rst presented in our NIPS02 paper titled Margin Analysis
of the LVQ algorithm [24], our ICML04 paper titled Margin Based Feature Selection [40] and a chapter
titled Large Margin Principles for Feature Selection in the book Feature extraction, foundations and
applications, edited by Guyon, I. and Gunn, S. and Nikravesh, M. and Zadeh, L. [41]
38
39
After describing the main concepts in feature selection and the prime rationales, (see
section 1.2 and chapter 2), this chapter discusses the ways feature selection is carried out.
New methods of feature selection for classication based on the maximum margin principle
are introduced. A margin [14, 100] is a geometric measure for evaluating the condence of a
classier with respect to its decision. Margins already play a crucial role in current machine
learning research. For instance, SVM [14] is a prominent large margin algorithm. The main
novelty of the results presented in this chapter is the use of large margin principles for feature
selection. Along the way we also present the new observation that there are two types of
margins (sample margin and hypothesis margin ), and dene these two types of margins for
the One Nearest Neighbor (1-NN) [32] algorithm. We also develop a theoretical reasoning
for the 20 year old LVQ [63] algorithm (see section 3.8)
Throughout this chapter the 1-NN is used as the study-case predictor, but most of the
results are relevant to other distance based classiers (e.g. LVQ [63], SVM-RBF [14]) as
well. To demonstrate this, we compare our algorithms to the R2W2 algorithm [120], which
was specically designed as a feature selection scheme for SVM. We show that even in this
setting our algorithms compare favorably to all the contesters in the running..
The use of margins allows us to devise new feature selection algorithms as well as prove
a PAC (Probably Approximately Correct) style generalization bound. The bound is on
the generalization accuracy of 1-NN on a selected set of features, and guarantees good
performance for any feature selection scheme which selects a small set of features while
keeping the margin large. On the algorithmic side, we use a margin based criterion to
measure the quality of sets of features. We present two new feature selection algorithms,
G-ip and Simba, based on this criterion. The merits of these algorithms are demonstrated
on various datasets. Finally, we study the Relief feature selection algorithm [61] in the large
margin context. While Relief does not explicitly maximize any evaluation function, it is
shown here that implicitly it maximizes the margin based evaluation function.
Before we present the main results, we briey review the key concepts in 1-NN and
margins.
40 Chapter 3. Margin Based Feature Selection
3.1 Nearest Neighbor classiers
Though fty years have passed since the introduction of One Nearest Neighbor (1-NN) [32]
it is still a popular algorithm. 1-NN is a simple and intuitive algorithm but at the same
time may achieve state of the art results [105]. During the training process 1-NN simply
stores the given training instances (also referred to as prototypes ) without any processing.
When it required to classify a new instance, it returns the label of the closest instance
in the training set. Thus, 1-NN assumes that there is a way to measure distance between
instances. If the instances are given as vectors in Rn , we can simply use a standard norm (e.g.
l2 norm). However in many cases using more sophisticated distance measures that capture
our knowledge of the problem may improve the results dramatically. It is also possible
to learn a good distance measure from the training set itself. Many distance measures
have been suggested for dierent problems in the past. One example of the usage of such
sophisticated distance measure is the tangent distance that was employed by [104] for digit
recognition. This distance measure takes into account the prior knowledge that the digit
identity is invariant to small transformations such as rotation, translation etc. Using 1-NN
together with this distance measure [104] achieved state-of-the-art results on the USPS data
set [69].
However in large, high dimensional data sets it often becomes unworkable. One approach
to cope with this computational problem is to approximate the nearest neighbor [52] using
various techniques. An alternative approach is to choose a small data-set (aka prototypes)
which represents the original training sample, and apply the nearest neighbor rule only
with respect to this small data-set. This solution maintains the spirit of the original
algorithm, while making it feasible. Moreover, it might improve the accuracy by reducing
noise overtting. We elaborate on these methods in section 3.8.
Another possible solution for noise overtting is to use k -NN, which considers the k
nearest neighbors and takes the majority (for classication) or the mean (for regression)
over the labels of the k nearest neighbors. Here it is also possible to weigh the eect of
each of the k neighbors according to its distance. 1-NN ensures error which is not more
than twice the optimal possible error in the limit of innite number of training instances
3.2. Margins 41
[22]. k -NN, when is used with k = k (m) (where m is the number of training instances) that
satises k −−−−→ ∞ and

k
m −−−−→ 0 ensures optimal error in the limit m→∞ [28]. In
m→∞ m→∞
this sense k -NN is consistent. See [112] for a recent review on Nearest Neighbor and ways
to approximate it.
In the following we will use 1-NN as the study case predictor in developing our feature
selection methods and will show how feature selection can improve the performance of 1-NN.
3.2 Margins
The concept of margins play an important role in current research in machine learning. A
margin measures the condence a classier when making its predictions. Margins are used
both for theoretic generalization bounds and as guidelines for algorithm design. Theoret-
ically speaking, margins are a main component in data dependent bounds [103, 9] on the
generalization error of a classier. Data dependent bounds use the properties of a specic
training data to give bounds which are tighter than the standard bounds that hold for any
training set and depend only on its size (and the classier properties). Since in many cases
the margins give a better prediction of the generalization error than the training error itself,
they can be used by learning algorithms as a better measure for a classier quality. We
can consider margins as a stronger and more delicate version of the training error. Two
of the most prominent algorithms in the eld, Support Vector Machines (SVM) [116] and
AdaBoost [33] are motivated and analyzed by margins. Since the introduction of these algo-
rithms dozens of papers have been published on dierent aspects of margins in supervised
learning [100, 81, 15].
The most common way to dene the margin of an instance with respect to a classication
rule is as the distance between the instance and the decision boundary induced by the
classication rule. SVM uses this kind of margin. We refer to this kind of margin as sample
margin. However, as we show in [24], an alternative denition, Hypothesis Margin, exists.
In this denition the margin is the distance that the classier can travel without changing
the way it labels a sample instance. Note that this denition requires a distance measure
between classiers. This type of margin is used in AdaBoost [33]. The dierence between
(a) (b)
Figure 3.1: Illustration of the two types of margins for homogeneous linear classiers with
angle as distance measure between classiers. (a) sample margin measures how much an
instance can travel before it hits the decision boundary. (b) on the other hand a
hypothesis margin measures how much the hypothesis can travel before it hits an
instance.
these two denitions of margin is illustrated in gure 3.1.
While the margins themselves are a non-negative, the standard convention is to add a
sign to them that indicates whether the instance was correctlly calssied, where a negative
sign indicates wrong classication. We use the terms signed margin or unsigned margin to
indicate whether the sign is considered or not, but this prex is omitted in the following
whenever it is clear from the context. Note that the signed margin depends on the true label
and therefore is only available during the training stage.
3.3 Margins for 1-NN

Throughout this chapter we will be interested in margins for 1-NN. As described in sec-
tion 3.1, a 1-NN classier is dened by a set of training points (prototypes) and the decision
boundary is the Voronoi tessellation. The sample margin in this case is the distance between
the instance and the Voronoi tessellation, and therefore it measures the sensitivity to small
3.3. Margins for 1-NN 43
changes of the instance position. We dene the hypothesis margin for 1-NN as follows:
Denition 3.1 The (unsigned) hypothesis margin for 1-NN with respect to an instance x
is the maximal distance θ such that the following condition holds: if we draw a ball with
radius θ around each prototype, any change of the location of prototypes inside their θ ball
will not change the assigned label of the instance x.
Therefore, the hypothesis margin measures stability with respect to small changes in the
prototype locations. See gure 3.2 for illustration. The sample margin for 1-NN can be
unstable since small relocations of the prototypes might lead to a dramatic change in the
sample margin. Thus the hypothesis margin is preferable in this case. Furthermore, the
hypothesis margin is easy to compute (lemma 3.1) and lower bounds the sample-margin
(lemma 3.2).
Lemma 3.1 Let ν = (ν1 , . . . , νk ) be a set of prototypes. Let x be an instance then the
(unsigned) hypothesis margin of ν with respect to x is θ = 12 (kνj − xk − kνi − xk) where νi
is the closest prototype to x with the same label as x and νj is the closest prototype with
alternative label.
The proof of this lemma is straightforward from the denition.
Lemma 3.2 Let x be an instance and ν = (ν1 , . . . , νk ) be a set of prototypes. Then the (un-
signed) sample margin of x with respect to ν is greater or equal to the (unsigned) hypothesis
margin of ν with respect to x.
Proof. Let θ be larger than the sample margin, i.e. there exists x̂ such that kx − x̂k ≤ θ but
x and x̂ are labeled dierently by ν . Since they are labeled dierently, the closest prototype
to x is dierent than the closest prototype to x̂. W.L.G let ν1 be the closest prototype to x
and ν2 be the closest prototype to x̂.
Let w = x − x̂ then kwk ≤ θ. Dene ∀j ν̂j = νj + w. We claim that ν̂2 is the closest
prototype to x in ν̂ . This follows since
kx − ν̂j k = kx − (νj + w)k = kx̂ − νj k

(a) (b)
Figure 3.2: The two types of margins for the Nearest Neighbor rule. We consider a set of
prototypes (the circles) and measure the margin with respect to an instance (the square).
(a) the (unsigned) sample margin is the distance between the instance and the decision
boundary (the Voronoi tessellation). (b) the (unsigned) hypothesis margin is the largest
distance the prototypes can travel without altering the label of the new instance. In this
case it is half the dierence between the distance to the nearmiss and the distance to the
nearhit of the instance. Note that in this example the white prototype governs the
margin is not the same for both types of margins: for the hypothesis margin it is the one
closest to the new instance, whereas for the sample margin it is the one that denes the
relevant segment of the Voronoi tessellation.
Hence ν̂ assigns x a dierent label and also no prototype traveled a distance which is larger
than θ, thus the hypothesis margin is less than θ.
Since this result holds for any θ greater than the sample margin of ν we conclude that
the hypothesis margin of ν is less than or equal to its sample margin.
Lemma 3.2 shows that if we nd a set of prototypes with a large hypothesis margin then
it has a large sample margin as well.
3.4 Margin Based Evaluation Function

Our selection algorithm works on the most common selection paradigm that consists of an
evaluation function which assigns a score to a given subset of features and a searches method
that search for a set with a high score (see section 1.2.1 for more details on this paradigm).
In this section we introduce our margin based evaluation function.
A good generalization can be guaranteed if many instances have a large margin (see
3.4. Margin Based Evaluation Function 45
section 3.6). We introduce an evaluation function, which assigns a score to sets of features
according to the margin they induce. First we formulate the margin as a function of the
selected set of features.
Denition 3.2 Let P be a set of prototypes and x be an instance. Let w be a weight vector
over the feature set, then the margin of x is
1
θPw (x) = (kx − nearmiss (x)kw − kx − nearhit (x)kw ) (3.1)
2
√∑
where kzkw = i wi2 zi2 , nearhit (x) is the closest prototype to x with same label as x
and nearmiss (x) is the closest prototype to x with alternative label
Denition 3.2 extends beyond feature selection and allows weight over the features. When
selecting a set of features F we can use the same denition by identifying F with its indicating
vector. Therefore, we denote by θPF (x) := θPIF (x) where IF is one for any feature in F and
zero otherwise.
Since θλw (x) = |λ|θw (x) for any scalar λ, it is natural to introduce some normalization
factor. The natural normalization is to require max wi2 = 1, since it guarantees that kzkw ≤
kzk where the right hand side is the Euclidean norm of z.
Now we turn to dening the evaluation function. The building blocks of this function are
the margins of all the instances. The margin of each instance x is calculated with respect
to the sample excluding x (leave-one-out margin).
Denition 3.3 Let u(·) be a utility function. Given a training set S and a weight vector
w, the evaluation function is:
∑ ( )
w
e(w) = u θS\x (x) (3.2)
x∈S
The utility function controls the contribution of each margin term to the overall score. It
is natural to require the utility function to be non-decreasing; thus larger margin introduce
larger utility. We consider three utility functions: linear, zero-one and sigmoid. The linear
utility function is dened as u(θ) = θ. When the linear utility function is used, the evalu-
ation function is simply the sum of the margins. The zero-one utility is equals 1 when the
margin is positive and 0 otherwise. When this utility function is used the utility function
is proportional to the leave-one-out error. The sigmoid utility is u(θ) = 1/(1 + exp(−βθ)).
The sigmoid utility function is less sensitive to outliers than the linear utility, but does not
ignore the magnitude of the margin completely as the zero-one utility does. Note also that
for β→0 or β→∞ the sigmoid utility function becomes the linear utility function or the
zero-one utility function respectively. In the Simba algorithm we assume that the utility
function is dierentiable, and therefore the zero-one utility cannot be used.
It is natural to look at the evaluation function solely for weight vectors w such that
max wi2 = 1. However, formally, the evaluation function is well dened for any w, a fact
which we make use of in the Simba algorithm. We also use the notation e(F ), where F is a
set of features to denote e(IF ).
3.5 Algorithms
In this section we present two algorithms which attempt to maximize the margin based
evaluation function. Both algorithms can cope with multi-class problems. Our algorithms
can be considered as lter methods for general classiers. They also have much in common
with wrappers for 1-NN. A Matlab implementation of these algorithms is available at http:
//www.cs.huji.ac.il/labs/learning/code/feature_selection/. I also wrote a Matlab GUI
application that was designed to make it easy to perform feature selection and assess the
eect of the selection on classication accuracy. The tool is called Feature Selection Tool
(FST) and can be downloaded from http://www.cs.huji.ac.il/~anavot/feature_selection_
tool/fst.htm. The tool is mainly aimed at make it easier for researchers who are not
programmers to use and evaluate feature selection on their data. The tool supports the
selection algorithms which are presented in this chapter and some other related or classic
selection algorithms. The classication algorithms which are currently supported are 1-
Nearest-Neighbors, Naive Bayes and SVM.
3.5.1 Greedy Feature Flip Algorithm (G-ip)

The G-ip (algorithm 1) is a greedy search algorithm for maximizing e(F ), where F is a set of
features. The algorithm repeatedly iterates over the feature set and updates the set of chosen
3.5. Algorithms 47
Algorithm 1 Greedy Feature Flip (G-ip)

1. Initialize the set of chosen features to the empty set: F =φ
2. for t = 1, 2, . . .
(a) pick a random permutation s of {1 . . . N }

(b) for i=1 to N,
i. evaluate e1 = e (F ∪ {s(i)}) and e2 = e (F \ {s(i)})
ii. if e1 > e2 , F = F ∪ {s(i)}
else-if e2 > e1 , F = F \ {s(i)}
(c) if no change made in step (b) then break
features. In each iteration it decides to remove or add the current feature to the selected
set by evaluating the margin term (3.2) with and without this feature. This algorithm is
similar to the zero-temperature Monte-Carlo (Metropolis) method. It converges to a local
maximum of the evaluation function, as each step increases its value and the number of
possible feature sets is nite. The computational complexity of one pass over all features of
( )
a naive implementation of G-ip is Θ N 2 m2 where N m is
is the number of features and
( 2
)
the number of instances. However the complexity can be reduced to Θ N m since updating
the distance matrix can be done eciently after each addition/deletion of a feature from the
current active set. Empirically G-ip converges in a few iterations. In all our experiments
it converged after less than 20 epochs, in most of the cases in less than 10 epochs. A nice
property of this algorithm is that once the utility function is chosen, it is parameter free.
There is no need to tune the number of features or any type of threshold.
3.5.2 Iterative Search Margin Based Algorithm (Simba)

The G-ip algorithm presented in section 3.5.1 tries to nd the feature set that maximizes
the margin directly. Here we take another approach. We rst nd the weight vector w that
maximizes e(w) as dened in (3.2) and then use a threshold in order to get a feature set. Of
course, it is also possible to use the weights directly by using the induced distance measure
instead. Since e(w) is smooth almost everywhere, whenever the utility function is smooth,
we use gradient ascent in order to maximize it. The gradient of e(w) when evaluated on a
Algorithm 2 Simba
1. initialize w = (1, 1, . . . , 1)
2. for t = 1...T
(a) pick randomly an instance x from S

(b) calculate nearmiss (x) and nearhit (x) with respect to S \ {x} and the weight
vector w.
(c) for i = 1, . . . , N
calculate
( )
(xi − nearmiss (x)i ) (xi − nearhit (x)i )
2 2
1 ∂u(θ(x))
4i = − wi
2 ∂θ(x) kx − nearmiss (x)kw kx − nearhit (x)kw
(d)w =w+4
° °
3. w ← w2 / °w2 °∞ where (w2 )i := (wi )2 .
sample S is:
∂e(w) ∑ ∂u(θ(x)) ∂θ(x)

(5e(w))i = = (3.3)
∂wi ∂θ(x) ∂wi
x∈S
( )
1 ∑ ∂u(θ(x)) (xi − nearmiss (x)i ) (xi − nearhit (x)i )
2 2
= − wi
2 ∂θ(x) kx − nearmiss (x)kw kx − nearhit (x)kw
x∈S
In Simba (algorithm 2) we use a stochastic gradient ascent over e(w) while ignoring the
constraint kw2 k∞ = 1. In each step we evaluate only one term in the sum in (3.3) and add
it to the weight vector w. The projection on the constraint is done only at the end (step 3).
The computational complexity of Simba is Θ(T N m) where T is the number of iterations,
N is the number of features and m is the size of the sample S. Note that when iterating
( )
over all training instances, i.e. when T = m, the complexity is Θ N m2 .
3.5.3 Comparison to Relief

Relief [61] (algorithm 3) is a feature selection algorithm, which was shown to be very e-
cient for estimating feature quality. The algorithm holds a weight vector over all features
and updates this vector according to the instances presented. [61] proved that under some
assumptions, the expected weight is large for relevant features and small for irrelevant ones.
They also explain how to choose the relevance threshold τ in a way that ensures the prob-
3.5. Algorithms 49
Algorithm 3 RELIEF [61]

1. initiate the weight vector to zero: w=0
2. for t = 1...T,

(b) for i = 1 . . . N,
wi = wi + (xi − nearmiss (x)i ) − (xi − nearhit (x)i )
2 2
i.
3. the chosen feature set is {i|wi > τ } where τ is a threshold
ability that a given irrelevant feature chosen is small. Relief was extended to deal with
multi-class problems, noise and missing data by [65]. For multi-class problems [65] also
presents a version called Relief-F that instead of using the distance to the nearest point
with an alternative label, looks at the distances to the nearest instance of any alterna-
tive class and takes the average. In the experiments we made Relief-F was inferior to the
standard Relief.
Note that the update rule in a single step of Relief is similar to the one performed by
Simba when the utility function is linear, i.e. u(θ) = θ and thus ∂u(θ)/∂θ = 1. Indeed,
empirical evidence shows that Relief does increase the margin (see section 3.7). However,
there is a major dierence between Relief and Simba : Relief does not re-evaluate the dis-
tances according to the weight vector w and thus it is inferior to Simba. In particular, Relief
has no mechanism for eliminating redundant features. Simba may also choose correlated
features, but only if this contributes to the overall performance. In terms of computational
complexity, Relief and Simba are equivalent.
3.5.4 Comparison to R2W2

R2W2 [120] is the state-of-the-art feature selection algorithm for the Support Vector Ma-
chines (SVM) classier. This algorithm is a sophisticated wrapper for SVM and therefore
uses the maximal margin principle for feature selection indirectly. The goal of the algorithm
is to nd a weights vector over the features, which will minimize the objective function of
the SVM optimization problem. This objective function can be written as R2 W 2 where
R is the radius of a ball containing all the training data and W is the norm of the linear
separator. The optimization is done using gradient descent. After each gradient step a new
SVM optimization problem is constructed and solved. Thus it becomes cumbersome for
large scale data.
The derivation of R2W2 algorithm assumes that the data are linearly separable. Since
this cannot be guaranteed in the general case we use the ridge trick of adding a constant
value to the diagonal of the kernel matrix. Note also that R2W2 is designed for binary
classication tasks only. There are several ways in which it can be extended to multi class
problems. However, these extensions will make the algorithm even more demanding than
its original version.
As in SVM, R2W2 can be used together with a kernel function. We chose to use the
Radial Basis Function (RBF) kernel. The RBF kernel is dened to be
kx1 −x2 k
K(x1 , x2 ) = e− 2σ 2
where σ is a predened parameter. The choice of the RBF kernel is due to the similarity
between SVM with RBF kernel and the nearest-neighbor rule. Our implementation is based
on the one in the Spider package [119].
3.6 Theoretical Analysis

In this section we use feature selection and large margin principles to prove nite sample
generalization bounds for One Nearest Neighbor (1-NN), combined with the preceding step
of feature selection. [22] showed that asymptotically the generalization error of 1-NN can
exceed the generalization error of the Bayes optimal classication rule by at most a factor of
2. However, on nite samples, nearest neighbor can over-t and exhibit poor performance.
1-NN gives zero training error, on almost any sample. Thus the training error is too
rough to provide information on the generalization performance of 1-NN. We therefore need
a more detailed measure to provide meaningful generalization bounds and this is where
margins become useful. It turns out that in a sense 1-NN is a maximum margin algorithm.
Once our proper denition of margin is used, i.e. sample-margin, it is easy to verify that
3.6. Theoretical Analysis 51
1-NN generates the classication rule with the largest possible margin.
The combination of a large margin and a small number of features provides enough
evidence to obtain a useful bound on the generalization error. The bound we provide here is
data-dependent [103, 9]. Therefore, the value of the bound depends on the specic sample.
The bound holds simultaneously for any possible method to select a set of features. Thus, if
a selection algorithm selects a small set of features with a large margin, the bound guarantees
that 1-NN that is based on this set of features will generalize well. This is the theoretical
rationale behind Simba and G-ip.

We use the following notation in our theoretical results:
Denition 3.4 Let D be a distribution over X × {±1} and h : X −→ {±1} a classication

function. We denote by erD (h) the generalization error of h with respect to D:
erD (h) = Pr [h(x) 6= y]

x,y∼D
For a sample S = {(xk , yk )}m

k=1 ∈ (X × {±1})
m
and a constant γ > 0 we dene the γ -
sensitive training error to be
1 ¯¯{ }¯
er
ˆ γS (h) = ¯ (k : h(xk ) 6= yk ) or (xk has sample-margin < γ) ¯¯
m
Our main result is the following theorem :

2
Theorem 3.1 Let D be a distribution over RN ×{±1} which is supported on a ball of radius
R in RN . Let δ > 0 and let S be a sample of size m such that S ∼ Dm . With probability
1−δ over the random choice of S , for any set of features F and any γ ∈ (0, 1]
√ ( ( ) ( ) )
2 34em 8
erD (h) ≤ er (h) +
ˆ γS d ln log2 (578m) + ln + (|F | + 1) ln N
m d γδ
Where h is the nearest neighbor classication rule when distance is measured only on the
features in F and d = (64R/γ)|F | .
A few notes about this bound; First the size of the feature space, N, appears only
logarithmically in the bound. Hence, it has a minor eect on the generalization error of 1-
NN. On the other hand, the number of selected features, F, appears in the exponent. This
2 Note that the theorem holds when sample-margin is replaced by hypothesis-margin since the later lower
bounds the former.
is another realization of the curse of dimensionality [12]. See appendix A for the proof of
theorem 3.1.
A large margin for many instances will make the rst term of the bound small, while
using a small set of features will make the second term of the bound small. This gives us
the motivation to look for small sets of features that induce large margin, and that is what
G-ip and Simba do. As this bound is a worst case bound, like all the PAC style bounds, it
is very loose in most of the cases, and the empirical results are expected to be much better.
3.7 Empirical Assessment

We rst demonstrate the behavior of Simba on a small synthetic problem. Then we compare
the dierent algorithms on image and text classication tasks. The rst task is pixel (feature)
selection for discriminating between male and female face images. The second task is a
word (feature) selection for multi-class document categorization. In order to demonstrate
the ability of our algorithms to work with other classiers (beside of Nearest Neighbor) we
also report results with SVM with RBF kernel (see section 3.7.4). We also report the results
obtained on some of the datasets of the NIPS-2003 feature selection challenge [44]. For these
comparisons we have used the following feature selection algorithms: Simba with both linear
and sigmoid utility functions (referred as Simba(lin) and Simba(sig) respectively), G-ip
with linear, zero-one and sigmoid utility functions (referred as G-ip(lin), G-ip(zero-one)
and G-ip(sig) respectively), Relief, R2W2 and Infogain 3 .
3.7.1 The Xor Problem

To demonstrate the quality of the margin based evaluation function and the ability of the
Simba algorithm4 to deal with dependent features we use a synthetic problem. The problem
consisted of 1000 instances with 10 real valued features. The target concept is a xor function
over the rst 3 features. Hence, the rst 3 features are relevant while the other features are
irrelevant. Note that this task is a special case of parity function learning and is considered
3 Recall that Infogain ranks features according to the mutual information between each feature and the
labels (see chapter 1, section 1.2.1)
4 The linear utility function was used in this experiment.
3.7. Empirical Assessment 53
Figure 3.3: The results of applying Simba (solid) and Relief (dotted) on the xor synthetic
problem. Top: The margin value, e(w), at each iteration. The dashed line is the margin
of the correct weight vector. Bottom: the angle between the weight vector and the
correct feature vector at each iteration (in Radians).
hard for many feature selection algorithms [43]. Thus for example, any algorithm which does
not consider functional dependencies between features fails on this task. The simplicity
(some might say over-simplicity) of this problem, allows us to demonstrate some of the
interesting properties of the algorithms studied.
Figures 3.3 present the results we obtained on this problem. A few phenomena are
apparent in these results. The value of the margin evaluation function is highly correlated
with the angle between the weight vector and the correct feature vector (see gures 3.3
and 3.4). This correlation demonstrates that the margins characterize correctly the quality
of the weight vector. This is quite remarkable since our margin evaluation function can be
measured empirically on the training data whereas the angle to the correct feature vector is
unknown during learning.
As suggested in section 3.5.3 Relief does increase the margin as well. However, Simba,
which maximizes the margin directly, outperforms Relief quite signicantly. as shown in
gure 3.3.
Figure 3.4: The scatter plot shows the angle to the correct feature vector as function of the
value of the margin evaluation function. The values were calculated for the xor problem
using Simba during iterations 150 to 1000. Note the linear relation between the two
quantities.
Figure 3.5: Excerpts from the face images dataset.
3.7.2 Face Images

We applied the algorithms to the AR face database [80], which is a collection of digital
images of males and females with various facial expressions, illumination conditions, and
occlusions. We selected 1456 images and converted them to gray-scale images of 85 × 60
pixels, which are taken as our initial 5100 features. Examples of the images are shown in
gure 3.5. The task we tested is classifying the male vs. the female faces.
In order to improve the statistical signicance of the results, the dataset was partitioned
independently 20 times into training data of 1000 images and test data of 456 images. For
each such partitioning (split) Simba 5 , G-ip, Relief, R2W2 and Infogain were applied to
select optimal features and the 1-NN algorithm was used to classify the test data points. We
used 10 random starting points for Simba (i.e. random permutations of the train data) and
selected the result of the single run which reached the highest value of the evaluation function.
5 Simba was applied with both linear and sigmoid utility functions. We used β = 0.01 for the sigmoid
utility.
The average accuracy versus the number of features chosen, is presented in gure 3.6. G-
ip gives only one point on this plot, as it chooses the number of features automatically,
although this number changes between dierent splits, it does not change signicantly.
The features G-ip (zero-one) selected enabled 1-NN to achieve accuracy of 92.2% using
about 60 features only, which is better than the accuracy obtained with the whole feature
set (91.5%). G-ip (zero-one) outperformed any other alternative when only few dozens
of features were used. Simba(lin) signicantly outperformed Relief, R2W2 and Infogain,
especially in the small number of features regime. If we dene a dierence as signicant if
the one algorithm is better than the other in more than 90% of the partition, we can see
that Simba is signicantly better than Relief, Infogain and R2W2 when fewer than couple of
hundreds of features are being used (see gure 3.7). Moreover, the 1000 features that Simba
selected enabled 1-NN to achieve an accuracy of 92.8%, which is better than the accuracy
obtained with the whole feature set (91.5%).
A closer look on the features selected by Simba and Relief (gure 3.8) reveals the clear
dierence between the two algorithms. Relief focused on the hair-line, especially around
the neck, and on other contour areas in a left-right symmetric fashion. This choice is
suboptimal as those features are highly correlated to each other and therefore a smaller
subset is sucient. Simba on the other hand selected features in other informative facial
locations but mostly on one side (left) of the face, as the other side is clearly highly correlated
and does not contribute new information to this task. Moreover, this dataset is biased in
the sense that more faces are illuminated from the right. Many of them are saturated and
thus Simba preferred the left side over the less informative right side.
3.7.3 Reuters
We applied the dierent algorithms on a multi-class text categorization task. For these
purpose we used a subset of the Reuters-21578 dataset .

6 We have used the documents,
which are classied to exactly one of the following 4 topics: interest, trade, crude and grain.
The obtained dataset contains 2066 documents, which are approximately equally distributed
between the four classes. Each document was represented as the vector of counts of the
6 The dataset can be found at http://www.daviddlewis.com/resources/

Figure 3.6: Results for AR faces dataset. The accuracy achieved on the AR faces
dataset when using the features chosen by the dierent algorithms. The results were
averaged over the 20 splits of the dataset. For the sake of visual clarity, error bars are
presented separately in gure 3.7.
Figure 3.7: Error intervals for AR faces dataset. The accuracy achieved on the AR
faces dataset when using the features chosen by the dierent algorithms. The error
intervals show the area were 90% of the results (of the 20 repeats) fell, i.e., the range of the
results after eliminating the best and the worse iterations out of the 20 repeats.
(a) (b) (c) (d) (e) (f )
Figure 3.8: The features selected (in black) by Simba(lin) and Relief for the face
recognition task. 3.8(a), 3.8(b) and 3.8(c) shows 100, 500 and 1000 features selected by
Simba. 3.8(d), 3.8(e) and 3.8(f ) shows 100, 500 and 1000 features selected by Relief.
dierent words. Stop-words were omitted and numbers were converted to a predened
special character as a preprocessing.
To improve the statistical signicance of our results, the corpus was partitioned 20 times
into training set of 1000 documents and test set of 1066 documents. For each such partition-
ing, the words that appear less than 3 times in the training set were eliminated, which left
∼4000 words (features). For each partitioning G-ip, Simba, Relief, Relief-F and Infogain
were applied to select an optimal set of features
7 and the 1-NN was used to classify the test
documents. Simba and G-ip were applied with both linear and sigmoid utility functions .
8
G-ip was also applied with the zero-one utility function. We have used 10 random starting
points for Simba and selected the results of the single run that achieved the highest value
for the evaluation function. The average (over the 20 splits) accuracy versus the number
of chosen features is presented in gure 3.9. G-ip gives only one point on this plot, as it
chooses the number of features automatically, although this number changes between dier-
ent splits, it does not change signicantly. Another look on the results is given in gure 3.10.
This gure shows error intervals around the average that allow appreciating the statistical
signicance of the dierences in the accuracy. The top ranked twenty features are presented
in table 3.1.
The best overall accuracy (i.e. when ignoring the number of features used) was achieved
by G-ip(sig). G-ip(sig) got 94.09% generalization accuracy using ∼350 features. Infogain
and Simba(sig) are just a little behind with 92.86% and 92.41% , that was achieved using
∼40 and ∼30 features only (respectively). Relief is far behind with 87.92% that was achieved
using 250 features.
The advantage of Simba(sig) is very clear when looking on the accuracy versus the
number of features used. Among the algorithms that were tested, Simba(sig) is the only
algorithm that achieved (almost) best accuracy over the whole range. Indeed, when less
than few dozens of features are used, Infogain, achieved similar accuracy, but when more
features are used, it's accuracy drops dramatically. While Simba(sig) achieved accuracy
above 90% for any number of features between ∼20 and ∼3000, the accuracy achieved by
7 R2W2 was not applied for this problem as it is not dened for multi-class problems.
8 The sigmoid utility function was used with β = 1 for both G-ip and Simba.
Simba(sig) Simba(lin) Relief Infogain

oil rate oil corn ## s oil trade
wheat bpd ## brazil # bank tonnes wheat
trade days tonnes wish ### rate crude bank
tonnes crude trade rebate oil dlrs rate agriculture
corn japanese wheat water trade barrels grain barrels
bank decit bank stocks #### rates petroleum rates
rice surplus rice certicates billion bpd pct corn
grain indonesia rates recession opec pct taris japan
rates interest billion got tonnes barrel mt decit
ec s pct ve wheat cts energy bpd
Table 3.1: The rst 20 words (features) selected by the dierent algorithms for the Reuters
dataset. Note that Simba(sig) is the only algorithm which selected the titles of all four
classes (interest, trade, crude, grain) among the rst twenty features
Infogain dropped below 90% when more than ∼400 features are used, and below 85% when
more than ∼1500 are used.
The advantage of Simba(sig) over relief is also very clear. The accuracy achieved using
the features chosen by Simba(sig) is notably and signicantly better than the one achieved
using the features chosen by Relief, for any number of selected features. Using only the top
10 features of Simba(sig) yields accuracy of 89.21%, which is about the same as the accuracy
that can be achieved by any number of features chosen by Relief.
3.7.4 Face Images with Support Vector Machines

In this section we show that our algorithms work well also when using another distance
based classier, instead of the Nearest Neighbor classier. We test the dierent algorithms
together with SVM with RBF kernel classier and show that our algorithms works as good
as, and even better than the R2W2 algorithm that was tailored specically for this setting
and is much more computationally demanding.
We have used the AR face database [80]. We have repeated the same experiment as
described in section 3.7.2, the only dierence being that once the features were chosen, we
used SVM-RBF to classify the test data points. The sigma parameter used in the RBF
kernel was selected to be 3500, the same parameter used in the R2W2 feature selection
algorithm. The value for this parameter was tuned using cross-validation. See gure 3.11
100
95
90
85
Accuracy (%)
80
G−flip (lin)
75
G−flip (zero−one)
70 G−flip (sig)
Simba (lin)
65
Simba (sig)
60 Relief
Relief−F
55
Infogain
50
1 2 3
10 10 10
Number of Selected Features
Figure 3.9: Results for Reuters dataset. The accuracy achieved on the Reuters dataset
when using the features chosen by the dierent algorithms. The results were averaged over
the 20 splits of the dataset. For the sake of visual clarity, error bars are presented
separately in gure 3.10.
100
Accuracy (%)
90
80
Relief
70 Simba (lin)
60 Simba (sig)
Infogain
50
10 50 100 500
Number of Selected Features
Figure 3.10: Error intervals for Reuters dataset. The accuracy achieved on the
Reuters dataset when using the features chosen by the dierent algorithms. The error
intervals show the area were 90% of the results (of the 20 repeats) fell, i.e., the range of the
results after eliminating the best and the worse iterations out of the 20 repeats.
Figure 3.11: Results for AR faces dataset with SVM-RBF classier. The accuracy
achieved on the AR faces dataset when using the features chosen by the dierent
algorithms and SVM-RBF classier. The results were averaged over the 20 splits of the
dataset
for a summary of the results.
Both Simba and G-ip perform well, especially in the small number of features regime.
The results in the graph are the average over the 20 partitions of the data. Note that the
only winnings which are 90% signicant are those of Simba (lin) and G-ip (zero-one) when
only few dozens of features are used.
3.7.5 The NIPS-03 Feature Selection Challenge

The NIPS-03 feature selection challenge [44] was the prime motivator to develop the selection
algorithms presented in this chapter. The rst versions of G-ip were developed during the
challenge and Simba was developed following the challenge. We applied G-ip(lin) as part
3.8. Relation to Learning Vector Quantization 63
of our experiments in the challenge. The two datasets on which we used G-ip are ARCENE
and MADELON.
In ARCENE we rst used Principal Component Analysis (PCA) as a preprocessing and
then applied G-ip to select the principal components to be used for classication. G-ip
selected only 76 features. These features were fed to a Support Vector Machine (SVM) with
a RBF kernel and yielded a 12.66% balanced error (the best result on this dataset was a
10.76% balanced error).
In MADELON we did not apply any preprocessing. G-ip selected only 18 out of the 500
features in this dataset. Feeding these features to SVM-RBF resulted in a 7.61% balanced
error, while the best result on this dataset was a 6.22% error.
Although G-ip is a very simple and naive algorithm, it ranks as one of the leading
feature selection methods on both ARCENE and MADELON. It is interesting to note that
when 1-NN is used as the classication rule instead of SVM-RBF the error degrades only by
∼1% on both datasets. However, on the other datasets of the feature selection challenge, we
did not use G-ip either due to its computational requirements or due to poor performance.
Note that we tried only the linear utility function for G-ip and did not try Simba on any
of the challenge datasets, since it was developed after the challenged ended.
3.8 Relation to Learning Vector Quantization

As already mentioned in section 3.1, a possible way to improve the computational complexity
of 1-NN and avoid noise overtting for large high dimensional datasets is by replacing the
training set with a small number of prototypes. The goal of the training stage in this context
is to nd good small set of prototypes, i.e. a set that induces a small generalization error.
The most prominent algorithm for this task is the 20 year old Learning Vector Quantization
(LVQ) algorithm [63]. Several variants of LVQ have been presented in the past, but all of
them share the following common scheme. The algorithm maintains a set of prototypes,
ν = (ν1 , . . . , νk ), each is assigned with a predened label, which is kept constant during the
learning process. It cycles through the training data S m = {(xl , yl )}m

l=1 and in each iteration
modies the set of prototypes in accordance with one instance (xt , yt ). If the prototype νj
has the same label as yt it is attracted to xt but if the label of νj is dierent it is repelled
from it. Hence LVQ updates the closest prototypes to xt according to the rule:
νj ← νj ± αt (xt − νj ) , (3.4)
where the sign is positive if the label of xt and νj agree, and negative otherwise. The
parameter αt is updated using a predened scheme and controls the rate of convergence of
the algorithm. The variants of LVQ dier as to which prototypes they choose to update in
each iteration and as regards the specic scheme used to modify αt .
LVQ was presented as a heuristic algorithm. However, in [24] we show that it emerges
naturally from the margin based evaluation function presented in section 3.4. Up to very
minor dierences LVQ is a stochastic gradient ascent over this evaluation function, and the
dierent variants of LVQ correspond to dierent choices of utility function. In this sense
LVQ is dual to our margin-based selection algorithms. Both algorithms optimize the same
target function, but with respect to dierent parameters. In the selection algorithms the
position of prototypes is xed and the free parameter is the set of selected features, whereas
in LVQ the set of features is xed and the free parameters are the positions of the prototypes.
This new point of view on the veteran LVQ as a maximum margin algorithm enables us to
derive margin-based bounds on the generalization error of LVQ. This bound is similar in its
form to the bound presented in section 3.6. We do not go into detail in this section, as it
not directly related to feature selection and thus goes beyond the scope of this dissertation.
The details can be found in [24]. However, another nice observation is the following: SVM
looks for the linear classier with the maximal minimum margin over the training set. On
the other hand, among all the possible classiers, 1-NN has the maximal minimum margin
over the training set, but it has a very complicated structure. We show that LVQ looks for
the classier with maximal minimum margin for a given number of prototypes k. Recalling
that for k = 2 we get a linear classier, and for k =#instances we get 1-NN, we can consider
LVQ as a family of large margin classiers with a parameter k that controls the complexity
of the hypothesis class.

3.9. Summary and Discussion 65
3.9 Summary and Discussion

In this chapter, a margin-based criterion for measuring the quality of a set of features has
been presented. Using this criterion we derived algorithms that perform feature selection by
searching for the set that maximizes it. We suggested two new methods for maximizing the
margin based-measure, G-ip, which does a naive local search, and Simba, which performs
a gradient ascent. These are just some representatives of the variety of optimization tech-
niques (search methods) which can be used. We have also shown that the well-known Relief
algorithm [61] approximates a gradient ascent algorithm that maximizes this measure. The
nature of the dierent algorithms presented here was demonstrated on various feature selec-
tion tasks. It was shown that our new algorithm Simba, which is a gradient ascent on our
margin based measure, outperforms Relief on all these tasks. One of the main advantages
of the margin based criterion is the high correlation that it exhibits with feature quality.
This was demonstrated in gures 3.3 and 3.4.
The margin based criterion was developed using the 1-Nearest-Neighbor classier but we
expect it to work well for any distance based classier. Additionally to the test we made
with 1-NN, we also tested our algorithms with SVM-RBF classier and showed that they
compete successfully with the state-of-the-art algorithm that was designed specically for
SVM and is much more computationally demanding.
Our main theoretical result in this chapter is a new rigorous bound on the nite sample
generalization error of the 1- Nearest Neighbor algorithm. This bound depends on the margin
obtained following the feature selection.
In the experiments we have conducted, the merits of the new algorithms were demon-
strated. However, our algorithms use the Euclidean norm and assume that it is meaningful
as a measure of similarity in the data. When this assumption fails, our algorithms might not
work. Coping with other similarity measures will be an interesting extension to the work
presented here.
The user of G-ip or Simba should choose a utility function to work with. Here we have
demonstrated three such functions: linear, zero-one and sigmoid utility functions. The linear
utility function gives equal weight to all points and thus might be sensitive to outliers. The
sigmoid utility function suppresses the inuence of such outliers. We have also experimented
with the fact that G-ip with zero-one utility uses fewer features than G-ip with linear
utility, while the sigmoid utility lies in between. It is still an open problem how to adapt
the right utility for the data being studied. Nevertheless, much like the choice of kernel
for SVM, using a validation set it is possible to nd a reasonable candidate. A reasonable
initial value for the parameter β of the sigmoid utility is something on the same order of
magnitude as one over the average distance between training instances. As for the choice
between G-ip and Simba ; G-ip is adequate when the goal is to choose the best feature
subset, without a need to control its precise size. Simba is more adequate when ranking the
features is required.
A Complementary Proofs for Chapter 3

We begin by proving a simple lemma which shows that the class of nearest neighbor classiers
is a subset of the class of 1-Lipschitz functions. Let nnSF (·) be a function such that the sign
of nnSF (x) is the label that the nearest neighbor rule assigns to x, while the magnitude is
the sample-margin, i.e. the distance between x and the decision boundary.
Lemma 3.3 Let F be a set of features and let S be a labeled sample. Then for any x1 , x2 ∈
RN :
¯ S ¯
¯nnF (x1 ) − nnSF (x2 )¯ ≤ kF (x1 ) − F (x2 )k
where F (x) is the projection of x on the features in F .
Proof. Let x1 , x2 ∈ X . We split our argument into two cases. First assume that nnSF (x1 )
and nnSF (x2 ) have the same sign. Let z1 , z2 ∈ R|F | be the points on the decision boundary
of the 1-NN rule which are closest to F (x1 ) and F (x2 ) respectively. From the denition of
z1,2 it follows that nnSF (x1 ) = kF (x1 ) − z1 k and nnSF (x2 ) = kF (x2 ) − z2 k and thus
nnSF (x2 ) ≤ kF (x2 ) − z1 k
≤ kF (x2 ) − F (x1 )k + kF (x1 ) − z1 k
= kF (x2 ) − F (x1 )k + nnSF (x1 ) (3.5)

A. Complementary Proofs for Chapter 3 67
By repeating the above argument while reversing the roles of x1 and x2 we get
nnSF (x1 ) ≤ kF (x2 ) − F (x1 )k + nnSF (x2 ) (3.6)
Combining (3.5) and (3.6) we obtain
¯ S ¯
¯nnF (x2 ) − nnSF (x1 )¯ ≤ kF (x2 ) − F (x1 )k
The second case is when nnSF (x1 ) and nnSF (x2 ) have alternating signs. Since nnSF (·) is
continuous, there is a point z on the line connecting F (x1) and F (x2 ) such that z is on the
decision boundary. Hence,
¯ S ¯
¯nnF (x1 )¯ ≤ kF (x1 ) − zk
¯ S ¯
¯nnF (x2 )¯ ≤ kF (x2 ) − zk
and so we obtain
¯ S ¯ ¯ ¯ ¯ ¯
¯nnF (x2 ) − nnSF (x1 )¯ = ¯nnSF (x2 )¯ + ¯nnSF (x1 )¯
≤ kF (x2 ) − zk + kF (x1 ) − zk
= kF (x2 ) − F (x1 )k
The main tool for proving theorem 3.1 is the following:
Theorem 3.2 [9] Let H be a class of real valued functions. Let S be a sample of size m
generated i.i.d. from a distribution D over X × {±1} then with probability 1 − δ over the
choices of S , every h ∈ H and every γ ∈ (0, 1] let d = fatH (γ/32):
√ ( ( ) ( ))
2 34em 8
erD (h) ≤ er (h) +
ˆ γS d ln log (578m) + ln
m d γδ
We now turn to prove theorem 3.1:
Proof (of theorem 3.1): Let F be a set of features such that |F | = n and let γ > 0. In
order to use theorem 3.2 we need to compute the fat-shattering dimension of the class of
nearest neighbor classication rules which use the set of features F. As we saw in lemma 3.3
this class is a subset of the class of 1-Lipschitz functions on these features. Hence we can
bound the fat-shattering dimension of the class of NN rules by the dimension of Lipschitz
functions.
Since D is supported in a ball of radius R and kxk ≥ kF (x)k, we need to calculate the
fat-shattering dimension of Lipschitz functions acting on points in Rn with norm bounded
by R. The fatγ -dimension of the 1-NN functions on the features F is thus bounded by the
|F |
largest γ packing of a ball in Rn with radius R, which in turn is bounded by (2R/γ) .
Therefore, for a xed set of features F we can apply to theorem 3.2 and use the bound
on the fat-shattering dimension just calculated. Let δF > 0 and we have according to
theorem 3.2 with probability 1 − δF over sample S of size m that for any γ ∈ (0, 1]
erD (nearest-neighbor) ≤ ˆ γS (nearest-neighbor)

er + (3.7)
√ ( ( ) ( ))
2 34em 8
d ln log (578m) + ln
m d γδF
( ( )) ∑
|F | N
for d = (64R/γ) . By choosing δF = δ/ N |F | we have that F ⊆[1...N ] δF =δ and so
we can apply the union bound to (3.7) and obtain the stated result.
Chapter 4
Feature Selection For Regression
and its Application to Neural
Activity1
In this chapter we discuss feature selection for regression (aka function estimation). Once
again we use the Nearest Neighbor algorithm and an evaluation function which is similar
in nature to the one used for classication in chapter 3. This way we develop a non-linear,
simple yet eective feature subset selection method for regression. Our algorithm is able to
capture complex dependency of the target function on its input and makes use of the leave-
one-out error as a natural regularization. We explain the characteristics of our algorithm on
synthetic problems and use it in the context of predicting hand velocity from spikes recorded
in motor cortex of a behaving monkey. By applying feature selection we are able to improve
prediction quality and suggest a novel way of exploring neural data.
The selection paradigm presented in section 1.2.1 of the introduction which involves
an evaluation function and a search method is adequate for regression as well, but the
evaluation function should be suitable for regression; i.e., consider the continuous properties
1 The results presented in this chapter were rst presented in our NIPS05 paper titled Nearest Neighbor
Based Feature Selection for Regression and its Application to Neural Activity [85].
69
70 Chapter 4. Feature Selection For Regression
of the target function. A possible choice of such an evaluation function is the leave-one-
out (LOO) mean square error (MSE) of the k-Nearest-Neighbor (kNN) estimator ([27, 7]).
This evaluation function has the advantage that it both gives a good approximation of
the expected generalization error and can be computed quickly. [79] used this criterion on
small synthetic problems (up to 12 features). They searched for good subsets using forward
selection, backward elimination and an algorithm (called schemata ) that races feature sets
against each other (eliminating poor sets, keeping the ttest) in order to nd a subset with
a good score. All these algorithms perform a local search by ipping one or more features
at a time. Since the space is discrete the direction of improvement is found by trial and
error, which slows the search and makes it impractical for large scale real world problems
involving many features.
We extend the LOO-kNN-MSE evaluation function to assign scores to weight vectors

over the features, instead of just to feature subsets. This results in a smooth (almost
everywhere) function over a continuous domain, which allows us to compute the gradient
analytically and to employ a stochastic gradient ascent to nd a locally optimal weight
vector. The resulting weights provide a ranking of the features, which we can then threshold
in order to produce a subset. In this way we can apply an easy-to-compute, gradient directed
search, without relearning a regression model at each step but while still employing a strong
non-linear function estimate (kNN) that can capture complex dependency of the function
on its features.
Our original motivation for developing this method was to address a major computa-
tional neuroscience question: which features of the neural code are relevant to the observed
behavior? This is an important key to the interpretability of neural activity. Feature se-
lection is a promising tool for this task. Here, we apply our feature selection method to
the task of reconstructing hand movements from neural activity, which is one of the main
challenges in implementing brain computer interfaces [109]. We look at neural population
spike counts, recorded in motor cortex of a monkey while it performed hand movements and
locate the most informative subset of neural features. We show that it is possible to improve
prediction results by wisely selecting a subset of cortical units and their time lags relative to
the movement. Our algorithm, which considers feature subsets, outperforms methods that
4.1. Preliminaries 71
consider features on an individual basis, suggesting that complex dependency on a set of
features exists in the code.
4.1 Preliminaries
Let g (x), g : RN −→ R be a function that we wish to estimate. Given a set S ⊂ RN ,
the empiric mean square error (MSE) of an estimator ĝ for g is dened as M SES (ĝ) =
∑ 2
1
|S| x∈S (g (x) − ĝ (x)) .
kNN Regression As already explained in the previous chapter, k-Nearest-Neighbor (kNN)

is a simple, intuitive and ecient way to estimate the value of an unknown function in a given
point using its values in other (training) points. In section 3.1 we described the the classi-
cation version of kNN. Here we use the kNN for a regression problem, so instead of using
majority over the labels of the k nearest neighbors, we take the average of the function value.
Formally, Let S = {x1 , . . . , xm } be a set of training points. The kNN estimator is dened as
∑
the mean function value of the nearest neighbors: ĝ(x) = 1
k x0 ∈[email protected] (x) ĝ(x0 )
where [email protected] (x) ⊂ S is the set of k nearest points to x in S and k is a
parameter([27, 7]). A softer version takes a weighted average, where the weight of each
neighbor is proportional to its proximity. One specic way of doing this is
1 ∑ 0
ĝ(x) = g(x0 )e−d(x,x )/β (4.1)
Z
x0 ∈[email protected] (x)
∑ 0
d (x, x0 ) = kx − x0 k2 e−d(x,x )/β
2
where is the `2 norm, Z = x0 ∈[email protected] (x) is a
normalization factor and β is a parameter. The soft kNN version will be used in the remain-
der of this paper. This regression method is a special form of locally weighted regression (See
[7] for an overview of the literature on this subject). It has the desirable property that no
learning (other than storage of the training set) is required for the regression. Also note that
the Gaussian Radial Basis Function has the form of a kernel ([116]) and can be replaced
with any operator on two data points that decays as a function of the dierence between
them (e.g. kernel induced distances). As will be seen in the next section, we use the MSE
of a modied kNN regressor to guide the search for a set of features F ⊂ {1, . . . n} that
achieves a low MSE. However, the MSE and the Gaussian kernel can be replaced by other
loss measures and kernels (respectively) as long as they are dierentiable almost everywhere.
4.2 The Feature Selection Algorithm

In this section we present our selection algorithm called RGS (Regression, Gradient guided,
feature Selection). It can be seen as a lter method for general regression algorithms or as
a wrapper for estimation by the kNN algorithm.
Our goal is to nd subsets of features that induce a small estimation error. As in most
supervised learning problems, we wish to nd subsets that induce a small generalization
error, but since it is not known, we use an evaluation function on the training set. This
evaluation function is dened not only for subsets but for any weight vector over the features.
This is more general because a feature subset can be represented by a binary weight vector
that assigns a value of one to features in the set and zero to the rest of the features.
For a given weights vector over the features w ∈ Rn , we consider the weighted squared `2
2 ∑
norm induced by w, dened as kzkw = 2 2
i zi wi and use the k nearest neighbors according
to the distance induced by this norm. We use Nw (x) to denote the set of these k neighbors.
Given a training set S, we denote by ĝw (x) the value assigned to x by a weighted kNN
estimator, dened in equation 4.1, using the weighted squared `2 -norm as the distances
d(x, x0 ) and the nearest neighbors, Nw (x), are found among the points of S excluding x:
1 ∑ ∑
(xi −x0i )
2
ĝw (x) = g(x0 )e− i wi2 /β
, Nw (x) ⊂ S \ x
Z
x0 ∈Nw (x)
The evaluation function is dened as the negative MSE of the weighted kNN estimator:
1∑ 2
e(w) = − (g(x) − ĝw (x)) . (4.2)
2
x∈S
This evaluation function scores weight vectors (w). A change of weights will cause a
change in the distances and, possibly, the identity of each point's nearest neighbors, which
will change the function estimates. A weight vector that induces a distance measure in
which neighbors have similar labels receives a high score. Note that there is no explicit
regularization term in e(w). This is justied by the fact that for each point, the estimate of
4.2. The Feature Selection Algorithm 73
Algorithm 4 RGS (S, k, β, T )

1. initialize w = (1, 1, . . . , 1)
2. for t = 1...T

(b) evaluate the gradient of e(w) on x:
∇e(w) = − (g(x) − ĝw (x)) ∇w ĝw (x)

∑
− β4 x00 ,x0 ∈Nw (x) g(x00 )a(x0 , x00 ) u(x0 , x00 )
∇w ĝw (x) = ∑ 0 00
x00 ,x0 ∈Nw (x) a(x , x )
0 2 00 2
where a(x0 , x00 ) = e−(||x−x ||w +||x−x ||w )/β [ ]
and u(x0 , x00 ) ∈ RN is a vector with ui = wi (xi − x0i )2 + (xi − x00i )2 .
(c) w = w + ηt ∇e(w) = w (1 + ηt ∇w ĝw (x)) where ηt is a decay factor.
its function value does not include that point as part of the training set. Thus, equation 4.2
is a leave-one-out cross validation error. Clearly, it is impossible to go over all the weight
vectors (or even over all the feature subsets), and therefore some search technique is required.
Our method nds a weight vector w that locally maximizes e(w) as dened in (4.2) and
then uses a threshold in order to obtain a feature subset. The threshold can be set either
by cross validation or by nding a natural cuto in the weight values. However, we later
show that using the distance measure induced by w in the regression stage compensates for
taking too many features. Since e(w) is dened over a continuous domain and is smooth
almost everywhere we can use gradient ascent in order to maximize it. RGS (algorithm 4)
is a stochastic gradient ascent over e(w). In each step the gradient is evaluated using one
sample point and is added to the current weight vector. RGS considers the weights of all
the features at the same time and thus it can handle dependency on a group of features.
This is demonstrated in section 4.3. In this respect, it is superior to selection algorithms
that score each feature independently. It is also faster than methods that try to nd a good
subset directly by trial and error. Note, however, that convergence to a global optimum is
not guaranteed, and standard techniques to avoid local optima can be used.
The parameters of the algorithm are k (number of neighbors), β (Gaussian decay factor),
T (number of iterations) and {ηt }Tt=1 (step size decay scheme). The value of k can be tuned
by cross validation, however a proper choice of β can compensate for a k that is too large.
It makes sense to tune β to a value that places most neighbors in an active zone of the
Gaussian. In our experiments, we set β to half of the mean distance between points and their
k neighbors. It usually makes sense to use ηt that decays over time to ensure convergence,
however, on our data, convergence was also achieved with ηt = 1.
The computational complexity of RGS is Θ(T N m) where T is the number of iterations,
N is the number of features and m is the size of the training set S. This is correct for a
naive implementation which nds the nearest neighbors and their distances from scratch at
each step by measuring the distances between the current point to all the other points. RGS
is basically an on-line method which can be used in batch mode by running it in epochs
on the training set. When it is run for only one epoch, T = m and the complexity is
( )
Θ m2 N . Matlab code for this algorithm (and those that we compare it with) is available
at www.cs.huji.ac.il/labs/learning/code/fsr/
4.3 Testing on synthetic data

The use of synthetic data, where we can control the importance of each feature, allows us
to illustrate the properties of our algorithm. We compare our algorithm to other common
selection methods: infoGain [93], correlation coecients ( corrcoef ) and forward selection
(see [43]). infoGain and corrcoef simply rank features according to the mutual information
2
or the correlation coecient (respectively) between each feature and the labels (i.e. the
target function value). Forward selection ( fwdSel ) is a greedy method in which features
are iteratively added into a growing subset. In each step, the feature showing the greatest
improvement (given the previously selected subset) is added. This is a search method that
can be applied to any evaluation function and we use our criterion (equation 4.2 on feature
subsets). This well known method has the advantages of considering feature subsets and
that it can be used with non linear predictors. Another algorithm we compare with scores
each feature independently using our evaluation function (4.2). This helps us in analyzing
RGS, as it may help single out the respective contributions to performance of the properties
2 Feature and function values were binarized by comparing them to the median value.
4.3. Testing on synthetic data 75
1 1
0.5 0
1 1
0 −1
1 1 1 1
0.5 0.5 0.5 0.5
0 0 0 0 0 0
(a) (b)
2 1 −1 −1
1
0
0 1 1
1 1
−1 0.5 0.5 0.5 0.5
1 1 1 1
0.5 0.5 0.5 0.5 0 0 0 0
0 0
(c)
0 0
(d) (e) (f)
Figure 4.1: (a)-(d): Illustration of the four synthetic target functions. The plots shows the
function's value as function of the rst two features. (e),(f ): demonstration of the eect of
feature selection on estimating the second function using kNN regression (k = 5, β = 0.05).
(e) using both features (mse = 0.03), (f ) using the relevant feature only (mse = 0.004)
of the evaluation function and the search method. We refer to this algorithm as SKS (Single
feature, kNN regression, feature Selection) .

We look at four dierent target functions over R50 . The training sets include 20 to 100
points that were chosen randomly from the [−1, 1]50 cube. The target functions are given
in the top row of gure 4.2 and are illustrated in gure 4.1(a-d). A random Gaussian noise
with zero mean and a variance of 1/7 was added to the function value of the training points.
Clearly, only the rst feature is relevant for the rst two target functions, and only the rst
two features are relevant for the last two target functions. Note also that the last function
is a smoothed version of parity function learning and is considered hard for many feature
selection algorithms [43].
First, to illustrate the importance of feature selection on regression quality we use kNN to
estimate the second target function. Figure 4.1(e-f ) shows the regression results for target
(b), using either only the relevant feature or both the relevant and an irrelevant feature.
The addition of one irrelevant feature degrades the MSE ten fold. Next, to demonstrate
the capabilities of the various algorithms, we run them on each of the above problems with
varying training set size. We measure their success by counting the number of times that
the relevant features were assigned the highest rank (repeating the experiment 250 times by
re-sampling the training set). Figure 4.2 presents the success rate as function of training set
size.
We can see that all the algorithms succeeded on the rst function which is monotonic
and depends on one feature alone. infoGain and corrcoef fail on the second, non-monotonic
(a) x2
1 (b) sin(2πx1 + π/2) (c) sin(2πx1 + π/2) + x2 (d) sin(2πx1 ) sin(2πx2 )
100 100 100 100 corrcoef

success rate
80 80 80 infoGain
80 SKS
60 60 60 60 fwdSel
RGS
40 40 40 40
20 20 20 20
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 20 40 60 80 100

# examples # examples # examples # examples
Figure 4.2: Success rate of the dierent algorithms on 4 synthetic regression tasks
(averaged over 250 repetitions) as a function of the number of training examples. Success
is measured by the percent of the repetitions in which the relevant feature(s) received rst
place(s).
function. The three kNN based algorithms succeed because they only depend on local
properties of the target function. We see, however, that RGS needs a larger training set
to achieve a high success rate. The third target function depends on two features but the
dependency is simple as each of them alone is highly correlated with the function value. The
fourth, XOR-like function exhibits a complicated dependency that requires consideration of
the two relevant features simultaneously. SKS which considers features separately sees the
eect of all other features as noise and, therefore, has only marginal success on the third
function and fails on the fourth altogether. RGS and fwdSel apply dierent search methods.
fwdSel considers subsets but can evaluate only one additional feature in each step, giving it
some advantage over RGS on the third function but causing it to fail on the fourth. RGS
takes a step in all features simultaneously. Only such an approach can succeed on the fourth
function.
4.4 Hand Movement Reconstruction from Neural Activ-

ity
To suggest an interpretation of neural coding we apply RGS and compare it with the alter-
natives presented in the previous section

3 on the hand movement reconstruction task. The
data sets were collected while a monkey performed a planar center-out reaching task with
one or both hands [88]. 16 electrodes, inserted daily into novel positions in primary motor
3 fwdSel was not applied due to its intractably high run time complexity. Note that its run time is at
least r times that of RGS where r is the size of the optimal set and is longer in practice.
4.4. Hand Movement Reconstruction from Neural Activity 77
0.74 0.09 0.10 0.77 RGS

SKS
infoGain
corrcoef
0.63 0.06 0.08 0.27

7 200 400 600 7 200 400 600 7 200 400 600 7 200 400 600
Figure 4.3: MSE results for the dierent feature selection methods on the neural activity
data sets. Each sub gure is a dierent recording day. MSEs are presented as a function of
the number of features used. Each point is a mean over all 5 cross validation folds, 5
permutations on the data and the two velocity component targets. Note that some of the
data sets are harder than others.
cortex were used to detect and sort spikes in up to 64 channels (4 per electrode). Most of
the channels detected isolated neuronal spikes by template matching. Some, however, had
templates that were not tuned, producing spikes during only a fraction of the session. Others
(about 25%) contained unused templates (resulting in a constant zero producing channel or,
possibly, a few random spikes). The rest of the channels (one per electrode) produced spikes
by threshold passing. We construct a labeled regression data set as follows. Each example
corresponds to one time point in a trial. It consists of the spike counts that occurred in the
10 previous consecutive 100ms long time bins from all 64 channels (64 × 10 = 640 features)
and the label is the X or Y component of the instantaneous hand velocity. We analyze data
collected over 8 days. Each data set has an average of 5050 examples collected during the
movement periods of the successful trials.
In order to evaluate the dierent feature selection methods we separate the data into
training and test sets. Each selection method is used to produce a ranking of the features.
We then apply kNN (based on the training set) using dierent size groups of top ranking
features to the test set. We use the resulting MSE (or correlation coecient between true
and estimated movement) as our measure of quality. To test the signicance of the results
we apply 5-fold cross validation and repeat the process 5 times on dierent permutations
of the trial ordering. Figure 4.3 shows the average (over permutations, folds and velocity
components) MSE as a function of the number of selected features on four of the dierent
data sets (results on the rest are similar and omitted due to lack of space) .
4 It is clear
that RGS achieves better results than the other methods throughout the range of feature
numbers.
4 We use k = 50 (approximately 1% of the data points). β is set automatically as described in section 4.2.
These parameters were manually tuned for good kNN results and were not optimized for any of the feature
selection algorithms. The number of epochs for RGS was set to 1 (i.e. T = m).
To test whether the performance of RGS was consistently better than the other methods
we counted winning percentages (the percent of the times in which RGS achieved lower
MSE than another algorithm) in all folds of all data sets and as a function of the number of
features used. Figure 4.4 shows the winning percentages of RGS versus the other methods.
For a very low number of features, while the error is still high, RGS winning scores are
only slightly better than chance but once there are enough features for good predictions the
winning percentages are higher than 90%. In gure 4.3 we see that the MSE achieved when
using only approximately 100 features selected by RGS is better than when using all the
features. This dierence is indeed statistically signicant (win score of 92%). If the MSE is
replaced by correlation coecient as the measure of quality, the average results (not shown
due to lack of space) are qualitatively unchanged.
RGS not only ranks the features but also gives them weights that achieve locally optimal
results when using kNN regression. It therefore makes sense not only to select the features
but to weigh them accordingly. Figure 4.5 shows the winning percentages of RGS using
the weighted features versus RGS using uniformly weighted features. The corresponding
MSEs (with and without weights) on the rst data set are also displayed. It is clear that
using the weights improves the results in a manner that becomes increasingly signicant as
the number of features grows, especially when the number of features is greater than the
optimal number. Thus, using weighted features can compensate for choosing too many by
diminishing the eect of the surplus features.
To take a closer look at what features are selected, gure 4.6 shows the 100 highest
ranking features for all algorithms on one data set. Similar selection results were obtained
in the rest of the folds. One would expect to nd that well isolated cells (template matching)
are more informative than threshold based spikes. Indeed, all the algorithms select isolated
cells more frequently within the top 100 RGS

features ( does so in 95% of the time and the
rest in 70%-80%). A human selection of channels, based only on looking at raster plots
and selecting channels with stable ring rates was also available to us. This selection was
independent of the template/threshold categorization. Once again, the algorithms selected
the humanly preferred channels more frequently than the other channels. Another and more
interesting observation that can also be seen in the gure is that while corrcoef, SKS and
4.5. Summary 79
RGS
SKS
corrCoef
infoGain
Figure 4.6: 100 highest ranking features (grayed out) selected by the algorithms. Results
are for one fold of one data set. In each sub gure the bottom row is the (100ms) time bin
with least delay and the higher rows correspond to longer delays. Each column is a channel
(silent channels omitted).
infoGain tend to select all time lags of a channel, RGS 's selections are more scattered (more
channels and only a few time bins per channel). Since RGS achieves the best results, we
100 100 0.8
90 90
winning percentage
winning percentage
80 RGS vs SKS 80 winning percentages
MSE
RGS vs infoGain uniform weights
70 RGS vs corrcoef 70 non−uniform weights
60 60
50 50 0.6
0 100 200 300 400 500 600 0 100 200 300 400 500 600
number of features
number of features
Figure 4.4: Winning percentages of RGS Figure 4.5: Winning percentages of RGS
over the other algorithms. RGS achieves with and without weighting of features
better MSEs consistently. (black). Gray lines are corresponding MSEs

of these methods on the rst data set.
conclude that this selection pattern is useful. Apparently RGS found these patterns
thanks to its ability to evaluate complex dependency on feature subsets. This suggests that
such dependency of the behavior on the neural activity does exist.
4.5 Summary
In this chapter we presented a new method of selecting features for function estimation and
use it to analyze neural activity during a motor control task . We used the leave-one-out
mean squared error of the kNN estimator and minimize it using a gradient ascent on an
almost smooth function. This yields a selection method which can handle a complicated
dependency of the target function on groups of features yet can be applied to large scale
problems. This is valuable since many common selection methods lack one of these prop-
erties. By comparing the result of our method to other selection methods on the motor
control task, we showed that consideration of complex dependency helps to achieve better
performance. These results suggest that this is an important property of the code.
Chapter 5
Learning to Select Features1
In the standard framework of feature selection discussed in previous chapters, the task is
to nd a (small) subset of features out of a given set of features that is sucient to predict
the target labels well . Thus, the feature selection methods discussed so far of tell us which
features are better. However, they do not tell us what characterizes these features or how
to judge new features which were not evaluated using the labeled data. In this chapter
we extend the standard framework and present a novel approach to the task of feature
selection that attempts to learn properties of good features. We claim that in many cases
it is natural to represent each feature by a set of properties, which we call meta-features.

As a simple example, in image related tasks where the features are gray-levels of pixels,
two meta-features can be the (x, y) position of each pixel. We use the training set in order
to learn the relation between the meta-feature values and feature usefulness. This in turn
enables us to predict the quality of unseen features. This may be useful in many applications,
e.g., for predicting the future usefulness of a new word (that did not appear in the training
set) for text classication. Another possible application is for predicting the worthiness
of a candidate medical examination based on the properties of this examination and the
properties and the worthiness of previous examinations. It is also a very useful tool for
feature extraction. As described in section 1.1.2, in feature extraction the original features
1 The results presented in this chapter were rst presented in a paper titled Learning to Select Features
using their Properties submitted to JMLR at 29 August, 2006.
81
82 Chapter 5. Learning to Select Features
are used to generate new more complex features. One example for such extracted features
in image related tasks are all the products of sets of 3 pixels. Any function of the original
features can be used as an extracted feature, and there are a virtually innite number of
such functions. Thus the learner has to decide which potential complex features have a good
chance of being the most useful. For this task we derive a selection algorithm (called Mufasa )
that uses meta-features to explore a huge number of candidate features eciently. We also
derive generalization bounds for the joint problem of feature selection (or extraction) and
classication when the selection is made using meta-features. These bounds are better than
the bounds obtained for direct selection.
We also show how our concept can be applied in the context of inductive transfer ([10,
110, 16]). As described in section 1.1.5, in inductive transfer one tries to use knowledge
acquired in previous tasks to enhance performance on the current task. The rationale lies
in the observation that a human does not learn isolated tasks, but rather learns many tasks
in parallel or sequentially. In this context, one key question is the kind of knowledge we
can transfer between tasks. One option, which is very popular, is to share the knowledge
about the representation, or more specically, knowledge about feature usefulness. Here we
suggest and show that it might be better to share knowledge about the properties of good
features instead of knowledge about which features are good. The problem of handwritten
digit recognition is used throughout this chapter to illustrate the various applications of our
novel approach.
Related work: [108] used meta-features of words for text classication when there are
features (words) that are unseen in the training set, but appear in the test set. In their work
the features are words and the meta-features are words in the neighborhood of that word.
They used the meta-features to predict the role of words that are unseen in the training
set. Generalization from observed (training) features to unobserved features is discussed in
[66]. Their approach involves clustering the instances based on the observed features. What
these works and ours have in common is that they all extend the learning from the standard
instance-label framework to learning in the feature space. Our formulation here, however, is
dierent and allows a mapping of the feature learning problem onto the standard supervised
5.1. Formal Framework 83
learning framework (see Table 5.1). Another related model is Budget Learning ([75, 42]),
that explores the issue of deciding which is the most valuable feature to measure next under
a limited budget. Other ideas using feature properties to produce or select good features can
be found in the literature and have been used in various applications. For instance [71] used
this rationale in the context of inductive transfer for object recognition. Very recently, [95]
also used this approach in the same context for text classication. They use a property of
pairs of words which indicates whether they are synonyms or not for the task of estimating
the words' covariance matrix. [58] used property-based clustering of features for handwritten
Chinese recognition and other applications. Our formulation encompasses a more general
framework and suggests a systematic way to use the properties as well as derive algorithms
and generalization bounds for the combined process of feature selection and classication.
5.1 Formal Framework

Recall that in the standard supervised learning framework (see section 1.1.1) it is assumed
that each instance is a vector x ∈ RN and the N coordinates are the features. However, we
can also consider the instances as abstract entities in the space S and think of the features
as measurements on the instances. Thus each feature f can be considered as a function from
S to R, i.e., f : S → R. We denote the set of all the features by {fj }N

j=1 . In the following
we use the term feature to describe both raw input variables (e.g., pixels in an image) and
variables constructed from the original input variables using some function (e.g., product of
3 pixels in the image).
Also, recall that the standard task of feature selection is to select a subset of the given
N features that enables good prediction of the label (see section 1.2). As described in the
previous chapters, this is done by looking for features which are more useful than the others.
However, as already mentioned, in this chapter we want to extend this framework and to
nd what characterizes the better features. Thus we further assume that each feature is
described by a set of properties u (·) = {ur (·)}kr=1 which we call meta-features. Formally,
each ur (·) is a function from the space of possible measurements to R. Thus each feature f
is described by a vector u(f ) = (u1 (f ) , . . . , uk (f )) ∈ Rk . Note that the meta-features are

Table 5.1: Feature learning by meta-features as a standard supervised learning
Training set Features described by meta-features

Test set Unobserved features
Labels Feature quality
Hypothesis class Class of mappings from meta-features to quality
Generalization in feature selection Predicting the quality of new features
Generalization in the joint problem Low classication error
not dependent on the instances, a fact which is important for understanding our analysis in
what follows. We also denote a general point in the image of u (·) by u. The new framework
we introduce in this chapter forces us to introduce some not very standard notations. In
order to make it easier for the reader to become familiarized with this notation we provide
a summary table in the appendix A at the end of this chapter.
5.2 Predicting the Quality of Features

Before we describe the more useful applications, we start by describing the most obvious use
of the framework presented in this chapter. Here we assume that we observe only a subset
of the N features; i.e., that in the training set we see only the value of some of the features.
We can directly measure the quality (i.e. usefulness) of these features using the training set,
but we also want to be able to predict the quality of the unseen features. Thus we want to
think of the training set not only as a training set of instances, but also as a training set
of features.
Thus we want to predict the quality of a new unseen feature. At this stage we evaluate
the quality of each feature alone .

2 More formally, our goal is to use the training set Sm
and the set of meta-features for learning a mapping Q̂ : Rk −→ R that predicts the quality
of a feature using the values of its meta-features. For this we assume that we have a
way to measure the quality of each feature that does appear in the training set. This
measure can be any kind of standard evaluation function that uses the labeled training set
to evaluate features (e.g., Infogain or wrapper based). YM F denotes the vector of measured
2 We discuss the evaluation of subsets in sections 5.3 and 5.6.

5.2. Predicting the Quality of Features 85
Algorithm 5 Q̂ =quality_map(S m , featquality, regalg)

1. measure the feature quality vector: YM F = featquality (S m )
2. calculate the N ×k meta features matrix XM F
3. use the regression alg. to learn a mapping from meta feature value to quality: Q̂ =
regalg (XM F , YM F )
qualities, i.e. YM F (j) is the measured quality of the j 's feature in the training set. Now
we have a new supervised learning problem, with the original features as instances,
the meta-features as features and YM F as the (continuous) target label. The analogy
to the standard supervised problem is summarized in Table 5.1. Thus we can use any
standard regression learning algorithm to nd the required mapping from meta-features to
quality. The above procedure is summarized in algorithm 5. Note that this procedure uses
a standard regression learning procedure. That is, the generalization ability to new features
can be derived using standard generalization bounds for regression learning.
We now present a toy illustration of the ability to predict the quality of unseen features
by the above procedure using a handwritten digit recognition task (OCR). For this purpose,
we used the MNIST ([70]) dataset which contains images of 28 × 28 pixels of centered digits
(0 . . . 9). We converted the pixels from gray-scale to binary by thresholding. Here we use
the 784 original pixels as the input features and the (x, y) location of the pixel as meta-
features. Recall that we do not try to generalize along the instance dimension, but instead
try to generalize along the feature dimension, using the meta-features. Thus we rst choose
a xed set of 2000 images from the dataset. Then we put aside a random subset of the 392
features as a test set of features. Then we use a growing number of training features, which
were chosen randomly out of the remaining 392 features. Our measure of quality here is
Infogain. Now, given a training set of features we use a linear regression

3 to learn a mapping
from meta-features to quality. Using this mapping we predict the quality of the 392 testing
features.
We check the accuracy of the prediction by comparing it to the quality that was measured
3 The linear regression is done over an RBF representation of the (x, y) location. We used49 Gaussians
(with std=3 pixels) located on a grid over the image and represent the location by a 49-dimensional vector
of the responses of the Gaussians in this location.
0.9
Corr. coef. (ρ)
0.8
0.7
0.6
50 100 150 200 250 300 350 400
Number of features used for training
Figure 5.1: The correlation coecient between the predicted quality and the real quality of
testing features as a function of the number of training features. Error bars show the
standard deviation.
by directly applying Infogain on the testing features. We do the comparison in two dierent
ways: (1) by correlation coecient between the two. (2) By checking the accuracy of the
1-Nearest-Neighbor (1-NN) classier on a test set of 2000 instances, using a growing number
of selected features out of the 392 testing features. The features were sorted by either the
predicted quality or the direct Infogain. In order to check the statistical signicance, we
repeat the experiment 20 times by re-splitting into train and test sets. The results are
presented in Figure 5.1 and Figure 5.2 respectively. It is clear from the graphs that if we
have enough training features, we can predict new feature qualities with high accuracy. In
Figure 5.2 we see that the classication error of using the predicted quality is similar to the
error obtained by direct use of Infogain. This occurs even though our prediction was done
using only ∼1/8 of the features.
5.3 Guided Feature Extraction

Here we show how the low dimensional representation of features by a relatively small number
of meta-features enables ecient selection even when the number of potential features is very
large or even innite. This is highly relevant to the feature extraction scenario. We tackle
the problem of choosing which features to generate out of the huge number of potential
features in the following way. Instead of evaluating all features, we evaluate a relatively
5.3. Guided Feature Extraction 87
0
10
Regression
Direct Measure
Test error
Rand
0 50 100 150 200 250 300 350 400

Number of features used for classification
Figure 5.2: The classication error of 1-NN on a testing set of instances as a function of
the number of the top features ranked by either prediction using 100 training features or
by direct Infogain on test features. The results of random ranking is also presented. Error
bars show the standard deviation.
small number of features and learn to predict which other features are better. This way we
can guide our search for good features.
5.3.1 Meta-features Based Search
Assume that we want to select (or extract) a set of n features out of the large number of N
potential features. We dene a stochastic mapping from values of meta-features to selection
(or extraction) of features. More formally, let V be a random variable that indicates which
feature is selected. We assume that each point u in the meta-feature space induces density
p (v|u) over the features. Our goal is to nd a point u in the meta-feature space such that
drawing n features (independently) according to p (v|u) has a high probability of giving us
a good set of n features. For this purpose we suggest the Mufasa (for Meta-Features Aided
Search Algorithm) (algorithm 6) which uses stochastic local search in the meta-feature space.
Note that Mufasa does not use explicit prediction of the quality of unseen features as we
did in section 5.2, but it is clear that it cannot work unless the meta-features are informative
on the quality. Namely, Mufasa can only work if the chance of drawing a good set of features
from p (v|u) is some continuous function of u, i.e., a small change in u results in a small
change in the chance of drawing a good set of features. If, in addition, the meta-features
Algorithm 6 Fbest =Mufasa(n, J)

1. Initialization: qbest = maxreal, u ← u0 , where u0 is the initial guess of u.
2. For j = 1...J
(a) Select (or generate) a new set Fj of n random features according to p (w|u).
(b) qj = quality (Fj ). (Any measure of quality, e.g. cross-validation classication
accuracy)
(c) If qj ≥ qbest
Fbest = Fj , ubest = u, qbest = qj
(d) Randomly select new u which is near ubest . For example: u ← ubest + noise.
3. return Fbest
space is simple
4 we expect it to nd a good point in a small number of steps. The number
of steps plays an important role in the generalization bounds we present in section 5.4.1. Like
any local search over a non-convex target function, a convergence to a global optimum is not
guaranteed, but any standard technique to avoid local maxima can be used. In section 5.3.2
we demonstrate the ability of Mufasa to eciently select good features in the presence of a
huge number of candidate (extracted) features on the handwritten digit recognition problem.
In section 5.4.1 we present a theoretical analysis of Mufasa.
5.3.2 Illustration on Digit Recognition Task

In this section we revisit the handwritten digit recognition problem described in section 5.2
and use it to demonstrate how Mufasa works for feature extraction. Once again we use the
MNIST dataset, but this time we deal with more complicated extracted features. These
extracted features are in the following form: logical AND of 1 to 8 pixels (or their negation),
which are referred to as inputs. In addition, a feature can be shift invariant for shifts of up
to shiftInvLen pixels. That means we take a logical OR of the output of all the dierent
locations we can get by shifting.
Thus a feature is dened by specifying the set of inputs, which of the inputs are negated,
and the value of shiftInvLen. Similar features were already used by [30] on the MNIST
dataset, but with a xed number of inputs and without shift invariance. The idea of using
4 We elaborate on the meaning of simple in sections 5.4 and 5.6.

shift invariance for digit recognition is also not new, and was used for example by [104]. It
is clear that there are a huge number of such features; thus we have no practical way to
measure or use all of them. Therefore we need some guidance for the extraction process,
and here is the point where the meta-features framework comes in.
We use the following four meta-features:
1. numInputs : the number of inputs (1-8).
2. precentPos : percent of logic positive pixels (0-100, rounded).
3. shiftInvLen : maximum allowed shift value (1-8).
4. scatter : average distance of the inputs from their center of gravity (COG) (1-3.5).
In order to nd a good value for meta-features we use Mufasa (algorithm 6), with dierent
values of allowed budget. We use a budget (and not just the number of features) since
features with large shiftInvLen are more computationally expensive. We dened the cost of
( )
measuring (calculating) a feature as 0.5 1 + a2 , where a is the shiftInvLen of the feature;
this way the cost is proportional to the number of locations we measure the feature. We
use 2000 images as a training set, and the number of steps, J, is 50. We choose specic
features, given a value of meta-features, by re-drawing features randomly from a uniform
distribution over the features that satisfy the given value of the meta-features until the
full allowed budget is used up. We use 2-fold cross validation of the linear Support Vector
Machine (SVM) ([14]) to check the quality of the set of selected features in each step. We
use the multi-class SVM toolbox developed by [23]. Finally, for each value of allowed budget
we check the results obtained by the linear SVM (that uses the selected features) on a test
set of another 2000 images using the selected features.
We compare the results with those obtained using the features selected by Infogain as
follows. We rst draw features randomly using a budget which is 50 times larger, then
we sort them by Infogain normalized by the cost

5 and take a prex that uses the allowed
budget (referred as MF+Norm Infogain). As a sanity check, we also compare the results
to those obtained by doing 50 steps of choosing features of the allowed budget randomly,
i.e. over all possible values of the meta-features. Then we use the set with the lowest
2-fold cross-validation error (referred as MF+Rand). We also compare our results with
5 Infogain without normalization gives worse results.

Mufasa
MF + Rand
MF + Norm. Infogain
Test Error
Polynomial SVM
−1
10
0.5 1 1.5 2 2.5

Total budget 4
x 10
Figure 5.3: Guided feature extraction for digit recognition. The generalization error
rate as a function of the available budget for features, using dierent selection methods.
The number of training instances is 2000. Error bars show a one standard deviation
condence interval. SVM is not limited by the budget, and always implicitly uses all the
products of features. We only present the results of SVM with a polynomial kernel of
degree 2, the value that gave the best results in this case.
SVM with a polynomial kernel of degree 1-4, that uses the original pixels as input features.
This comparison is relevant since SVM with a polynomial kernel of degree k implicitly uses
ALL the products of up to k pixels, and the product is equal to AND for binary pixels.
To evaluate the statistical signicance, we repeat each experiment 10 times, with dierent
partitions into train and test sets. The results are presented in Figure 5.3. It is clear that
Mufasa outperforms the budget-dependent alternatives, and outperforms SVM for budgets
larger than 3000 (i.e. about 600 features). It is worth mentioning that our goal here is not to
compete with the state-of-art results on MNIST, but to illustrate our concept and to compare
the results of the same kind of classier with and without using our meta-features guided
search. Note that our concept can be combined with most kinds of classication, feature
selection, and feature extraction algorithms to improve them as discussed in section 5.8.
5 5
Inputs # Shift inv
4 4
3 3
Y−axis: Optimal value
2 2
1 1
2 3 4 2 3 4
(a) (b)
100 4
% Positive Scatter
3
50
2
0 1
2 3 4 2 3 4
(c) (d)
X−axis: Log10(total budget)
Figure 5.4: Optimal value of the dierent meta features as a function of the
budget. Error bars indicate the range where values fall in 90% of the runs. We can see
that the size of the optimal shift invariance, the optimal number input and optimal percent
of positive inputs grow with the budget.
Another benet of the meta-features guided search is that it helps understand the prob-
lem. To see this we need to take a closer look at the chosen values of the meta-features
(ubest ) as a function of the available budget. Figure 5.4 presents the average chosen value
of each meta-feature as a function of the budget. We can see (in Figure 5.4b) that when
the budget is very limited, it is better to take more cheap features rather than fewer more
expensive shift invariant features. On the other hand, when we increase the budget, adding
these expensive complex features is worth it. We can also see that when the budget grows,
the optimal number of inputs grows as does the optimal percent of positive inputs . This oc-
curs because for a small budget, we prefer features that are less specic, and have relatively
high entropy, at the expense of in class variance. For a large budget, we can permit our-
selves to use sparse features (low probability of being 1), but gain specicity. For the scatter
meta-features, there is apparently no correlation between the budget and the optimal value.
The vertical lines (error bars) represent the range of selected values in the dierent runs. It
gives us a sense of the importance of each meta-feature. A smaller error bar indicates higher
sensitivity of the classier performance to the value of the meta-feature. For example, we
can see that performance is sensitive to shiftInvLen and relatively indierent to percentPos.
5.4 Theoretical Analysis

In this section we derive generalization bounds for the combined process of selection and
classication, when the selection process is based on meta-features. We show that in some
cases, these bounds are far better than the bounds that assume each feature can be selected
directly. This is because we can signicantly narrow the number of possible selections, and
still nd a good set of features. In section 5.4.1 we perform the analysis for the case where
the selection is made using Mufasa (algorithm 6). In section 5.4.2 we present a more general
analysis, which is independent of the selection algorithm, but instead assumes that we have
a given class of mappings from meta-features to a selection decision.
5.4.1 Generalization Bounds for Mufasa Algorithm

The bounds presented in this section assume that the selection is made using Mufasa (al-
gorithm 6), but they could be adapted to other meta-feature based selection algorithms.
Before presenting the bounds, we need some additional notations. We assume that the clas-
sier that is going to use the selected features is chosen from a hypothesis class Hc of real
valued functions and the classication is done by taking the sign.
We also assume that we have a hypothesis class Hf s , where each hypothesis is one
possible way to select the n out of N features. Using the training set, our feature selection is
limited to selecting one of the hypotheses that is included in Hf s . As we show later, if Hf s
contains all the possible ways of choosing n out of N features, then we get an unattractive
generalization bound for large values of n and N. Thus we use meta-features to further
restrict the cardinality (or complexity) of Hf s . We have a combined learning scheme of
choosing both hc ∈ Hc and hf s ∈ Hf s . We can view it as choosing a single classier from
Hf s × Hc . In the following paragraphs we analyze the equivalent size of hypothesis space

Hf s of Mufasa as a function of the number of steps of the algorithm.
For the theoretical analysis, we can reformulate Mufasa into an equivalent algorithm
that is split into two stages, called Mufasa1 and Mufasa2. The rst stage ( Mufasa1 ) is
randomized, but does not use the training set. The second stage ( Mufasa2 ) is deterministic,
and performs the described search. Since Mufasa2 is deterministic, we must generate all po-
tentially required hypotheses in Mufasa1, using the same probability distribution of feature
generating that may be required by step 2a of the algorithm. Then, during the search, the
second stage can deterministically select hypotheses from the hypotheses created in the rst
stage, rather than generate them randomly. Eventually, the second stage selects a single
good hypothesis out of the hypotheses created by Mufasa1 using the search. Therefore, the
size of Hf s , denoted by |Hf s |, is the number of hypotheses created in Mufasa1.

The following two lemmas upper bound |Hf s |, which is a dominant quantity in the
generalization bound. The rst one handles the case where the meta features has discrete
values, and there are a relatively small number of possible values for the meta-features. This
number is denoted by |M F |.
Lemma 5.1 Any run of Mufasa can be duplicated by rst generating J|M F | hypotheses by
Mufasa1 and then running Mufasa2 using these hypotheses only, i.e. using |Hf s | ≤ J|M F |,
where J is the number of iterations made by Mufasa and |M F | is the number of dierent
values the meta-features can be assigned.
Proof. Let Mufasa1 generate J random feature sets for each of the |M F | possible values
of meta-features. The total number of sets we get is J|M F |. We have only J iterations in
the algorithm, and we generated J feature sets for each possible value of the meta-features.
Therefore, it is guaranteed that all hypotheses required by Mufasa2 are available.
Note that in order to use the generalization bound of the algorithm, we cannot consider
only the subset of J hypotheses that was tested by the algorithm. This is because this subset
of hypotheses is aected by the training set (just as one cannot choose a single hypothesis
using the training set, and then claim that the hypotheses space of the classier includes
only one hypothesis). However, from lemma 5.1, we get that the algorithm search within
no more than J|M F | feature selection hypotheses, that were determined without using the
training set.
The next lemma handles the case where the cardinality of all possible values of meta-
features is large relative to 2J , or even innite. In this case we can get a tighter bound that
depends on J but not on |M F |.
Lemma 5.2 Any run of Mufasa can be duplicated by rst generating 2J hypotheses by
Mufasa1 and then running Mufasa2 using these hypotheses only, i.e. using |Hf s | ≤ 2J ,
where J is the number of iterations Mufasa performs.
Proof. In Mufasa1, we build the entire tree of possible random updating of meta-features
(step 2d of algorithm 6) and generate the features. Since we do not know the label, in the
j th iteration we have 2j−1 hypotheses we need to generate. The total number of hypotheses
( J )
is then 2 − 1 .
To state our theorem we also need the following standard denitions:
Denition 5.1 Let D be a distribution over S × {±1} and h : S → {±1} a classication

function. We denote by erD (h) the generalization error of h w.r.t D:
erD (h) = P rs,y∼D [h (s) 6= y]
For a sample S m = {(sk , yk )}m

k=1 ∈ (S × {±1})
m
and a constant γ > 0, the γ -sensitive
training error is:
1
ˆ γS (h) =
er |{i : yi h (si ) < γ}|
m
Now we are ready to present our main theoretical results:
Theorem 5.1 Let Hc be a class of real valued functions. Let S be a sample of size m
generated i.i.d from a distribution D over S × {±1}. If we choose a set of features using
Mufasa, with a probability of 1 − δ over the choices of S , for every hc ∈ Hc and every
γ ∈ (0, 1]:
ˆ γS (hc )+
erD (hc ) ≤ er
√ ( ( ) ( ) )
2 34em 8
d ln log(578m) + ln + g (J)
m d γδ
where d = f atHc (γ/32) and g (J) = min (J ln 2, ln (J|M F |)) (where J is the number of steps
Mufasa does and |M F | is the number of dierent values the meta-features can be assigned,
if this value is nite, and ∞ otherwise).
Our main tool in proving the the above theorem is the following theorem:
Theorem 5.2 (Bartlett, 1998)

Let H be a class of real valued functions. Let S be a sample of size m generated i.i.d
from a distribution D over S × {±1} ; then with a probability of1 − δ over the choices of S ,
every h ∈ H and every γ ∈ (0, 1]:
ˆ γS (h)+
erD (h) ≤ er
√ ( ( ) ( ))
2 34em 8
d ln log(578m) + ln
m d γδ
where d = f atH (γ/32)
Proof. (of theorem 5.1)

{ }
Let F1 , ..., F|Hf s | be all possible subsets of selected features. From theorem 5.2 we
know that
ˆ γS (hc , Fi )+
erD (hc , Fi ) ≤ er
√ ( ( ) ( ))
2 34em 8
d ln log(578m) + ln
m d γδFi
where erD (hc , Fi ) denote the generalization error of the selected hypothesis for the xed
set of features Fi .
By choosing δF = δ/|Hf s | and using the union bound, we get that the probability that
there exist Fi (1 ≤ i ≤ |Hf s |) such that the equation below does not hold is less than δ
ˆ γS (hc ) +
erD (hc ) ≤ er
√ ( ( ) ( ) )
2 34em 8
d ln log(578m) + ln + ln |Hf s |
m d γδ
Therefore, with a probability of 1 − δ the above equation holds for any algorithm that
{ }
selects one of the feature sets out of F1 , ..., F|Hf s | . Substituting the bounds for |Hf s | from
lemma 5.1 and lemma 5.2 completes the proof.
An interesting point in this bound is that it is independent of the number of selected
features and of the total number of possible features (which may be innite in case of feature
( )
generation). Nevertheless, it can select a good set of features out of O 2J candidate sets.
These sets may be non-overlapping, so the potential number of features that are candidates
( )
is O n2J . For comparison, in chapter 3 we presented a same kind of bound but for direct
feature selection. The bound we presented there has the same form as the bound we present
here, but where g (J) is replaced by a term of O (ln N ), which is typically much larger
than J ln 2. If we substitute N = n2J , then for the experiment described in section 5.3.2,
n ln N = Jn (ln 2n) ∼
= 375000 while ln (J|M F |) ∼
= 11.
5.4.2 VC-dimension of Joint Feature Selection and Classication

In the previous section we presented an analysis which assumes that the selection of features
is made using Mufasa. In this section we turn to a more general analysis, which is indepen-
dent of the specic selection algorithm, and rather assumes that we have a given class Hs
of mappings from meta-features to a selection decision. Formally, Hs is a class of mappings
from a meta-features value to {0, 1}, that is, for each hs ∈ Hs , hs: : Rk → {0, 1}. hs denes
which features are selected as follows:
f is selected ⇐⇒ hs (u (f )) = 1
where, as usual, u (f ) is the value of the meta-features for feature f. Given the values
of the meta-features of all the features together with hs we get a single feature selection
hypothesis. Therefore, Hs and the set of possible values of meta-features indirectly denes
our feature selection hypothesis class, Hf s . Since we are interested in selecting exactly n
features (n is predened), we use only a subset of Hs where we include only functions that
imply selection of n features .

6 For simplicity, in the analysis we use the VC-dim of Hs
6 Note that one valid way to dene H
s is by applying a threshold on a class of mappings from meta-features
value to feature quality, Q̂ : Rk → R. See, e.g., example 2 at the end of this section.
without this restriction, which is an upper bound of the VC-dim of the restricted class.
Our goal is to calculate an upper bound on the VC-dimension ([117]) of the joint problem
of feature-selection and classication. To achieve this, we rst derive an upper bound on
|Hf s | as a function of VC-dim (Hs ) and the number of features N.
Lemma 5.3 Let Hs be a class of mappings from the meta-feature space (Rk ) to {0, 1}, and
let Hf s be the induced class of feature selection schemes; the following inequality holds:
( )VC-dim(Hs )
eN
|Hf s | ≤
VC-dim (Hs )
Proof. The above inequality follows directly from the well known fact that a class with
( em )d
VC-dim d cannot give more than
d dierent partitions of a sample of size m (see, for
example, [60] pp. 57).
The next lemma relates the VC dimension of the classication concept class (dc ), the
cardinality if the selection class (|Hf s |) and the VC-dim of the joint learning problem.
Lemma 5.4 Let Hf s be a class of the possible selection schemes for selecting n features out
of N and let Hc be a class of classiers over Rn . Let dc = dc (n) be the VC-dim of Hc . If
dc ≥ 11 then the VC-dim of the combined problem (i.e. choosing (hf s , hc ) ∈ Hf s × Hc ) is
bounded by (dc + log |Hf s | + 1) log dc .
Proof. For a given set of selected features, the possible number of classications of m in-
( )dc
em
stances is upper bounded (see [60] pp. 57). Thus, for the combined learning problem,
dc
( ) dc
the total number of possible classications of m instances is upper bounded by |Hf s | em
dc .
( ) dc
The following chain of inequalities shows that if m = (dc + log |Hf s | + 1) log dc then |Hf s | em
dc <
2m :
( )dc ( )d
e (dc + log |Hf s | + 1) log dc dc log |Hf s | + 1 c
|Hf s | = |Hf s | (e log dc ) 1+
dc dc
1+log e dc
≤ e (|Hf s |) (e log dc ) (5.1)
2+log e dc
≤ (|Hf s |) (e log dc ) (5.2)
2+log e
≤ (|Hf s |) ddc c +1 (5.3)
log dc
≤ ddc c +1 (|Hf s |) (5.4)
(log |Hf s |)
= ddc c +1 dc (5.5)
= 2(dc +1+log |Hf s |) log dc
where we used the following equations / inequalities:
(5.1) (1 + a/d)d ≤ ea ∀a, d > 0
(5.2) here we assume |Hf s | > e, otherwise the lemma is trivial
(5.3) (e log d)d ≤ dd+1 ∀d ≥ 1
(5.4) log dc > 2 (since dc ≥ 11)
(5.5) alog b = blog a ∀a, b > 1
Therefore, (dc + log |Hf s | + 1) log dc is an upper bound on VC-dim of the combined learn-
ing problem.
Now we are ready to state the main theorem of this section.
Theorem 5.3 Let Hs be a class of mappings from the meta-feature space (Rk ) to {0, 1}, let
Hf s be the induced class of feature selection schemes for selecting n out of N features and
let Hc be a class of classiers over Rn . Let dc = dc (n) be the VC-dim of the Hc . If dc ≥ 11,
then the VC-dim of the joint class Hf s × Hc is upper bounded as follows
( )
eN
VC-dim (Hf s × Hc ) ≤ dc + ds log + 1 log dc
ds
where ds is the VC-dim of Hs .
The above theorem follows directly by substituting lemma 5.3 in lemma 5.4.
5.5. Inductive Transfer 99
To illustrate the gain of the above theorem we calculate the bound for a few specic
choices of Hs and Hc :
1. First, we should note that if we do not use meta-features, but consider all the possible
ways to select n out of N features the above bound is replaced by
   
  N  
dc + log   + 1 log dc (5.6)
n
which is very large for reasonable values of N and n.
2. Assuming that both Hs and Hc are classes of linear classiers on Rk and Rn respec-
tively, then ds = k + 1 and dc = n + 1 and we get that the VC of the combined problem
of selection and classication is upper bounded by
O ( (n + k log N ) log n )
If Hc is a class of linear classiers, but we allow any selection of n features the bound
is (by substituting in 5.6):
O ((n + n log N ) log n)
which is much larger if k ¿ n. Thus in the typical case where the number of meta-
features is much smaller than the number of selected features (e.g. in section 5.3.2)
the bound for meta-feature based selection is much smaller.
3. Assuming that the meta-features are binary and Hs is the class of all possible functions
from meta-feature to {0, 1}, then ds = 2k and the bound is
(( ) )
O dc + 2k log N log dc
which is still might be much better than the bound in equation 5.6 if k ¿ log n.
5.5 Inductive Transfer

Here we show how the usage of meta features can be employed for inductive transfer ([10,
110, 16]). As explained in section 1.1.5 of the introduction, inductive transfer refers to the
problem of applying the knowledge learned in one or more tasks to the learning of a new
task (the target task). Here, in the context of feature selection, we interested in better
prediction of feature quality for a new task using its quality in other tasks. We assume
that the dierent tasks use the same set of features. We also assume that during the stage
we required to estimate feature quality we did not yet nd any instances of the new task.
A straightforward prediction is to use the average quality of the feature over the previous
tasks. However we suggest using the quality in previous tasks of other features with similar
properties, i.e. with similar values of meta-features. This make sense as usually not the
exact same features are good for dierent tasks, but rather, good features of dierent tasks
share similar properties. In practice this is done by rst using the training tasks to learn
a regression from meta-features to feature quality (similar to the procedure described in
algorithm 5), and then using the regression to predict the quality of the features for the new
task.
The notion of related tasks arises in many works on inductive transfer. Task is usually
dened as related to the task of interest if using the knowledge gathered in this task improves
performance on the current task. However, this notion is tricky as it may depend on a variety
of parameters such as the choice of the learning algorithm ([16]). Here we show that the
level of improvement depends highly on the kind of knowledge we choose to transfer. It is
much better to transfer which kind of features are good rather than which specic features
are good. This is achieved by using meta-features.
5.5.1 Demonstration on Handwritten Digit Recognition

To demonstrate the use of of meta-features for inductive transfer we consider the following
collection of tasks: a task is dened by a pair of digits (e.g.,

  3 and 7), and the task is to
 10 
classify images of these two digits. Thus we have   = 45 dierent tasks to choose
2
from. To avoid any possible overlap between the training tasks and the testing tasks, we rst
randomly divide the digits 0 . . . 9 into two disjoint sets of ve digits. Then, we choose pairs of
digits for the training tasks from the rst set, and for the testing tasks from the second set.
For each training task we calculate the quality of each feature (pixel) by infogain, and take
5.6. Choosing Good Meta-features 101
the average over all the training tasks. Then, we predict the quality of each feature for the
test task in two dierent ways: (1) directly, as equal to the average quality on the training
tasks. (2) indirectly, by learning a regression from the meta features to the quality. Finally,
we check the correlation coecient between the two dierent predicted qualities and the
real quality, i.e., the infogain calculated using many examples of the test task. To evaluate
the statistical signicance of our results, we repeat this experiment for all possible choices
of training tasks and a test tasks. We try to use two dierent kinds of meta features: exact
(x, y) coordinates and only the distance from the center of the image, r. The results are
presented in Figure 5.5. We can see that the indirect prediction that uses the meta-features
achieves much better prediction, and that using r as a meta feature outperforms using (x, y)
as meta-features. This also suggests that the relatedness of tasks depends on the kind of
knowledge we choose to transfer and on the way we transfer it.
5.6 Choosing Good Meta-features

Choosing good meta-features for a given problem is obviously crucial. It is also worth
inquiring why the problem of nding good meta-features is easier than the problem of nding
good features. As we show in section 5.3.2, in the presence of a huge number of candidate
features it is easier to guess which properties might be indicative of feature quality than
to guess which exact features are good. In this section we give general guidelines on good
choices of meta-features and mappings from meta-feature values to selection of features.
First, note that if there is no correlation between meta-features and quality, the meta-
features are useless. In addition, if any two features with the same value of meta-features
are redundant (highly correlated), we gain almost nothing from using a large set of them.
In general, there is a trade-o between two desired properties:
1. Features with the same value of meta-features have similar quality.
2. Low redundancy between features with the same value of meta-features.
When the number of features we select is small, we should not be overly concerned about
redundancy and rather focus on choosing meta-features that are informative on quality. On
0.75
0.7
0.65
0.6
ρ
0.55
0.5
0.45 Direct estimation

Reg. from MF ( (x,y)−location )
Reg. from MF (radius)
0.4
0 2 4 6 8 10 12
Number of training tasks
Figure 5.5: Inductive Transfer by meta-features for OCR . The correlation

coecient between the predicted and real quality of features as a function of number of
training tasks. The results are averaged over all the possible choices of training tasks and
test task repetitions. It is clear that indirect prediction (using meta-features) signicantly
outperforms the direct prediction.
5.6. Choosing Good Meta-features 103
the other hand, if we want to select many features, redundancy may be dominant, and this
requires our attention. Redundancy can also be tackled by using distribution over meta-
features instead of a single point.
In order to demonstrate the above trade-o we carried out one more experiment using
the MNIST dataset. We used the same kind of features as in section 5.3.2, but this time
without shift-invariance and with xed scatter. The task was to discriminate between 9
and 4 and we used 200 images as the training set and another 200 as the test set. Then
we used Mufasa to select features, where the meta-features were either the (x, y)-location
or the number of inputs. When the meta-feature was (x, y)-location, the distribution of
selecting the features, p (v|u), was uniform in a 4×4 window around the chosen location
(step 2a in Mufasa ). Then we checked the classication error on the test set of a linear SVM
(which uses the selected features). We repeated this experiment for dierent numbers of
7
features . The results are presented in Figure 5.6. When we use a small number of features,
it is better to use the (x, y)-location as a meta-feature whereas when using many features it
is better to use the number of inputs as a meta-feature. This supports our prediction about
the redundancy-homogeneity trade-o. The (x, y)-locations of features are good indicators
of their quality, but features from similar positions tend to be redundant. On the other
hand, constraints on the number of inputs are less predictive of feature quality but do not
cause redundancy.
Rand
(X,Y)
Test error
−1
10 Inputs #
2 3
10 10
Number of features
Figure 5.6: Dierent choices of meta-features. The generalization error as a function

of the number of selected features. The two lines correspond to using dierent
meta-features: (x, y)-location or number of inputs. The results of random selection of
features are also presented.
7 We do not use shift invariance here, thus all the features have the same cost.
5.7 Improving Selection of Training Features
The applications of meta-features presented so far involve predicting the quality of unseen
features. However, the meta-features framework can also be used to improve the estimation
of the quality of features that we do see in the training set. We suggest that instead of
using direct quality estimation, we use some regression function on the meta-features space
(as in algorithm 5). When we have only a few training instances, direct approximation of
the feature quality is noisy, thus we expect that smoothing the direct measure by using
a regression function of the meta-features may improve the approximation. One initial
experiment we conducted in this way shows the potential of this method, but also raises
some diculties.
Once again we used 2000 images of the MNIST dataset with the 784 pixels as input
features, and the (x, y) location as meta-features. We used algorithm 5 with Infogain and
linear regression over RBF (see section 5.2 for details) to produce a prediction of the feature
quality, and compared it to the direct measure of the Infogain in the following way: we cal-
culated the Infogain of each feature using all 2000 instances as a reference quality measure,
and compared its correlation coecient with the two alternatives. The results are presented
in Figure 5.7. It is clear that when the number of training instances is small enough (below
100) the indirect prediction that uses meta-features outperforms the direct measure in pre-
dicting the real Infogain. However, there was no signicant improvement in the error rate of
classication. We believe that this was due to the limitations of Infogain which ignores the
redundancy between the features and the fact that in this choice of meta-features, features
with similar values of meta-features are usually redundant. It might be possible to solve
this problem by a dierent choice of meta-features. Another more fundamental problem is
that if, by chance, a good feature is constant on the training set, our indirect method may
predict correctly that it is good quality feature, but this feature is useless for classication
using the current training set . The question of how to overcome this problem is still open.
5.8. Summary 105
0.98
0.96
0.94
ρ
0.92
0.9
Regression
Direct Measure
0.88
0 100 200 300 400 500 600
Number of training instances
Improving quality estimation from a few training instances for the

Figure 5.7:
OCR problem. The correlation coecient of direct Infogain and prediction using
meta-features with the true Infogain, as a function of the number of training instances. It
is clear that when there are less than 100 training instances, the indirect measure is better.
5.8 Summary
In this chapter we presented a novel approach to feature selection. Instead of just selecting a
set of better features out of a given set, we suggest learning the properties of good features.
We demonstrated how this technique may help in the task of feature extraction, or feature
selection in the presence of a huge number of features and in the setting of inductive transfer.
We showed that when the selection of features is based on meta-features it is possible to
derive better generalization bounds on the combined problem of selection and classication.
We illustrated our approach on a digit recognition problem, but we also expect our methods
to be very useful in many other domains. For an example, in the problem of tissue classi-
cation according to a gene expression array where each gene is one feature, ontology-based
properties may serve as meta-features. In most cases in this domain there are many genes
and very few training instances; therefore standard feature selection methods tend to over-t
and thus yield meaningless results ([31]). A meta-feature based selection can help as it may
reduce the complexity of the class of possible selections.
In section 5.3 we used meta-features to guide feature extraction. Our search for good
features is computationally ecient and has good generalization properties because we do
not examine each individual feature. However, avoiding examination of individual features
may also be considered as a disadvantage since we may include some useless individual
features. This can be solved by using a meta-features guided search as a fast but rough lter
for good features, and then applying more computationally demanding selection methods
that examine each feature individually.
A Notation Table for Chapter 5
The following table summaries the notation and denitions introduced in chapter 5 for quick
reference.
A. Notation Table for Chapter 5 107
Notation Short description Sections
meta-feature a property that describes a feature
k number of meta-features
S (abstract) instances space 5.1, 5.4.1
f a feature. formally, f :S→R
c a classication rule
Sm Sm is a labeled set of instances (a training set) 5.1, 5.2, 5.4.1
ui (f ) the the value of thei's meta-feature on feature f 5.1, 5.3
u (f ) u (f ) = (u1 (f ) , . . . , uk (f )), a vector that describes the feature f 5.1, 5.3, 5.4.1
u a point (vector) in the meta-feature space 5.1, 5.3
Q̂ a mapping from meta-features value to feature quality 5.2, 5.4.2
YM F measured quality of the features (e.g. Infogain) 5.2
XM F XM F (i, j) = the value of the j 's meta-feature on the i's feature 5.2
F, Fj a set of features 5.3
V a random variable indicating which features are selected 5.3
p (v|u) the conditional distribution of V given meta-features values (u) 5.3, 5.6
hc a classication hypothesis 5.4.1
Hc the classication hypothesis class 5.4.1
dc the VC-dimension of Hc 5.4.2
hf s a feature selection hypothesis - says which n features are selected 5.4.1
Hf s the feature selection hypothesis class 5.4.1
hs a mapping from meta-feature space to {0, 1} 5.4.2
Hs class of mappings from meta-feature space to {0, 1} 5.4.2
ds the VC-dimension of Hs
J number of iterations Mufasa (algorithm 6) makes 5.3, 5.4.1
|M F | number of possible dierent values of meta-features 5.4.1

Chapter 6
Epilog
Learning is one of the most fascinating aspects of intelligence. The amazing learning ability of
humans is probably the main thing that distinguishes humans from other animals. Machine
learning tries to theoretically analyze learning processes and to mimic human-like learning
abilities using computers. Many dierent methods have been suggested for learning and it is
widely agreed that given a "good" representation of the data, many of these can give good
results. This observation is consistent with popular saying: asking the question in the right
way is half way to the solution . However, it is less clear how to build a good representation
of the data, i.e., a representation which is both concise and reveals the relevant structure
for the problem. One valid way to do so is rst to measure (or produce) large number
of candidate features which have a high probability of containing "good features" but also
contain many weakly relevant or noisy features which mask the better features, and then
use feature selection techniques to select only some of them. Thus, feature selection, the
task of selecting a small number of features that enable ecient learning, deals with the
fundamental question of nding good representations of data.
One thing to keep in mind is that feature selection is also a learning process in itself, and
therefore is exposed to over-t. Thus, care should always been taken in analyzing the results,
especially when the number of candidate features is huge and we only have a small number
of training instances. In this case the noise of the measurement of feature quality may be
108
109
too high and any selection algorithm that uses only the data will fail. Such a situation is
common in the eld of gene expression, as described by Ein-dor et al. [31]. Therefore for
these cases we need to adopt a dierent approach. One possible way is to use properties of
the features, as we suggest in chapter 5. Another option is to use information extracted in
other related tasks, as done in Inductive Transfer.
We believe that the selection algorithms we suggested in this thesis can work well on many
problems, but it is important to understand that any selection algorithm is based on some
assumptions. If these assumptions are violated the algorithm can fail. On the other hand,
if a stronger assumption holds, another algorithm that assumes this stronger assumption
might outperform the rst one. For example, a method that ranks individual features by
assigning a score to each feature independently assumes that complex dependency on sets
of features does not exist or is negligible. This assumption narrows the selection hypothesis
space, and therefore allows for generalization using fewer instances. Thus, if this assumption
is true, we would expect such a ranking to work better than methods that do not assume
this, i.e. methods that consider subsets and are able to reveal complex dependencies (as
these methods look in a larger hypothesis space). However, we cannot expect such rankers
to work well when this independency assumption is not true. To demonstration this, the
performance of SKS and RGS in gure 4.1 in chapter 4 provide a good example. SKS
and RGS use the same evaluation function, but SKS considers each feature independently
whereas RGS considers all the features at the same time. We can see that SKS outperforms
RGS on problem (a), where there are no complex dependencies, but completely fails on
problem (d), where complex dependencies exist.
Another example can be seen in chapter 3. Simba and G-ip are based on margins, and
therefore implicitly assume that the Euclidean distance is meaningful for the given data. In
the data we used for our experiments this assumption was true and they outperformed all
the alternatives. However on other kinds of data where this assumption is violated, Infogain,
which does not use this assumption, might outperform Simba and G-ip. Thus, there is no
one selection algorithm which is the ideal for all problems. When data of a new kind emerge,
we need to try many selection algorithms to nd which is most suitable. Of course, we can
also use some prior knowledge on the data to predict which selection algorithm is most
110 Chapter 6. Epilog
adequate, if such prior knowledge exists. By analyzing the success of the dierent kinds of
algorithms we can also discover some important properties of the data. An example of such
an analysis is given in the chapter 4. In section 4.4 of this chapter we use the fact that RGS
works better than SKS on the motor control task to suggest that complex dependencies
upon sets of neurons or time bins are an important property of the neural code.
List of Publications
In order to keep this document reasonably sized, only a subset of the work I have done during
my studies is presented in this dissertation. Here is a complete list of my publications.
Journal Papers
• R. Gilad-Bachrach, A. Navot, and N. Tishby, A Study of The Information Bottleneck

Method and its Relationship to Classical Problems. Under review for publication in
IEEE transaction on information theory.
• E. Krupka, A. Navot and N. Tishby, Learning to Select Features using their Properties.
Submitted to Journal of Machine Learning Research (JMLR).
Chapters in Refereed Books

• A. Navot, R. Gilad-Bachrach, Y. Navot and N. Tishby. Is Feature Selection Still
Necessary? In C. Saunders, M. Grobelnik, S. Gunn and J. Shawe-Taylor, editors,
Subspace, Latent Structure and Feature Selection. Springer-Verlag, 2006.
• R. Gilad-Bachrach, A. Navot and N. Tishby Large margin principles for feature se-
lection. In Feature extraction, foundations and applications, I. Guyon, S. Gunn, M.
Nikravesh and L. Zadeh (eds.) , Springer-Verlag 2006.
• R. Gilad-Bachrach, A. Navot and N. Tishby Connections with some classic IT prob-
lems. In Information Bottlenecks and Distortions: The emergence or relevant structure

from data, N. Tishby and T. Gideon (eds.) MIT press (in preparation).
111
112 Chapter 6. Epilog
Refereed Conferences
• R. Gilad-Bachrach, A. Navot and N. Tishby, Query By Committee made real, in Pro-
ceedings of the 19th Conference on Neural Information Processing Systems (NIPS),
2005.
• R. Gilad-Bachrach, A. Navot and N. Tishby, Bayes and Tukey meet at the center point,
in Proceedings of the 17
th Conference on Learning Theory (COLT), 2004.
• R. Gilad-Bachrach, A. Navot and N.Tishby, Margin based feature selection - theory and
algorithms, in Proceedings of the 21st International Conference on Machine Learning
(ICML), 2004.
• R. Gilad-Bachrach, A. Navot, and N. Tishby, An information theoretic tradeo between

complexity and accuracy, in Proceedings of the 16th Conference on Learning Theory
(COLT), pp. 595-609, 2003.
• K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby, Margin analysis of the lvq

algorithm, in Proceedings of the 16th Conference on Neural Information Processing
Systems (NIPS), 2002.
• A. Navot, L. Shpigelman, N. Tishby and E. Vaadia, Nearest Neighbor Based Feature

Selection for Regression and its Application to Neural Activity. Proc. 20th Conference
on Neural Information Processing Systems, 2006
Technical Reports
• R. Gilad-Bachrach, A. Navot, and N. Tishby, Kernel query by committee (KQBC),

technical report 2003-88, Leibniz Center, the Hebrew University, 2003.
Bibliography
[1] D. Aha and R. Bankert. Feature selection for case-based classication of cloud
types: An empirical comparison, 1994.
[2] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms.
Machine Learning, 6:3766, 1991.

[3] R. F. Ahlswede and J. Korner. Source coding with side information and a
converse for degraded broadcast channels. IEEE transaction on information

theory, 21(6):629637, November 1975.
[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In
Proceedings of the Ninth National Conference on Articial Inte lligence (AAAI-

91), volume 2, pages 547552, Anaheim, California, 1991. AAAI Press.
[5] T. W. Anderson. Classication by multivariate analysis. Psychometria, 16:3150,
1951.
[6] D. Angluin. Queries and concept learning. Machine Learning, 2:319342, 1988.
[7] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11.
[8] Jerzy Bala, J. Huang, Haleh Vafaie, Kenneth DeJong, and Harry Wechsler. Hy-
brid learning using genetic algorithms and decision trees for pattern classica-
tion. In IJCAI (1), pages 719724, 1995.

[9] P. Bartlett. The size of the weights is more important than the size of the
network. IEEE Transactions on Information Theory, 44(2):525536, 1998.

[10] Jonathan Baxter. A bayesian/information theoretic model of learning to learn
viamultiple task sampling. Mach. Learn., 28(1):739, 1997.

[11] A. Bell and T. Sejnowski. An information-maximization approach to blind sep-
aration and blind deconvolution. Neural Computation, 7:11291159, 1995.

[12] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University
Press, 1961.
113
114 BIBLIOGRAPHY
[13] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and
the vapnik-chervonenkis dimension. Journal of the ACM, 100:157184, 1989.

[14] B. Boser, I. Guyon, and V. Vapnik. Optimal margin classiers. In Fifth Annual
Workshop on Computational Learning Theory, pages 144152, 1992.
[15] C. Campbell, N. Cristianini, and A. Smola. Query learning with large mar-
gin classiers. In Proceedings of the 17th International Conference on Machine

Learning (ICML), 2000.
[16] Rich Caruana. Multitask learning. Mach. Learn., 28(1):4175, 1997.
[17] Rich Caruana and Dayne Freitag. Greedy attribute selection. In International
Conference on Machine Learning, pages 2836, 1994.
[18] G. C. Cawley. MATLAB support vector machine toolbox (v0.55β ) [
http://theoval.sys.uea.ac.uk/gcc/svm/toolbox]. University of East An-
glia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.
[19] Kevin J. Cherkauer and Jude W. Shavlik. Growing simpler decision trees to
facilitate knowledge discovery. In Knowledge Discovery and Data Mining, pages

315318, 1996.
[20] Shay Cohen, Eytan Ruppin, and Gideon Dror. Feature selection based on the
shapley value. In IJCAI, pages 665670, 2005.

[21] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries
and selective sampling. Advanced in Neural Information Processing Systems 2,

1990.
[22] T.M. Cover and P.E. Hart. Nearest neighbor pattern classier. IEEE Transac-
tions on Information Theory, 13:2127, 1967.
[23] K. Crammer. Mcsvm_1.0: C code for multiclass svm, 2003.
http://www.cis.upenn.edu/∼crammer.
[24] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of
Proc. 17'th Conference on Neural Information Processing

the lvq algorithm. In
Systems (NIPS), 2002.

[25] Ofer Dekel and Yoram Singer. Data-driven online to batch conversions. In
Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information

Processing Systems 18. MIT Press, Cambridge, MA, 2006.
[26] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum-likelihood from incomplete
data via the em algorithm. Journal of the Royal Statistical Society, B, 39:138,
1977.
BIBLIOGRAPHY 115
[27] L. Devroye. The uniform convergence of nearest neighbor regression function es-
timators and their application in optimization. IEEE transactions in information

theory, 24(2), 1978.
[28] L. Devroye, L. Gyor, and G. Lugosi. A Probabilistic Theory of Pattern Recog-
nition. Springer, New York, 1996.
[29] R.O. Duda, P.E. Hart, and Stork D.G. Pattern Classication 2'nd edition. Wiley-
Interscience Publication, 2000.
[30] L. Kasatkina E. Kussul, T. Baidyk and V. Lukovich. Rosenblatt perceptrons
for handwritten digit recognition. In Int'l Joint Conference on Neural Networks,

pages 151620, 2001.
[31] L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to
generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci
U S A, 103(15):59235928, April 2006.
[32] E. Fix and j. Hodges. Discriminatory analysis. nonparametric discrimination:
Consistency properties. Technical Report 4, USAF school of Aviation Medicine,
1951.
[33] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sci-

ences, 55(1):119139, 1997.
[34] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the
query by committee algorithm. Machine Learning, 28:133168, 1997.

[35] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory
data analysis. IEEE Trans. of Computers, 23(9):881890, 1974.

[36] Jerome H. Friedman. Exploratory projection pursuit. Journal of the American
Statistical Association, 82(397):249266, 1987.
[37] J. H. Gennari, P. Langley, and D. Fisher. Models of incfremental concept for-
mation. Articial Intelligence, 40:1161, 1989.

[38] R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, Hebrew University, 2006.
[39] R. Gilad-Bachrach, A. Navot, and N. Tishby. An information theoretic tradeo
between complexity and accuracy. In Proc. 16'th Conference on Computational

Theory (COLT), pages 595609, 2003.
[40] R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection -
theory and algorithms. In Proc. 21st (ICML), pages 337344, 2004.

116 BIBLIOGRAPHY
[41] R. Gilad-Bachrach, A. Navot, and N. Tishby. Large margin principles for feature
selection. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature

extraction, foundations and applications. Springer, 2006.
[42] R. Greiner. Using value of information to learn and classify under hard bud-
gets. NIPS 2005 Workshop on Value of Information in Inference, Learning and
Decision-Making, 2005.
[43] I. Guyon and A. Elissee. An introduction to variable and feature selection.
Journal of Machine Learnig Research, pages 11571182, Mar 2003.
[44] I. Guyon and S. Gunn. Nips feature selection challenge. http://www.nipsfsc.

ecs.soton.ac.uk/, 2003.
[45] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature extraction, foundations

and applications. Springer, 2006.
[46] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.

Springer, 2001.
[47] P. M. Hofman, J. G. A. Van Riswick, and A. Van Opstal. Relearning sound
localization with new ears. Nature Neuroscience, 1:417421, 1998.
[48] j. Holland. Adaption in Neural and Articial Systems. University of Michigen
Press, 1975.
[49] J. Hua, Z. Xiong, and E. R. Dougherty. Determination of the optimal number
of features for quadratic discriminant analysis via normal approximation to the
discriminant distribution. Pattern Recognition, 38(3):403421, March 2005.
[50] P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435475, 1985.
[51] J.P. Ignizio. Introduction to Expert Systems. McGraw-Hill, Inc., USA, 1991.
[52] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing
the curse of dimensionality. In Proceedings of the 30th ACM Symposium on the

Theory of Computing, pages 604613, 1998.
[53] A. K. Jain and W. G. Waller. On the monotonicity of the performance of bayesian
classiers. IEEE transactions on Information Theory, 24(3):392394, May 1978.
[54] A. K. Jain and W. G. Waller. On the optimal number of features in the classi-
cation of multivariate gaussian data. Pattern Recognition, 10:365374, 1978.

BIBLIOGRAPHY 117
[55] George H. John, Ron Kohavi, and Karl Peger. Irrelevant features
and the subset selection problem. In International Conference on Ma-

chine Learning, pages 121129, 1994. Journal version in AIJ, available at
http://citeseer.nj.nec.com/13663.html.
[56] I.T. Jolliee. Principal Component Analysis. Springer Varlag, 1986.
[57] M.C. Jones and R. Sibson. What is projection pursuit ? J. of the Royal Statistical
Society, ser. A, 150:136, 1987.
[58] M. W. Kadous and C. Sammut. Classication of multivariate time series and
structured data using constructive induction. Mach. Learn., 58:179216, 2005.

[59] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A
survey. Journal of Articial Intelligence Research, 4:237285, 1996.

[60] Michael J. Kearns and Umesh V. Vazirani. An introduction to computational
learning theory. MIT Press, Cambridge, MA, USA, 1994.
[61] K. Kira and L. Rendell. A practical approach to feature selection. In Proc. 9th
International Workshop on Machine Learning, pages 249256, 1992.
[62] R. Kohavi and G.H. John. Wrapper for feature subset selection. Articial Intel-
ligence, 97(1-2):273324, 1997.
[63] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.
[64] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Inter-
national Conference on Machine Learning, pages 284292, 1996.
[65] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In
Proc. ECML, pages 171182, 1994.

[66] E. Krupka and N. Tishby. Generalization in clustering with unobserved features.
In NIPS, 2006.
[67] J. B. Kruskal. Articial Intelligence: A Modern Approach. Prentice Hall, 2
nd
edition edition, 2002.
[68] K. J. Lang and E. B. Baum. Query learning can work poorly when a human
oracle is used. In Proceedings of the International Joint Conference on Neural

Networks, pages 335340, 1992.
[69] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard,
and L.J. Jackel. Backpropagation applied to handwritten zip code recognition.
Neural Computation, 1:541551, 1989.

118 BIBLIOGRAPHY
[70] Y. LeCun, L. Bottou, Y. Bengio, and P. Haner. Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):22782324, November

1998.
[71] K. Levi, M. Fink, and Y. Weiss. Learning from a small number of training
examples by exploiting object categories. LCVPR04 workshop on Learning in
Computer Vision, 2004.
[72] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Al-

gorithms. PhD thesis, University of California Santa Cruz, 1989.
[73] H. Liu and R. Sutiono. Feature selection and discretization. IEEE Trans Knowl-
edge and Data Eng, 9.
[74] Huan Liu, Farhad Hussain, Chew Lim Tan, and Manoranjan Dash. Discretiza-
tion: An enabling technique. Data Min. Knowl. Discov., 6(4):393423, 2002.

[75] D. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-bayes classi-
ers. In UAI, 2003.

[76] George F. Luger. Articial Intelligence: Structures and Strategies for Complex
Problem Solving. Addison-Wesley Longman Publishing Co., Inc., Boston, MA,
USA, 2001.
[77] J. MacQueen. Some methods for classication and analysis of multivariate ob-
servations. Proceedings of the Fifth Berkeley Symposium on Mathematical

In
statistics and probability, pages 281297. University of California Press, 1967.

[78] T. Marill and D Green. On the eectiveness of receptors in recognition systems.
IEEE Transactions on Information Theory, 9:1117, 1963.

[79] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners.
In Articial Intelligence Review, volume 11, pages 193225, April 1997.

[80] A.M. Martinez and R. Benavente. The ar face database. Technical report, CVC
Tech. Rep. #24, 1998.
[81] L. Mason, P. Bartlett, and J. Baxter. Direct optimization of margins improves
generalization in combined classier. Advances in Neural Information Processing

Systems, 11:288294, 1999.
[82] A.J. Miller. Subset Selection in Regression. Chapman and Hall, 1990.
[83] Kensaku Mori and Hiroshi Nagao andYoshihiro Yoshihara. The olfactory bulb:
Coding and processing of odor molecule information. Science, 286(5440):711
715, 1999.
BIBLIOGRAPHY 119
[84] A. Navot, R. Gilad-Bachrach, Y. Navot, and N. Tishby. Is feature selection
still necessary? In C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor,
editors, Subspace, Latent Structure and Feature Selection. Springer-Verlag, 2006.

[85] A. Navot, L. Shpigelman, N. Tishby, and E. Vaadia. Nearest neighbor based
feature selection for regression and its application to neural activity. In Proc.
20 th
Conference on Neural Information Processing Systems (NIPS), forthcoming
2006.
[86] A. Ng. Feature selection, l1 vs l2 regularization and rotational invariance. In
Proc. 21st International Conference on Machine Learning (ICML), pages 615
622, 2004.
[87] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms

and Complexity. Prentice Hall, 1982.
[88] R. Paz, T. Boraud, C. Natan, H. Bergman, and E. Vaadia. Preparatory activity
in motor cortex reects learning of local visuomotor skills. Nature Neuroscience,

6(8):882890, August 2003.
[89] J. Pearl. Probablistic Reasonong in Intelligent Systems. Morgan Kaufmann, San
Mateo, CA, 1988.
[90] A. Perera, T. Yamanaka, Gutierrez-Galvez A., B. Raman, and R. Gutierrez-
Osuna. A dimensionality-reduction technique inspired by receptor convergence
in the olfactory system. Sensors and Actuators B, 116(1-2):1722, 2006.

[91] Bernhard Pfahringer. Compression-based feature subset selection, 1995.
[92] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flan-
nery. Numerical Recipes in C: The Art of Scientic Computing. Cambridge
University Press, Cambridge, UK, 1988.
[93] J. R. Quinlan. Induction of decision trees. Journal of Machine Learning, 1:81

106, 1986.
[94] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[95] R. Raina, A.Y. Ng, and D. Koller. Constructing informative priors using transfer
learning. In Proc. Twenty-Third International Conference on Machine Learning,

2006.
[96] S. Raudys and V. Pikelis. On dimensionality, sample size, classication error and
complexity of classication algorithm in pattern recognition. IEEE Transactions

on pattern analysis and machine intelligence, PAMI-2(3):242252, May 1980.
120 BIBLIOGRAPHY
[97] J. Rissanen. Modeling by shortest data description. Automatica, 14:465471,
1978.
[98] F. Rosenblatt. The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65(6):386408, 1958.

[99] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear
embedding. Science, 290:23232326, 12 2000.

[100] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin :
A new explanation for the eectiveness of voting methods. Annals of Statistics,

1998.
[101] Thomas Serre, Maximilian Riesenhuber, Jennifer Louie, and Tomaso Poggio. On
the role of object-specic features for real world object recognition in biologi-
cal vision. In BMCV '02: Proceedings of the Second International Workshop
on Biologically Motivated Computer Vision, pages 387397, London, UK, 2002.

Springer-Verlag.
[102] C. E. Shannon. A mathematical theory of communication. Bell System Technical

Journal, 27, July and October 1948.
[103] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural
risk minimization over data-dependent hierarchies. IEEE transactions on Infor-

mation Theory, 44(5):19261940, 1998.
[104] P. Simard, Y. LeCun, J. S. Denker, and B. Victorri. Transformation invariance
in pattern recognition-tangent distance and tangent propagation. In Neural Net-

works: Tricks of the Trade, pages 23927, 1996.
[105] P. Y. Simard, Y. A. Le Cun, and J. Denker. Ecient pattern recognition using
a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles,
editors, Advances in Neural Information Processing Systems, volume 5, pages
5058. Morgan Kaufmann, 1993.
[106] David B. Skalak. Prototype and feature selection by sampling and random muta-
tion hill climbing algorithms. In International Conference on Machine Learning,

pages 293301, 1994.
[107] G. W. Snedecorand and W. G. Cochran. Statistical Methods. Iowa State Uni-
versity Press, Berlin, Heidelberg, New York, 1989.
[108] B. Taskar, M. F. Wong, and D. Koller. Learning on the test data: Leveraging
unseen features. In ICML, 2003.

BIBLIOGRAPHY 121
[109] D. M. Taylor, S. I. Tillery, and A. B. Schwartz. Direct cortical control of 3d
neuroprosthetic devices. Science, 296(7):18291832, 2002.

[110] S. Thrun. Is learning the n-th thing any easier than learning the rst? In Proc.
15th Conference on Neural Information Processing Systems (NIPS) 8, pages
640646. MIT Press, 1996.
[111] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In
Proc. 37th Annual Allerton Conf. on Communication, Control and Computing,

pages 368377, 1999.
[112] Godfried Toussaint. Proximity graphs for nearest neighbor decision rules: Recent
progress.
[113] G. V. Trunk. A problem of dimensionality: a simple example.IEEE Transactions

on pattern analysis and machine intelligence, PAMI-1(3):306307, July 1979.
[114] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate com-
plexity and their use in classication. Nature Neuroscience, 5(7):682687, 2002.

[115] H. Vafaie and K. DeJong. Robust feature selection algorithms. Proc. 5th Intl.
Conf. on Tools with Artical Intelligence, pages 356363, 1993.
[116] V. Vapnik. The Nature Of Statistical Learning Theory. Springer-Verlag, 1995.
[117] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[118] V. Vapnik and A. Y. Chervonenkis. On the uniform covergence of relative fre-
quencies of events to their probabilities. Theory of Probability and its Applica-

tions, 16(2):264280, 1971.
[119] J. Weston, A. Elissee, G. BakIr, and F. Sinz. The spider, 2004.
http://www.kyb.tuebingen.mpg.de/bs/people/spider/.
[120] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik.
Feature selection for SVMs. In Proc. 15th NIPS, pages 668674, 2000.
[121] A. D. Wyner. On source coding with side information at the decoder. IEEE
transaction on information theory, 21(3):294300, May 1975.
122 BIBLIOGRAPHY
ziaeyig dcinla zeipekz zxiga lr
diteqelitl xehwec xèz zlaw myl xeaig

z`n
zeap xin`
e'qyz zpya milyexia zixard dhiqxaipe`d hpql ybed

II
III
iayiz ilztp xeqtext ly eziigpda dzyrp ef dcear

IV
xivwz
zeipekz zxigaa ,hxtae ,ziaeyig dcinla zeipekz zxiga ly ly mipey mihwtqà dpc ef dfz
xvw èanae ziaeyig dcinl ìd dn lr xvw xaqda ligzp df xivwza okl .dgpen dcinl xear
.dfza zebvend zeifkxnd zeycgd ze`vezd z` dxvwa bivp jynda .zipekzd zxiga megzl
ziaeyig dcinl
millk zwqde dcinl ly miineyiie miinzixebl` ,miihxeìz mihaida zwqer ziaeyig dcinl
dpekn zepal miqpn ep`y ìd dpeekd ze`nbecn dcinl epxneà ,blfnd dvw lr .ze`nbecn
xnelk) cneld llk jxca .ze`nbeca diitv i"r dniyn rval cenll zlbeqnd (k"ca aygn zpkez)
-gz zzl xyt`nd mlerd ly lcen zepal icka ze`nbec ly oeni` zveawa ynzyn (dpeknd
-edy miweg sqeà yeniy i"r zeifgz zzl zlbeqnd dpkezl cebipa df .mlerd lr zepin` zeif
cenll dkixv dpeknd ziaeyig dcinla okl .(zizek`ln dpia ly ziq`lwd dyibd) y`xn excb
:l"fg ly dpyid dxinà dwac ziaeyig dcinl ly dyibd df oaenae ,ìd "dpeiqpn"
oeiqip lrak mkg oi`
-nzn xen`k ep` ,ziaeyig dcinl ly megzd zgz mixwgpd mipey dcinl ilcen xtqn mpyi
lk ea biiezn mbcn oeni` zveawk lawn cneld dgpen dcinla .dgpen dcinla ef dceara micw
z` xyt`d lkk wiecna zefgl gilvdl ìd cneld ly ezxhn .(zieez ,mvr) dxevdn bef ìd `nbec
zihwxhqa` zeyi zeidl leki mvrd illk oteà .mcew d`x `l èd eze` ycg mvr ly zieezd
-xtqn ly xehwe i"r bvein mvrdy migipn ziaeyig dcinla k"ca j` ,hqwh e` dpenz ,mc` oebk
ly xehwe i"r bvein mvrd okle ,(feature) zipekz z`xwp df xehweea dh`picxeèw lk .miiynn mi
,dtivx e` (classication) beiz ziira epl yiy xn`p fè ,zixebhw zeidl dleki zieezd .zeipekz
caln) beiz zeiraa onfd aex cwnzp df xeaiga .(regression) diqxbx ziira epl yiy xn`p fè
V
VI
epl reci epi`y (dxhnd byen `xwpd) reaw llk it lr dyrp beizdy ìd k"ca dgpdd .(4 wxta
df llk axwiy (dxryd `xwpd) beiz llk èvnl icka oeni`d zveawa ynzydl ìd epzxhne
.ahid
èd aeh dnk xnelk ,cnlpd llkd ly dllkdd zleki ìd cnel ly ezeki`l zlaewnd dcind
dcinla micwnzn ep` .cenild onfa d`x `l cneldy ycg mvr lr dxhnd byen ly ekxr z` dfeg
ze`nbecy migipn ea [811 ,31] Probably Approximately Correct lcena hxtae ziaihwecpi`
beiz llk ly dllkdd zleki o`k .(iid) dreaw zebltzd jezn ielz izla oteà enbcp oeni`d
izla oteà dnbcpy dycg `nbec lr drhi èdy iekiqd ìdy ely dllkdd zìby i"r zccnp
oeni`d zìbya lkzqp okle df lceb zexiyi cecnl lkep `l k"ca z`f mr .zebltzd dze`n ielz
.oeni`d ze`nbec llk jezn dreh cnlpd llkd odilr oeni`d ze`nbec ly iqgid owlgk zxcbend
eli` zgz èvnl xnelk ,(dcinll ozip) cinl dn oiit`l dqpn ziaeyig dcinl ihxeìzd cva
zelabnd lr zepaez zzl ieyr dfk oeit` .dphw dllkd zìby dgihan dphw oeni` zìby mi`pz
dleki eply dxrydd xnelk ,zelabn mey lihp `l m`y oaen .ziivl dkixv zcnel zkxrn lk odl
dllkdd zìbyl oeni`d zìby oia xrtd z` meqgl ozip `l ,didiy lkk jaeqn ,xac lk zeidl
mlern dxgap eply dxryddy ìd zlaewnd dgpdd .odylk zeliabn zegpd gipdl miaiig okle
ly divwpetk xrtd z` meqgl ozip df dxwna ."ziqgi heyt"e oezp (zexryd zwlgn) zexryd
dheyt zexrydd zwlgny lkky jk ,oeni`d ze`nbec xtqn lye zexrydd zwlgn zeikeaiq
oexwird ly dnbcd ìd ef d`vez .xrtd lr mqg eze` gihadl icka ze`nbec zegt jxhvp ,xzei
: Occam's Razor ly
entia non sunt multiplicanda praeter necessitatem
yxcpl xarn zeltkeyn zeidl zekixv `l zeieyi :l mbxziny
xzeia aehd èd xzeia heytd xaqdd :k ycgn gqepn zeaexw mizirle
daeh dxryd èvnl milbeqnd minzixebl` èvnl dqpn ziaeyig dcinl inzixebl`d cva
mxear minzixebl` sicrp ilìci` oteà .oeni`d zveaw ozpida (dxhnd byen z` aeh zaxwnd)
heyt"y miihqxei minzixebl` mb z`f mr .d`vezd weic lre dvixd onf lr minqg gikedl ozip
gikedl ozip dxear dtelgdn xzei aeh micaer md lreta mizirle megza icnl mivetp "micaer
.zeitivtq zeniynl minzixebl` mi`zdl dqpn ziaeyig dcinl ineyiid cva .zepey zepekz
mineyil ze`nbec dnk .diiyrza ode dincwà od zeax zeniynl miynyn dcinl inzixebl`
.i`xy` iqihxka zeinxz iedif ,ziètx dpga` ,mihqwh oein ,mitevxt iedif :od
VII
beviid ziira
minvry migipn dcinld ilcen aexa .minvrd z` bviil ji` ìd ziaeyig dcinla zifkxn dl`y
ligzn dcinld jildz ly gezipde ,(edylk iteq cnin èd N xy`k) RN a mixehwek mipezp
ebiyi mixiaqd dcinld inzixebl` aex ,aeh bevii ozpiday jk lr dagx dnkqd dpyi .ef dcewpn
mirevia biydl deewz oi` rexb bevii xgap m` ipy cvn .mi`zn oepeek xg`l mixiaq mirevia
ly xehwe ici lr (zepenz lynl) minvrd ly aeh bevii mipea ji` dl`yd zl`yp okl .miaeh
zxiga .jci`n epiptl zgpend diral zernyn lrae cgn izivnz zeidl jixv aeh bevii ?mixtqn
df zeipekz sqe` ,zeizin` zeiraa .mvr lk xear eccniy zeipekz ly sqe` xegal dzrenyn bevi
bevii èvnl ozip m`d dl`yd zl`yp j` .ipci oteà ihpeelxd megza dgnen ici lr xgap k"ca
.oeni`d zveawa yeniy jez ihnehe` oteà aeh
(xviil e`) cecnla ligzp df dxwna .zeipekz zxiga zxfra ìd ef dira sewzl mikxcd zg`
-xeblà ynzyp fè ,zeaeh zeipekz mb zeàgzn odipiay gipdl xiaqy jk ,zeipekz ce`n daxd
deedzy xnelk ,miaeh mirevia zxyt`nd dphw dveaw zz èvnl icka zeipekz zxigal mzi
cnin zcxed inzixeblà yeniy ipt lr zeipekz zxigaa yeniy ly miixwird zepexzid .aeh bevii
zeipekz cizra cecnl jxevd jqkpy jka) oekqige d`vezd ly xzei dgep geprt zleki md millk
.(exgap `ly
zipekz zxiga
xy`k ,zeipekz ly ce`n lecb zxtqn ici lr mibvein minvrd ,mixwndn daxda ,dgpen dcinla
zxiga .wiecn iefig xvil zlekia mirbet sè ,zeibzd ly iefig jxevl zevegp `l odn lecb wlg
iefig xyt`nd zeipekzd llk jezn zeipekz ly (ohw k"ca) xtqn zxiga ly dniynd ìd zeipekz
:md zeipekz zxiga revial miixwird miripnd rax` .dxhnd zeibz ly aeh
mipezp lr dvxdl mipzip `l heyt miax dcinl inzixebl` .ziaeyigd zeikeaiqd xetiy .1
aly .mdly ddeabd ziaeyigd zeikeaiqd llba ,zeipekz ly mevr xtqn ci lr mibveind
.ef dira xeztl leki zeipekz zxiga ly micwn
zeipekzd lk z` (iefigd alya) cizra cecnl jxevd z` zrpen zeipekz zxiga .ilklk oekqg .2
leki minieqn mixwnae mevr zeidl leki df oekqg .mzcicn zelr zkqgp jkae exgap `ly
.miyil miyi izlan oiiprd lk z` jetdl

VIII
llkd ly iefigd zleki z` xtyl mb dleki zeipekz zxiga mixwn daxda .weicd xetiy .3
epl mirecid xzeia miaehd dcinld inzixebl`d mb .yrxl ze`d qgi xetiy ici lr cnlpd
`l hrnk e` zeihpeelx `l zeipekz ly mevr xtqn ly mzegkep lr xabzdl mileki `l meik
oteà weicd z` xtyl leki zeipekz zxiga ly micwn crv el` mixwna ,okl .zeihpeelx
.ihnxc
daxda .xeztl miqpn ep`y dirad zpada xefrl dleki exgapy zeipekzd zedf .dirad zpad .4
aeygp lynl .envr beizd weicn xzei daeyg zeaeygd zeipekzd lr riavdl zlekid mixwn
mikxc cer opyiy cera .mipey mipb ly iehia zenx it lr (dleg `l/dleg) iètx oega` lr
dlgnd oepbpn zpada xefrl dleki iefigl miaeygy mipbd zedf ,`l e` dleg mc` m` zrcl
.zetexz gezitae
4 e 2 miripn iabl .zeipekz iaexn mipezp ly dliri dcinla aeyg aikxn ìd zeipekz zxiga okl
.zeillk cnin zcxed zehiy ipt lr oexzi yi zeipekz zxigal ,lirl
eliè zeihpeelx zeipekz el` iedifk dniynd z` mixicbn zeipekz zxiga lr mixwgn daxda
zipekz ly dnexzdy oeeikn .zeihpeelx zxcbd ly dirad z` dlrn ef dxcbd .zeihpeelx opi`
epzip zepey zexcbde ,df byen xicbdl heyt `l llk ,miynzyn ep` ztqep zeipekz elià dielz
zepekzd zveawa zexag zxxeb cinz `l zeihpeelx el` zexcbd lk itl j` .mipey mixweg ici lr
dveaw-zz zxiga" k zeipekzd zxiga zniyn z` xicbdl miticrn ep` okl .jtidle zilnihte`d
dirad zpad ìd dxhndyk ,z`f mr ."dxhnd zeibz ly aeh iefig zxyt`nd zeipekz ly dphw
hrnk zeizin` zeiraa j` .daeyg ìd zipekz lk ly zeihpeelxd zcin zrici ,(lirl 4 ripn)
.zipekz lk ly dnexzd zcin lr xacl sicr okle ,zihpeelx `l oihelgl dpi` zipekz mrt s`
lceb èd ilty jxr .dzeaiygl ccnk zipekz ly (Shapley) ilty jxr z` erivd [20] eixage odk
.izveaw wgyna owgy ly enexz zcicnl ynyn èd my ,miwgynd zxezn gewld
mbe ziaeyigd dcinld megza mb zeipekz zxigal mipyd jyna erved miax minzixebl`
miwleg mdn miax j` ,mdipia ce`n mipey minzixebl`dn wlg .dwihqihhqd megza xwgna
zeipekz ly zeveawl oeiv zpzepd dkxrd zivwpet dpyi df dpana .oldl x`zpy illkd dpand z`
ly dygndl 1.1 xei` d`x .deab oeiv mr zeipekz zveaw yetigl ynynd yetig mzixebl` epyie
zthrnd lcen) mieqn dcinl mzixebl` lr zqqean zeidl dleki dkxrdd zivwpet .df dpan
aexa .([26] oepiqd lcen) iefigl zipekz lk ly dznexz iefigl zillk dcin idyefi` lr e` ,([26]
lr cnlpd llkd zwica ici lr zkxreyn zeipekz zveaw ly dzeki` ,mitherd minzixebl`d
IX
lcebd lr zelret ody jka èd df beqn zehiy ly ixwird oexzid .(validation set) dkxrd zveaw
onfa èd oexqigd .dcinll ynzyp ea mzixebl`d ly iefigd zeki` - epze` oiiprn znày
oepiqd inzixebl`n miax .dcinld mzixebl` z` ycgn uixdl yi crv lkay oeeikn ,jex` dvix
divnxetpi` e` divlxew mcwn ,(beizd ozpida zipekzd ly) dpzen zepey oebk milcba miynzyn
.zeki`l millk miccnk ztzeyn
okle ixyt` epi` zeipekz ly zeixwird zeveawd izz lk ipt lr `vnn yetig ,dxwn lka
.zeiq`lw zehiy izy wx rbxk xikfp .zepey zewihqixeia zeynzynd ,yetig zehiya miynzyn
miligzn dnicw dxigaa .(backward elimination) xeg`l dxqde (forward selection) dnicw dxiga
z` mitiqen crv lka .ipcng oteà zg` zipekz mitiqen crv lkae dwix zeipekz zveaw mr
lk zveawa mligzn xeg`l dxqda .dkxrdd zivwpet jxra iaxn lecibl dìand zipekzd
ly dkxra (zixrfn dphwdl e`) iaxn lecibl dìand zipekzd z` mixiqn crv lkae zeipekzd
.lecb zeipekzd xtqn xy`k jex` dvix onf èd ef dyib ly ixwird dpexqg .dkxrdd zivwpet
zxiga jxevl erved miax minzixebl` ,epxkfd xaky itk .oaenk ok mb ixyt` miizyd oia aeliy
miwcea heyt mwlg .yetigd oteà ode dkxrdd zivwpet ly dxigaa od mdipia mipeyd zeipekz
xzeia ddeabd ziyi`d zeki`d zelra zeipekzd z` mixgeae dcal zipekz lk ly dzeki` z`
-rete lirl dbvedy dnbicxtdn mibxeg llka mixg` .zeipekz ly zeveawl mb miqgiizn mwlge
xivwza xewqln drixid dxvw .cenild jildza zilxbhpi` zealzyd lynl ,zxg` dyiba mil
qqazn mzixebl` lky oiadl aeyg j` ,xara ervedy miaxd minzixebl`d oian mbcn elit` df
mr z`f znerl .lykidl lelr mzixebl`d ,zeniiwzn `l el` zegpd m` .zeniieqn zegpd lr
wwcfdl xnelk ,xzei aeh cearl ietv mze` gipnd xg` mzixebl` ,zenwiizn xzei zewfg zegpd
ohw zexryd agxn jeza ytgn èdy oeeikn z`f ,mirevia mze`l ribdl icka ze`nbec zegtl
zepey zeiral ."xzeia aehd" cinz èdy cg` dxiga mzixebl` meiwl zetvl ozip `l okl .xzei
zxiga ik xikfdl mewnd mb o`k .dirhe ieqipn qepn oiè xzei aeh elrti mipey minzixebl`
opyi xy`k xwira ,(overtting) xzi zn`zdl seyg okle envr ipta dcinl jildz ìd zeipekz
.oeni` ze`nbec hrn wx j` ce`n zeax zeipekz
ze`vezd xivwz
ly cxtp aly m`d dl`yd z` milrn ep` .zeipekz zxiga ly dzevigpa mipc ep` 2 wxta
ly ozegkep lr xabzdl mileki miipxcen dcinl inzixebl` ile` e` ,zyxcp ok` zeipekz zxiga
X
dirad ly ycg gezip mibivn ep` ef dl`y lr zeprl icka .mnvra zeipekz ly mevr xtqn
ep` oey`x alya .ziqe`b zebltzd zelra zewlgn izyn elxbedy zecewp beiz ly dheytd
zeipekzd xtqn ly divwpetk ezìby z` migzpn ep` .ziaxnd ze`xpd cixtn xear gezip mibivn
drxkd ly dìbyd enk drexb zeidl dlelr dìbydy ceray mi`xne oeni`d ze`nbec xtqn lye
zixyt`d zil`nihte`d dìbyl ztèy ìd ,zeipekz icin xzeia miynzyn xy`k zi`xw`
zeipekzd xtqn z` yxetn oteà mi`ven ep` ,sqepa .dnkga zeipekzd xtqn z` mixgea xy`k
ep` ,ipy alya .zeitivtq ze`nbec xtqn xear oeni`d ze`nbec xtqn ly divwpetk iahind
ixitn` oteà ,meik ziaeyig dcinld xzeia hlead beizd mzixebl` èdy , [41] SVM z` mipgea
,deab cnin mr ahid ccenznd mzixebl`l aygp SVM y zexnly ze`xn ze`vezd .deehn ezeà
zeipekz zxigay zefnxn el` ze`vez .eply gezipd ly zeifgzd z` wiecna minèz eirevia
miynzyn xy`k mb ,miwiecn micixtn zcinll zekxrn ly oepkza ihixw aikxn deedn oiicr
mieedn `l zeipekz zcicn zeielre miiaeyig miveli` xy`k mbe ,miipxcen dcinl inzixeblà
.lewiy
miileyd oexwr lr zeqqeand beiz zeiraa zeipekz zxigal zeycg zehiy mirivn ep` 3 wxta
.ezhlgdl qgia cixtn ly epeghia zcin zkxrdl zixhneìb dcin md [001 ,41] miiley .miiaxnd
hlea mzixebl` èd SVM ,`nbecl .zieeykr ziaeyig dcinla aeyg ciwtz miwgyn xak miiley
miileyd oexwira yeniya uerp df wxta zebvend ze`veza yecigd .miiaxn miiley lr qqeand
zxigal miycg minzixebl` gztl epl xyt`n miileya yeniyd .zeipekz zxiga myl miiaxnd
1-Nearest-Neighbor ly dllkdd zìby lr md dllkdd inqg .dllkd inqg gikedl mbe zeipekz
zeipekz zxiga mzixebl`l miitivtq mpi` minqgd .zxgap zeipekz zveawa yeniy dyrp xy`k
zeipekz zveaw xgead zeipekz zxiga mzixebl` lk xear miaeh mirevia migihan `l` ,miieqn
miiley qqean oeixhixwa miynzyn ep` inzixebl`d cva .milecb miiley lr dxiny jez dphw
,miycg zeipekz zxiga inzixebl` ipy mibivn ep` .zeipekz zeveaw ly mzeki` z` cecnl icka
ibeq xtqn lr znbcen el` minzixebl` ly mzleki .df oeixhixw lr miqqeand G-ip eSimba
.mipezp
yeniy miyer ep` .(regression) dtivx divwpet zcinla zeipekz zxigaa mipc ep` 4 wxta
da efl draha dnecd dkxrd zivwpeta miynzyne ztqep mrt 1-Nearest-Neighbor mzixeblà
j` ixìpil `l èdy diqxbx ziiraa zeipekz zxigal mzixebl` migztn ep` .3 wxta epynzyd
mihlwa dxhnd zivwpet ly zkaeqn zelz qetzl lbeqn minzixebl`d .lirie heyt z`f mr
mzixebl`d ly erah z` mxiaqn ep` .zirah dinfixlebxk leave-one-out d zìbya ynzyne

XI
,jynda .mireci minzixebl`l qgia eizepexzi z` minbcne miizek`ln mipezp zxfra epgzity
-xewa ehlwedy miwiitqn ci zrepz zexidn iefig ly xywda epgzity mzixeblà miynzyn ep`
mirivne iefigd zeki` z` xtyl migilvn ep` zeipekz zxigaa yeniy ici lr .bdpzn sew ly qwh
oteà) zepeyd zeipekza zakxen zelzy zefnxn ze`vezd .miilpexiep mipezp gezipl dycg jxc
.df iavr cewa zniiw ok` (miieqn onf glta miieqn oexiep ly zelirt ìd zipekz lk ,qb
migwel ep`y jk ici lr zeipekz zxiga ly ihxcphqd deehnd z` miaigxn ep` 5 wxta
xegal ìd ihxcphqd deehna zeipekz zxiga ly dxhnd .zeipekzd cnina dllkd mb oeayga
dnbec ly xywda wx dpecip dllkde ,zeipekz ly dpezp dveaw jezn zeipekz ly dveaw-zz
zeipekz eli` zexiyi èvnle zeqpl mewna ,df wxta .dycg zipekz ly xywda `le ,dycg
zipekz lky migipn ep` jk myl .zeaeh zeipekz ly mipiit`nd odn cenll miqpn ep` ,xzei zeaeh
oeni`d zveawd ynzydl ozip zrk .zeipekz-dhn mi`xew ep` mdl ,mipiit`n sqe` ici lr zbvein
df ietina ynzydl fè ,zipekzd ly dzeki`l zipekz ly zeipekz-dhnn ietin cenll zpn lr
zepexzi dyely mpyi .oeni`d zveawa drited `ly dycg zipekz ly dzeki` z` zefgl icka
dcinl ziirak zeipekzd zxiga ziira z` bivdl zxyt`n ìd ,oey`x xac .ef dyibl miifkxn
zxiga ly llekd jildzl xzei miaeh dllkd inqg zzl zxyt`n ef zelkzqd .zihxcphq
ytgl milbeqnd zeipekz zxiga inzixebl` gztl epl zxyt`n ef dyib ,ipy xac .beize zeipekz
ef dyib ,seqal .mevr èd wxtd lr zecnery zeipekzd xtqn xy`k mb zeliria zeaeh zeipekz
rci zxard ly xywda dycgd epzyib z` myiil ozip cvik mb mi`xn ep` .dirad zpadl znxez
dieyr zeaeh zeipekz ly mipiit`nd zxardy mi`xn ep` .( inductive transfer ) dniynl dniynn
z` minibcn ep` .onvr zeaehd zeipekzd odin rcid zxard xy`n xzei zeaeh ze`vezl ìadl
.ci azka zeaezkd zextq iedif ziira lr mipeyd miyeniyl zeipekz-dhna yeniyd

Navot PHD

Uploaded by

Copyright:

Available Formats

Navot PHD

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Navot PHD

Uploaded by

Copyright:

Available Formats

On the Role of Feature Selection in

Thesis submitted in partial fulllment of the degree of

Submitted to the Senate of the Hebrew University

lead to much better performance.

the fundamental issue of data representation.

even if computational constraints or measuring costs are not an issue.

algorithms as well as to prove a PAC (Probably Approximately Correct) style generalization

demonstrated on various datasets.

suggest a novel way of exploring neural data.

assume that each feature is represented by a set of properties, referred to as meta-features.

the presence of a huge number of features. Finally, it contributes to a better understanding

in the dierent applications on a handwritten digit recognition problem.

1.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.3 Machine Learning, Articial Intelligence and Statistics . . . . . . . . 7

1.1.4 Machine Learning and Neural Computation . . . . . . . . . . . . . . . 8

1.1.5 Other Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2.1 Common paradigms for feature selection . . . . . . . . . . . . . . . . . 16

1.2.2 The Biological/Neuroscience Rationale . . . . . . . . . . . . . . . . . 21

2 Is Feature Selection Still Necessary? 24

2.2.1 Observations on The Optimal Number of Features . . . . . . . . . . . 28

2.3 Specic Choices of µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.4 SVM Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3 Margin Based Feature Selection 38

3.3 Margins for 1-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4 Margin Based Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5.1 Greedy Feature Flip Algorithm (G-ip) . . . . . . . . . . . . . . . . . 46

3.5.2 Iterative Search Margin Based Algorithm (Simba) . . . . . . . . . . . 47

3.5.3 Comparison to Relief . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.5.4 Comparison to R2W2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.6 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.7 Empirical Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.7.1 The Xor Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.7.2 Face Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.7.4 Face Images with Support Vector Machines . . . . . . . . . . . . . . . 59

3.7.5 The NIPS-03 Feature Selection Challenge . . . . . . . . . . . . . . . . 62

3.8 Relation to Learning Vector Quantization . . . . . . . . . . . . . . . . . . . . 63

3.9 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

A Complementary Proofs for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . 66

4 Feature Selection For Regression 69

4.2 The Feature Selection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.3 Testing on synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.4 Hand Movement Reconstruction from Neural Activity . . . . . . . . . . . . . 76

5 Learning to Select Features 81

5.2 Predicting the Quality of Features . . . . . . . . . . . . . . . . . . . . . . . . 84

5.3 Guided Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.3.1 Meta-features Based Search . . . . . . . . . . . . . . . . . . . . . . . . 87

5.3.2 Illustration on Digit Recognition Task . . . . . . . . . . . . . . . . . . 88

5.4 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.4.1 Generalization Bounds for Mufasa Algorithm . . . . . . . . . . . . . . 92

5.4.2 VC-dimension of Joint Feature Selection and Classication . . . . . . 96

5.5 Inductive Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.5.1 Demonstration on Handwritten Digit Recognition . . . . . . . . . . . 100

5.6 Choosing Good Meta-features . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.7 Improving Selection of Training Features . . . . . . . . . . . . . . . . . . . . . 104

5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

A Notation Table for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

List of Publications 111

in which research in feature selection is related to biology in section 1.2.2.

Thesis submitted in partial fulllment of the degree of

in the dierent applications on a handwritten digit recognition problem.

1.1.3 Machine Learning, Articial Intelligence and Statistics . . . . . . . . 7

2.3 Specic Choices of µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5.1 Greedy Feature Flip Algorithm (G-ip) . . . . . . . . . . . . . . . . . 46

5.4.2 VC-dimension of Joint Feature Selection and Classication . . . . . . 96

on classication (except in chapter 4).

Thus the generalization ability of a classication rule h is measured by its generalization

the alternative provable algorithms. One possible algorithmic approach to (classication)

In terms of applications machine learning tries to adapt algorithms to a specic task.

knowledge) in order to nd a good representation automatically.

of supervised learning, dimensionality reduction algorithms try to nd a small number of

1.1.3 Machine Learning, Articial Intelligence and Statistics

AI in general tends to prompt profound philosophical issues, as a research eld in computer

dierences, analogies and conversions between batch and online analysis.

4. Problem understanding. Another benet of feature selection is that the identity of

information on the evaluation function in order to perform an ecient search.