
On the Role of Feature Selection in

Machine Learning

Thesis submitted in partial fulfillment of the degree of


Doctor of Philosophy
by

Amir Navot

Submitted to the Senate of the Hebrew University

December 2006

This work was carried out under the supervision of Prof. Naftali Tishby.
Acknowledgments

Many people helped me in many ways over the course of my Ph.D. studies and I would like to take this opportunity to thank them all. A certain number of people deserve special thanks and I would like to express my gratitude to them with a few words here. The first is my supervisor Naftali (Tali) Tishby, who taught me a great deal and always supported me, even when I put forward extremely wild and ill-thought-out ideas. Tali played a major part in shaping my scientific point of view. The second person is Ran Gilad-Bachrach, who was both like a second supervisor and a good friend. He guided my first steps as a Ph.D. student, and was always ready to share ideas and help with any issue. Our many discussions on everything were the best part of my Ph.D. studies. My other roommates, Amir Globerson and Gal Chechik, deserve thanks for all the help and the inspiring discussions. Lavi Shpigelman, a good friend and a great classmate, was always ready to help on everything from a worthy scientific question to a very technical problem with my computer. Aharon Bar-Hillel challenged me with tough questions but also helped me find the answers. My brother, Yiftah Navot, has been my longstanding mentor and is always there to help me with any mathematical problem. There is no way I can express how much I value his assistance. My mother always made it very clear that studies are the most important thing. Of course, I thank my wife, Noa, without whose endless support I never would have finished (or probably never even started) my Ph.D. studies. Finally, I would also like to thank Esther Singer for her efforts to improve my English, and all the administrative staff of both the ICNC and CS who were always kind to me and always willing to lend a hand. I also thank the Horowitz foundation for the generous funding they have provided me.

Abstract

This thesis discusses different aspects of feature selection in machine learning, and more specifically for supervised learning. In machine learning the learner (the machine) uses a training set of examples in order to build a model of the world that enables reliable predictions. In supervised learning each training example is an (instance, label) pair, and the learner's goal is to be able to predict the label of a new unseen instance with only a small chance of erring. Many algorithms have been suggested for this task, and they work fairly well on many problems; however, their degree of success depends on the way the instances are represented. Most learning algorithms assume that each instance is represented by a vector of real numbers. Typically, each number is a result of a measurement on the instance (e.g. the gray level of a given pixel of an image). Each measurement is a feature. Thus a key question in machine learning is how to represent the instances by a vector of numbers (features) in a way that enables good learning performance. One of the requirements of a good representation is conciseness, since a representation that uses too many features raises major computational difficulties and may lead to poor prediction accuracy. However, in many supervised learning tasks the input is originally represented by a very large number of features. In this scenario it might be possible to find a smaller subset of features that can lead to much better performance.

Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Feature selection reduces the computational complexity of learning and prediction algorithms and saves on the cost of measuring non-selected features. In many situations, feature selection can also enhance the prediction accuracy by improving the signal-to-noise ratio. Another benefit of feature selection is that the identity of the selected features can provide insights into the nature of the problem at hand. Therefore feature selection is an important step in efficient learning of large multi-featured data sets. On a more general level, the feature selection research field clearly feeds into research on the fundamental issue of data representation.

In chapter 2 we discuss the necessity of feature selection. We raise the question of whether a separate stage of feature selection is indeed still needed, or whether modern classifiers can overcome the presence of a huge number of features. In order to answer this question we present a new analysis of the simple two-Gaussians classification problem. We first consider maximum likelihood estimation as the underlying classification rule. We analyze its error as a function of the number of features and the number of training instances, and show that while the error may be as poor as chance when using too many features, it approaches the optimal error if we choose the number of features wisely. We also explicitly find the optimal number of features as a function of the training set size for a few specific examples. Then, we test SVM [14] empirically in this setting and show that its performance matches the predictions of the analysis. This suggests that feature selection is still a crucial component in designing an accurate classifier, even when modern discriminative classifiers are used, and even if computational constraints or measuring costs are not an issue.

In chapter 3 we suggest new methods of feature selection for classification which are based on the maximum margin principle. A margin [14, 100] is a geometric measure for evaluating the confidence of a classifier with respect to its decision. Margins already play a crucial role in current machine learning research. For instance, SVM [14] is a prominent large margin algorithm. The novelty of the results presented in this chapter lies in the use of large margin principles for feature selection. The use of margins allows us to devise new feature selection algorithms as well as to prove a PAC (Probably Approximately Correct) style generalization bound. The bound is on the generalization accuracy of 1-NN on a selected set of features, and guarantees good performance for any feature selection scheme which selects a small set of features while keeping the margin large. On the algorithmic side, we use a margin based criterion to measure the quality of sets of features. We present two new feature selection algorithms, G-flip and Simba, based on this criterion. The merits of these algorithms are demonstrated on various datasets.



In chapter 4 we discuss feature selection for regression (a.k.a. function estimation). Once again we use the Nearest Neighbor algorithm and an evaluation function which is similar in nature to the one used for classification in chapter 3. This way we develop a non-linear, simple, yet effective feature subset selection method for regression and use it in analyzing cortical neural activity. This algorithm is able to capture complex dependency of the target function upon its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm with synthetic problems and use it in the context of predicting hand velocity from spikes recorded in the motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality, and we suggest a novel way of exploring neural data.

Finally, chapter 5 extends the standard framework of feature selection to consider generalization along the feature axis. The goal of standard feature selection is to select a subset of features from a given set of features. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features.

This approach has three main advantages. First, the selection problem can be considered as a standard learning problem in itself. This novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification. Second, it allows us to devise selection algorithms that can efficiently explore for new good features in the presence of a huge number of features. Finally, it contributes to a better understanding of the problem. We also show how this concept can be applied in the context of inductive transfer. We show that transferring the properties of good features between tasks might be better than transferring the good features themselves. We illustrate the use of meta-features in the different applications on a handwritten digit recognition problem.


Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Introduction 1
1.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.1.3 Machine Learning, Artificial Intelligence and Statistics . . . . . . . . . 7

1.1.4 Machine Learning and Neural Computation . . . . . . . . . . . . . . . 8

1.1.5 Other Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2.1 Common paradigms for feature selection . . . . . . . . . . . . . . . . . 16

1.2.2 The Biological/Neuroscience Rationale . . . . . . . . . . . . . . . . . 21

1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2 Is Feature Selection Still Necessary? 24


2.1 Problem Setting and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.1 Observations on The Optimal Number of Features . . . . . . . . . . . 28

2.3 Specic Choices of µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.4 SVM Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


3 Margin Based Feature Selection 38


3.1 Nearest Neighbor classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2 Margins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.3 Margins for 1-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4 Margin Based Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.5.1 Greedy Feature Flip Algorithm (G-flip) . . . . . . . . . . . . . . . . . 46

3.5.2 Iterative Search Margin Based Algorithm (Simba) . . . . . . . . . . . 47

3.5.3 Comparison to Relief . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.5.4 Comparison to R2W2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.6 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.7 Empirical Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.7.1 The Xor Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.7.2 Face Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.7.3 Reuters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.7.4 Face Images with Support Vector Machines . . . . . . . . . . . . . . . 59

3.7.5 The NIPS-03 Feature Selection Challenge . . . . . . . . . . . . . . . . 62

3.8 Relation to Learning Vector Quantization . . . . . . . . . . . . . . . . . . . . 63

3.9 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

A Complementary Proofs for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . 66

4 Feature Selection For Regression 69


4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 The Feature Selection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.3 Testing on synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.4 Hand Movement Reconstruction from Neural Activity . . . . . . . . . . . . . 76

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5 Learning to Select Features 81


5.1 Formal Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.2 Predicting the Quality of Features . . . . . . . . . . . . . . . . . . . . . . . . 84



5.3 Guided Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.3.1 Meta-features Based Search . . . . . . . . . . . . . . . . . . . . . . . . 87

5.3.2 Illustration on Digit Recognition Task . . . . . . . . . . . . . . . . . . 88

5.4 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.4.1 Generalization Bounds for Mufasa Algorithm . . . . . . . . . . . . . . 92

5.4.2 VC-dimension of Joint Feature Selection and Classification . . . . . . 96

5.5 Inductive Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.5.1 Demonstration on Handwritten Digit Recognition . . . . . . . . . . . 100

5.6 Choosing Good Meta-features . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.7 Improving Selection of Training Features . . . . . . . . . . . . . . . . . . . . . 104

5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

A Notation Table for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6 Epilog 108

List of Publications 111

Bibliography 113

Summary in Hebrew I
Chapter 1

Introduction

This introductory chapter provides the context and background for the results discussed in the following chapters and defines some crucial notation. The chapter begins with a brief review of machine learning, which is the general context for the work described in this thesis. I explain the goals of machine learning, present the main learning models currently used in the field, and discuss its relationships to other related scientific fields. Then, in section 1.2, I review the field of feature selection, which is the subfield of machine learning that constitutes the core of this thesis. I outline the rationale for feature selection and the different paradigms that are used, and survey some of the most important known algorithms. I show the ways in which research in feature selection is related to biology in section 1.2.2.

1.1 Machine Learning


Machine learning deals with the theoretic, algorithmic and applicative aspects of learning from examples. In a nutshell, learning from examples means that we try to build a machine (i.e. a computer program) that can learn to perform a task by observing examples. Typically, the program uses the training examples to build a model of the world that enables reliable predictions. This contrasts with a program that makes predictions using a set of pre-defined rules (the classical Artificial Intelligence (AI) approach). Thus in machine learning the machine must learn from its own experience, and in that sense it adheres to the very old - but still wise - proverb of our Sages of Blessed Memory (Hazal):

אין חכם כבעל נסיון

which translates to: experience is the best teacher. This proverb implies that a learner needs to acquire experience on his or her own in order to achieve a good level of performance and not be satisfied with the explanations given by the teacher (pre-defined rules in our case). I elaborate on the advantages of this approach in section 1.1.3.

1.1.1 Supervised Learning


While there are a few different learning frameworks in machine learning (see section 1.1.5), we focus here on supervised learning. In supervised learning we get a labeled sample as input and use it to predict the label of a new unseen instance. More formally, we have a training set $S_m = \{(x^i, y^i)\}_{i=1}^{m}$, where $x^i \in \mathbb{R}^N$ and $y^i = c(x^i)$, and c is an unknown, but fixed, function. The task is to find a mapping h from $\mathbb{R}^N$ to the label set with a small chance of erring on a new unseen instance $x \in \mathbb{R}^N$ that was drawn according to the same probability function as the training instances. c is referred to as the target concept and the N coordinates are called features. An important special case is when c has only a finite number of categorical values. In this case we have a classification problem. In most of this dissertation we focus on classification (except in chapter 4).

The conventional assessment of the quality of a learning process is the generalization ability of the learned rule, i.e. how well it can predict the value of the target concept on a new unseen instance. We focus on inductive batch learning (see section 1.1.5 for a short review of other common learning models), and specifically on the Probably Approximately Correct (PAC) learning model [118, 13], where it is assumed that the training instances are drawn i.i.d. according to a fixed (but unknown) distribution D. Thus the generalization ability of a classification rule h is measured by its generalization error, which is the probability of erring on a new unseen instance that was drawn according to the same distribution D. Formally it is defined as follows:

$$e_D(h) = \Pr_{x \sim D}\left(h(x) \neq c(x)\right)$$

However, we usually cannot measure this quantity directly, and thus we look at the training error instead.

The training error of a classifier h with respect to a training set S of size m is the percentage of the training instances that h errs on; formally, it is defined as follows:

$$e_S(h) = \frac{1}{m} \sum_{i} \left(1 - \delta_{h(x^i),\, c(x^i)}\right)$$

where $x^i$ is the i-th training instance and $\delta$ is Kronecker's delta.
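For concreteness, a minimal Python sketch of these two quantities might look as follows; the fixed rule h, the synthetic data and all names are purely illustrative, and the large held-out sample only estimates $e_D(h)$, which cannot be computed exactly:

    import numpy as np

    def error_rate(h, X, y):
        """Fraction of instances on which h errs,
        i.e. (1/m) * sum_i (1 - delta_{h(x^i), c(x^i)})."""
        return np.mean(h(X) != y)

    # An arbitrary fixed classification rule: h(x) = sign(sum of the features).
    h = lambda X: np.sign(X.sum(axis=1))

    rng = np.random.default_rng(0)

    def draw_sample(m):
        """Draw m labeled instances from a fixed (noisy) distribution D."""
        X = rng.normal(size=(m, 5)) + 0.3
        y = np.sign(X.sum(axis=1) + rng.normal(scale=2.0, size=m))
        return X, y

    X_train, y_train = draw_sample(100)
    X_fresh, y_fresh = draw_sample(10000)   # large fresh sample from the same D

    print("training error e_S(h):      ", error_rate(h, X_train, y_train))
    print("held-out estimate of e_D(h):", error_rate(h, X_fresh, y_fresh))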

From the theoretic point of view, machine learning attempts to characterize what is learnable, i.e. under which conditions a small training error guarantees a small generalization error. Such a characterization can give us insights into the theoretic limitations of learning that any learner must obey. It is clear that if h is allowed to be any possible function, there is no way to bound the gap between the generalization error and the training error. Thus we have to introduce some restrictive conditions. The common assumption is that we choose our hypothesis from a given class of hypotheses H. The classical result here is the characterization of the learnable hypothesis classes in the PAC model [118, 13]. Loosely speaking, a hypothesis class is PAC-learnable if there is an algorithm that ensures that the gap between the generalization error and the training error is arbitrarily small when the number of training instances is large enough. The characterization theorem states that a class is learnable if and only if it has a finite VC-dimension. The VC-dimension is a combinatorial property which measures the complexity of the class. The bound is tighter when the VC-dimension is smaller, i.e. when the hypothesis class is simpler. Thus, this result is a demonstration of the principle of Occam's razor: lex parsimoniae (the law of succinctness):

entia non sunt multiplicanda praeter necessitatem,

which translates to:

entities should not be multiplied beyond necessity,

and is often rephrased as: the simplest explanation is the best one.
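Loosely speaking, for a hypothesis class H with VC-dimension d, bounds of this kind hold uniformly over all h in H with probability at least $1-\delta$ over the draw of the m training instances; one commonly quoted form (the exact constants vary between statements and are not important here) is

$$e_D(h) \;\leq\; e_S(h) + \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}$$

so the gap between the generalization error and the training error shrinks as m grows and grows with the complexity d of the class.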

Another kind of bound on the generalization error of a classifier is the data dependent bound [103, 9]. Whereas standard bounds depend only on the size of the training set, data dependent bounds take advantage of the fact that some training sets are better than others. That is, the better the training set, the better the bounds it gives on the generalization error. This way, the data give bounds which are tighter than the standard bounds. A main component in data dependent bounds is the concept of margins. Generally speaking, a margin [14, 100] is a geometric measure for evaluating the confidence of a classifier with respect to its decision. We elaborate on the definition and usage of margins in chapter 3. We make extensive use of data dependent bounds in our theoretical results in chapter 3 and chapter 5.

Algorithmically speaking, machine learning tries to develop algorithms that can find a good rule (an approximation of the target concept) using the given training examples. Ideally, it is preferable to find algorithms for which it is possible to prove certain guarantees on the running time and the accuracy of the resulting rule. However, heuristic learning algorithms that just work are also abundant in the field, and sometimes in practice they work better than the alternative provable algorithms. One possible algorithmic approach to (classification) supervised learning is to build a probabilistic model for each class (using standard statistical tools) and then classify a new instance into the class with the highest likelihood. However, since many different probabilistic models imply the same decision boundary, it can be easier to learn the decision boundary directly (the discriminative approach). Two such classic algorithms are the Perceptron [98] and the One Nearest Neighbor (1-NN) [32], which were introduced half a century ago and are still popular. The Perceptron directly finds a hyper-plane that correctly separates the training instances by going over the instances iteratively and updating the hyper-plane direction whenever the current plane errs. 1-NN, on the other hand, simply stores the training instances and then assigns a new instance the label of the closest training instance.
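A minimal Python sketch of these two classic algorithms might look as follows (the function names are illustrative, the labels are assumed to be in {-1, +1}, and no attempt is made at efficiency):

    import numpy as np

    def one_nn_predict(X_train, y_train, X_new):
        """1-NN: assign each new instance the label of its closest
        training instance (Euclidean distance)."""
        preds = []
        for x in X_new:
            dists = np.linalg.norm(X_train - x, axis=1)
            preds.append(y_train[np.argmin(dists)])
        return np.array(preds)

    def perceptron_train(X, y, epochs=10):
        """Perceptron: go over the instances iteratively and update the
        hyper-plane whenever the current plane errs on an instance."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (x_i @ w + b) <= 0:    # current hyper-plane errs
                    w, b = w + y_i * x_i, b + y_i
        return w, b                              # predict with sign(X @ w + b)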

The most important algorithm for supervised learning in current machine learning is the Support Vector Machine (SVM) [14, 116], which finds the hyper-plane that separates the training instances with the largest possible margin, i.e., the separating plane which is as far from the training instances as possible. By an implicit mapping of the input vectors to a high dimensional Hilbert space (using a kernel function), SVM provides a systematic tool for finding non-linear separations using linear tools. Another prominent algorithm for supervised classification is AdaBoost [33], which uses a weak learner to build a strong classifier. It builds a set of classifiers by re-running the weak classifier on the same training set, but each time putting more weight on instances where the previous classifiers erred. Many other algorithms for supervised learning exist and I only mention some of the most important ones.
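As an illustration, both algorithms are available in standard libraries; a short sketch using scikit-learn (assuming it is installed; the synthetic data and parameter values are arbitrary choices, not recommendations) might look as follows:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=20,
                               n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # SVM with an RBF kernel: the implicit mapping to a Hilbert space lets a
    # linear separator in that space act as a non-linear separator in R^N.
    svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

    # AdaBoost: re-weights the training instances and combines weak learners
    # (decision stumps by default) into a strong classifier.
    ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    print("SVM test accuracy:     ", svm.score(X_te, y_te))
    print("AdaBoost test accuracy:", ada.score(X_te, y_te))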

In terms of applications, machine learning tries to adapt algorithms to a specific task. Learning algorithms are widely used for many tasks both in industry and in academia, e.g., face recognition, text classification, medical diagnosis and credit card fraud detection, just to name a few. Despite the fact that many learning algorithms are general and can be applied to any domain, out-of-the-box learning algorithms do not generally work very well on given real-world problems. Aside from the required tuning, each domain has its own specific difficulties and further effort may be required to find the right representation of instances (see section 1.1.2).

1.1.2 Representation

A key question in machine learning is how to represent the instances. In most learning models it is assumed that the instances are given as vectors in $\mathbb{R}^N$ (where N is any finite dimension), and the analysis starts from there. There is a general consensus that once we have a good representation, most reasonable learning methods will perform well after a reasonable tuning effort. On the other hand, if we choose a poor representation, achieving a good level of performance is hopeless. But how do we choose the best way to represent an abstract object (e.g. an image) by a vector of numbers? A good representation should be compact and meaningful at the same time. Is there a general method to find such a representation? Choosing a representation means choosing a set of features to be measured on each instance. In real life, this set of features is usually chosen by a human expert in the relevant domain who has a good intuition of what might work. The question is whether it is possible to find algorithms that use the training sample (and possibly other external knowledge) in order to find a good representation automatically.

If the instances are physical entities (e.g. human patients), choosing the features means choosing which physical measurements to perform on them. In other cases the instances are given as vectors of numbers in the first place (e.g. the gray levels of the pixels of a digital image), and then the task of finding a good representation (i.e. a good set of features) is the task of finding a transformation that converts the original representation into a better one. This can be done without using the labels, in an unsupervised manner (see section 1.1.5), or using the labels. If the instances are originally described by a large set of raw features, one way to tackle this is by using dimensionality reduction. In dimensionality reduction we look for a small set of functions (of the raw features) that capture the relevant information. In the context of supervised learning, dimensionality reduction algorithms try to find a small number of functions that preserve the information on the labels.

Feature selection is a special form of dimensionality reduction, where we limit ourselves to choosing only a subset out of the given set of raw features. While this might seem to be a strong limitation, feature selection and general dimensionality reduction are not that different, considering that we can always first generate many possible functions of the raw features (e.g. many kinds of filters and local descriptors of an image) and then use feature selection to choose only some of them. This process of generating complex features by applying functions to the raw features is called feature extraction. Thus, in other words, using feature extraction followed by feature selection we can get general dimensionality reduction. Nevertheless, in feature extraction we have to select which features to generate out of the huge (or even infinite) number of possible functions of the raw features. We tackle this issue in chapter 5. As will be explained in more detail in section 1.2, the main advantages of feature selection over other dimensionality reduction methods are interpretability and economy (as it saves the cost of measuring the non-selected features).

The holy grail is to find a representation which is concise and at the same time allows classification by a simple rule, since a simple classifier over a low dimensional representation generalizes well. However, this is not always possible, and therefore there is a tradeoff between the conciseness of the representation and the complexity of the classifier. Different methods may choose different working points on this tradeoff. While dimensionality reduction methods focus on conciseness, SVM and other kernel machines (see section 1.1.1 for a short description of SVM) convert the data into a very sparse representation in order to allow a very simple (namely linear) classification rule, and control overfitting by maximizing the margin. However, in chapter 2 we suggest that the ability of SVM to avoid overfitting on high dimensional data is limited to scenarios where the data lies on a low dimensional manifold.



Another related issue in data representation is the tradeoff between the conciseness of the representation and its potential prediction accuracy. One principled way of quantifying this tradeoff, known as the Information Bottleneck [111], is to measure both the complexity of the model and its prediction accuracy using Shannon's mutual information, i.e. measuring both complexity and accuracy in bits. During my Ph.D. studies I also took a major part in research on the relation between the Information Bottleneck framework and some classical problems in Information Theory (IT), such as the version of source coding with side information presented by Wyner, Ahlswede and Korner (WAK) [121, 3], Rate-Distortion and Cost-Capacity [102]. In this research we took advantage of the similarities to obtain new results and insights both on the IB and on the classical IT problems. These results are not included in this thesis due to space limitations. They were published in [39] and are under review for publication in IEEE-IT.

1.1.3 Machine Learning, Artificial Intelligence and Statistics

Machine learning can also be considered a broad sub-field of Artificial Intelligence (AI). AI refers to the ability of an artificial entity (usually a computer) to exhibit intelligence. While AI in general tends to prompt profound philosophical issues, as a research field in computer science it is confined to dealing with the ability of a computer to perform a task that is usually done by humans and is considered a task that requires intelligence. A classical AI approach tackles such a task by using a set of logical rules (which were designed by a human expert) that constitute a flow chart. Such systems are usually referred to as Expert Systems, Rule-Based Systems or Knowledge-Based Systems [51]. The first such systems were developed during the 1960s and 1970s and became very popular and commercially applied during the 1980s [76]. On the other hand, as already mentioned, in machine learning the computer program tries to derive the rules by itself using a set of input-output pairs (a.k.a. the training set). Thus machine learning focuses on the learning process.


For example, take the task of text classification. Here the task is to classify free-language texts into one or more classes from a predefined set of classes. Each class can represent a topic, an author or any other property. A rule-based system for such a task usually consists of pre-defined rules that query the existence of some words or phrases in the text to be classified. The rules are extracted by humans, i.e., the programmer together with an expert in the relevant field who knows how to predict which phrases characterize each class. There are several main drawbacks to this approach. First, the task of defining the set of rules requires a great deal of expert human work and becomes virtually impossible for large systems. Moreover, even if we already have such a system that actually works, it is very hard to maintain it. Imagine that we want to add a new class to a system with a thousand rules. It is very hard to predict the effect of changing even one rule. This is particularly crucial because the environment changes over time, e.g., new classes appear and old classes disappear, and the language changes (who used the word tsunami before December 26, 2004?), and it is very awkward to adapt such a system to such changes. The machine learning approach overcomes the above drawbacks since the system is built automatically from a labeled training sample and the rule can be adapted automatically with any feedback (i.e., an additional labeled instance). However, machine learning raises other problems, such as the interpretability of the decision and the need for a large number of labeled instances. These issues will be discussed in more detail in the following.

Machine learning is also closely related to statistics. Both fields try to estimate an unknown function from a finite sample; thus there is a large overlap between them. Indeed, many statisticians claim that some of the results that are considered new in machine learning are well known in the statistics community. However, while the formal framework might be similar, the applicative point of view is very different. While classical statistics focuses on hypothesis testing and density estimation, machine learning focuses on how to create computer programs that can learn to perform an intelligent task. One additional difference is that machine learning, as a sub-field of computer science, is also concerned with the algorithmic complexity of computational implementations.

1.1.4 Machine Learning and Neural Computation

The brain of any animal, and especially the human brain, is the best known learner. Over the course of a lifetime the brain has to learn how to perform a huge number of new, complicated tasks and has to solve computational problems continuously and instantaneously. For example, the brain has to resolve the visual input that comes through the visual system and produce the correct scene. This involves such problems as segmentation (which pixel belongs to which object), detection of objects and estimation of objects' movements. Correct resolving is crucial for the creature's survival. There are several different approaches to studying the brain as a computational machine. One is to look inside the brain and try to see how it works (Physiology); another is to try to build an abstract model of the brain (regardless of the specific hardware) that fits the observations and can predict the brain's behavior (Cognitive Psychology). The common sense behind using machine learning for brain research is the belief that if we are able to build a machine that can deal with the same computational problems the brain has to solve, it will teach us something about how the brain works. As regards theory, machine learning can provide the limitations on the learning process that any learner, including the brain, must obey.

1.1.5 Other Learning Models


In life, different learning tasks can be very different from each other in many respects, including the kind of input/feedback we get (instances alone or instances with some hints), the way we are tested (during the learning process or only at the end, the number of allowed mistakes) and the level of control we have over the learning process (can we affect the world or just observe it?). Thus, many different models have been suggested for formalizing learning. As already explained, this work focuses on the passive-inductive-batch-supervised learning that was presented in section 1.1.1. However, we refer to some other models as well. For this reason, and in order to give a more complete picture of the machine learning field, I briefly overview the other main models and point out the differences between them.

Supervised vs. Unsupervised Learning

As already mentioned, in supervised learning the task is to learn an unknown concept c from examples, where each instance x comes together with the value of the target function for this instance, c(x). Thus the task is to find an approximation of c, and performance is measured by the quality of the approximation for points that did not appear in the training set. On the other hand, in unsupervised learning the training sample includes only instances, without any labels, and the task is to find interesting structures in the data. Typical approaches to unsupervised learning include clustering, building probabilistic generative models and finding meaningful transformations of the data. Clustering algorithms try to cluster the instances into a few clusters such that instances inside the same cluster are similar and instances in two different clusters are different. A classical clustering algorithm is k-means [77], which clusters instances according to the Euclidean distance. It starts with random locations of the k cluster centers and then iteratively assigns instances to the cluster of the closest center, and updates the centers' locations to the center of gravity of the assigned instances. Clustering is also one of the more obvious applications of the Information Bottleneck [111], which was already mentioned in section 1.1.2. The Information Bottleneck is an information theoretical approach for extracting the relevant information in one variable with respect to another variable, and thus can be used for finding a clustering of one variable that preserves the maximum information on the other variable. Generative model methods assume that the data were drawn from a distribution of a certain form (e.g. a mixture of Gaussians) and look for the parameters that maximize the likelihood of the data. The prominent algorithm here is the Expectation Maximization (EM) [26] family of algorithms. Principal Component Analysis (PCA) [56] is a classic example of an algorithm that looks for an interesting transformation of the data. PCA finds the linear transformation into a given dimension that preserves the maximal possible variance. Other prominent algorithms of this type are Multidimensional Scaling (MDS) [67], Projection Pursuit [35, 50, 36, 57], Independent Component Analysis (ICA) [11] and the newer Local Linear Embedding (LLE) [99].
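To make the k-means procedure described above concrete, a minimal Python sketch might look as follows (the random initialization and the stopping criterion are illustrative choices, not the only possible ones):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain k-means: start from random centers, then alternate between
        assigning each instance to its closest center and moving each center
        to the center of gravity of its assigned instances."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: closest center by Euclidean distance.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each center to the mean of its cluster
            # (keep the old center if its cluster is empty).
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels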

Inductive vs. Transductive Learning
Supervised learning can be divided into inductive learning and transductive learning [117]. In inductive learning we want our classifier to perform well on any instance that was drawn from the same distribution as the training set. On the other hand, in transductive learning it is assumed that the test instances are known at training time (only the instances, not their labels of course), and thus we want to find a classifier that performs well on these predefined test instances alone. These two models correspond to different real life tasks. The difference can be illustrated by the task of text classification. Assume that we want to build a system that will classify emails that arrive in the future using a training set of classified emails we received in the past. This fits the inductive learning model, since we do not have the future emails at the training stage. On the other hand, if we have an archive of a million documents, where only one thousand of them are classified, and we want to use this subset to build a classifier for classifying the rest of the archive, this fits the transductive model. In other words, transductive learning fits situations where we have the test questions while we are studying for the exam.

Batch vs. Online Learning


We can also divide supervised learning into batch learning and online learning [72]. In batch learning it is assumed that the learning phase is separate from the testing phase, i.e., we first get a batch of labeled instances, we use them to learn a model (e.g. a classifier) and then we use this model for making predictions on new instances. When analyzing batch models it is usually assumed that the instances were drawn i.i.d. from an underlying distribution and that the test instances will be drawn from the same distribution. The accuracy of a model is defined as the expected accuracy with respect to this distribution. However, this model does not fit many real life problems, where we have to learn on the go and do not have a sterile learning phase, i.e., we must be able to make predictions from the very beginning and we pay for errors in the learning phase as well. Online learning assumes that we get the instances one by one. We first get the instance without a label and have to make a prediction. Then we get the real label, and suffer a loss if our prediction was wrong. Now we have the chance to update our model before getting another instance, and so on. In analyzing online algorithms, performance is measured by the cumulative loss (i.e., the sum of losses we have suffered so far), and the target is to achieve a loss which is not much more than the loss made by the best model (out of the class of models we work with). Thus, in online analysis, the performance measure is relative, and it is not necessary to make any assumptions on the way the instances were produced. See [25] for a discussion of the differences, analogies and conversions between batch and online analysis.

Passive vs. Active Learning


In passive learning, the learner can only observe the instances given by the teacher. On the other hand, in active learning models the learner has the opportunity to guide the learning process in some way. Different active learning models differ in the way the learner can affect the learning process (i.e. the instances s/he will get). For example, in the membership query model [6] the learner can generate instances by him or herself and ask the teacher for their labels. This may be problematic in some domains, as the learner can generate meaningless instances [68]. Another example is the selective sampling model [21]. Here the learner observes the instances given by the teacher and decides which of them s/he wants to get the label for. This model is most useful when it is cheap to get instances, but expensive to label them. It can be shown that under some conditions active learning can exponentially reduce the number of labels required to ensure a given level of accuracy, compared to passive learning [34]. See [38] for an overview of active learning.

Reinforcement Learning
In many real life situations things are not so simple. The game is not just to predict some label. Sometimes you are required to choose from a set of many actions, and you may be rewarded or penalized for your action, and the action may affect the environment. The reinforcement learning model (see e.g. [59]) tries to capture the above observations and to present a more complete model. It is assumed that the world has a state. At each step the learner (a.k.a. the agent in this context) takes an action which is chosen out of a set of possible actions. The action may affect the world state. In each step the agent gets a reward which depends on the world state and the chosen action. The agent's goal is to maximize the cumulative reward or the discounted reward, where the discounted reward puts more emphasis on the reward that will be given in the near future. The world's current state may be known to the agent (the fully observed model) or hidden (the partially observed model). When the world's state is hidden, the agent can only observe some noisy measurement of it. Under some Markovian assumptions it is possible to find a good strategy efficiently. The reinforcement learning model is popular in the context of robotics, where people try to create a robot that can behave reasonably in a changing environment.

Inductive Transfer

Another observation about human learning is that a human does not learn isolated tasks, but rather learns many tasks in parallel or sequentially. Inductive transfer [10, 110, 16] is a learning model that captures this kind of characteristic of human learning. Thus, in inductive transfer (a.k.a. learning to learn, transfer learning or task2task) one tries to use knowledge gained in previous tasks in order to enhance performance on the current task. In this context, the main question is what kind of knowledge we can transfer between tasks. This is not trivial, as it is not clear how we can use, for example, images which are labeled cat or not cat in order to learn to classify images as table or not table. One option, which is very popular, is to share the knowledge about the representation, i.e. to assume that the same kind of features or distance measure is good for all the classes. For example, [110] uses the previous tasks to learn a distance measure between instances. This is done by constructing a training set of pairs of instances, where a pair is labeled 1 if and only if we can be sure that the two instances have the same label (i.e. they are from the same task and have a positive label for the class of interest in this task). This makes sense, as the goal is to find a distance measure for which instances of the same class are close and instances of different classes are well apart. They use a neural network to learn the distance measure from the pairs' training set. Another option, which interests us, is to share the knowledge of which features are useful. We elaborate on this in chapter 5, where we show how our new approach to feature selection can be applied to inductive transfer. Several authors have noted that sometimes transferring knowledge can hurt performance on the target problem. Two problems are considered related if the transfer improves the performance and unrelated otherwise. However, it is clear that this notion is not well defined, as the behavior depends on the kind of information we choose to transfer, or more generally on the learning algorithm [16].

1.2 Feature Selection


In many supervised learning tasks the input is represented by a very large number of features, many of which are not needed for predicting the labels. Feature selection (variously known as subset selection, attribute selection or variable selection) is the task of choosing a small subset of features that is sufficient to predict the target labels well. The four main reasons to use feature selection are:

1. Reduced computational complexity. Feature selection reduces the computational complexity of learning and prediction algorithms. Many popular learning algorithms become computationally intractable in the presence of huge numbers of features, both in the training step and in the prediction step. A preceding step of feature selection can solve the problem.

2. Economy. Feature selection saves on the cost of measuring non-selected features. Once we have found a small set of features that allows good prediction of the labels, we do not have to measure the rest of the features any more. Thus, in the prediction stage we only have to measure a few features for each instance. Imagine that we want to predict whether a patient has a specific disease using the results of medical checks. There are a huge number of possible medical checks that might be predictive; let's say that there are 1000 potential checks and that each of them costs ten dollars to perform. If we can find a subset of only 10 features that allows good performance, it saves a lot of money, and may turn the whole procedure from infeasible into feasible.

3. Improved accuracy. In many situations, feature selection can also enhance prediction accuracy by improving the signal-to-noise ratio. Even state-of-the-art learning algorithms cannot overcome the presence of a huge number of irrelevant or weakly relevant features. On the other hand, once a small set of good features has been found, even very simple learning algorithms may yield good performance. Thus, in such situations, an initial step of feature selection may improve accuracy dramatically.

4. Problem understanding. Another benefit of feature selection is that the identity of the selected features can provide insights into the nature of the problem at hand. This is significant since in many cases the ability to point out the most informative features is more important than the ability to make a good prediction in itself. Imagine we are trying to predict whether a person has a specific type of cancer using gene expression data. While we can know whether the individual is sick or not in other ways, the identity of the genes which are informative for prediction may give us a clue to the disease mechanism involved, and help in developing drugs.

Thus feature selection is an important step in efficient learning of large multi-featured data sets. Regarding reasons 2 and 4 above, feature selection has an advantage over other general dimensionality reduction methods. Computing general functions of all the input features means that we must always measure all the features first, even if in the end we calculate only a few functions of them. In many problems, different features vary in terms of their nature and their units (e.g. body temperature in Celsius, yearly income in dollars). In these cases feature selection is the natural formulation, and it enables a better interpretation of the results, as the meaning of a combination of such features is not very clear.

Many works define the task of feature selection as detecting which features are relevant and which are irrelevant. In this context we need to define relevancy. This is not straightforward, because the effect of a feature depends on which other features we have selected. Almuallim and Dietterich [4] provide a definition of relevance for the binary noise-free case. They define a feature to be relevant if it appears in any logical formula that describes the target concept. Gennari et al. [37] suggest a probabilistic definition of relevancy, which defines a feature to be relevant if it affects the conditional distribution of the labels. John et al. [55, 62] define the notion of strong and weak relevance of a feature. Roughly speaking, a strongly relevant feature is a useful feature that cannot be replaced by any other feature (or set of features), whereas a weakly relevant feature is a feature which is useful, but can be replaced by another feature (or set of features). They also show that, in general, relevance does not imply optimality and that optimality does not imply relevance. Moreover, as we show in chapter 2, even if a feature is relevant and useful, it may detract from performance in the presence of other features. Thus, when the concern is prediction accuracy, the best definition of the task of feature selection is choosing a small subset of features that allows for good prediction of the target labels. When the goal is problem understanding (reason 4 above), the relevancy of each feature alone might be important. However, in real life problems features are rarely completely irrelevant, and thus it might be better to inquire about the importance of each feature. Cohen et al. [20] suggested the Shapley value of a feature as a measure of its importance. The Shapley value is a quantity taken from game theory, where it is used to measure the importance of a player in a cooperative game.

Below I review the main feature selection paradigms and some of the immense number of selection algorithms that have been presented in the past. However, a complete review of feature selection methods is beyond the scope of this thesis. See [45] or [43] for a comprehensive overview of feature selection methodologies. For a review of (linear) feature selection for regression see [82].

1.2.1 Common paradigms for feature selection

Many different algorithms for the task of feature selection have been suggested over the last few decades, both in the Statistics and in the Learning community. Different algorithms present different conceptual frameworks. However, in the most common selection paradigm an evaluation function is used to assign scores to subsets of features and a search algorithm is used to search for a subset with a high score. See figure 1.1 for an illustration of this paradigm. Different selection methods can differ both in the choice of evaluation function and in the search method. The evaluation function can be based on the performance of a specific predictor (wrapper model, [62]) or on some general (typically cheaper to compute) relevance measure of the features to the prediction (filter model, [62]). The evaluation function is not necessarily a black box, and in many cases the search method can use information on the evaluation function in order to perform an efficient search.

Figure 1.1: The most common selection paradigm: an evaluation function is used to evaluate the quality of subsets of features and a search engine is used for finding a subset with a high score.

In most common wrappers, the quality of a given set of features is evaluated by testing the predictor's performance on a validation set. The main advantage of a wrapper is that you optimize what really interests you: the predictor's accuracy. The main drawback of such methods is their computational inefficiency, which limits the number of sets that can be evaluated. The computational complexity of wrappers is high since we have to re-train the predictor in each step. Common filters use quantities like the conditional variance (of the features given the labels), correlation coefficients or mutual information as a general measure. In any case (wrapper or filter), an exhaustive search over all feature sets is generally intractable due to the exponentially large number of possible sets. Therefore, search methods are employed which apply a variety of heuristics. Two classic search methods are [78]:

1. Forward selection: start with an empty set of features and greedily add features one at a time. In each step, the feature that produces the largest increase of the evaluation function (with respect to the value of the current set) is added.

2. Backward elimination: start with the set that contains all the features and greedily remove features one at a time. In each step, the feature whose removal results in the largest increase (or smallest decrease) in the evaluation function value is removed.

Backward elimination has the advantage that when it evaluates the contribution of a feature it takes into consideration all the other potential features. On the other hand, in forward selection, a feature that was added at one point can become useless later on, and vice versa. However, since evaluating small sets of features is usually faster than evaluating large sets, forward selection is much faster when we are looking for a small number of features to select. Moreover, if the initial number of features is very large, backward elimination becomes infeasible. A combination of the two is also possible of course, and has been used in many works. Many other search strategies, such as stochastic hill climbing, random permutation [87] and genetic algorithms [48], can be used as well.
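For concreteness, a minimal Python sketch of greedy forward selection with a wrapper-style evaluation function might look as follows; the choice of cross-validated 1-NN accuracy (via scikit-learn) as the evaluation function and all names are illustrative assumptions only:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate(X, y, feature_subset):
        """Wrapper-style evaluation: cross-validated accuracy of a predictor
        (here 1-NN) restricted to the candidate subset of features."""
        if not feature_subset:
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=1)
        return cross_val_score(clf, X[:, sorted(feature_subset)], y, cv=5).mean()

    def forward_selection(X, y, n_select):
        """Greedy forward selection: start from the empty set and repeatedly
        add the single feature that most increases the evaluation function."""
        selected, remaining = set(), set(range(X.shape[1]))
        while len(selected) < n_select and remaining:
            scores = {f: evaluate(X, y, selected | {f}) for f in remaining}
            best = max(scores, key=scores.get)
            selected.add(best)
            remaining.remove(best)
        return sorted(selected)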

I turn now to review a few examples of some of the more famous feature selection algorithms. Almuallim and Dietterich [4] developed the FOCUS family of algorithms, which performs an exhaustive search in the situation where both the features and labels are binary (or at least have a small number of possible values). All feature subsets of increasing size are evaluated, until a sufficient set is found. A set is sufficient if there are no conflicts, i.e. if there is no pair of instances with the same feature values but different labels. They also presented heuristic versions of FOCUS that look only at promising subsets, and thus make it feasible when the number of features is larger. Kira and Rendell [61] presented the RELIEF algorithm, a distance based filter that uses a 1-Nearest Neighbor (1-NN) based evaluation function (see chapter 3 for more details). Koller and Sahami [64] proposed a selection algorithm that uses a Markov blanket [89]. Their algorithm uses backward elimination, where a feature is removed if it is possible to find a Markov blanket for it, or in other words, if it is possible to find a set of features such that the feature is independent of the labels given this set. Pfahringer [91] presented a selection method which is based on the Minimum Description Length principle [97]. Vafaie and De Jong [115] used the average Euclidean distance between instances in different classes as an evaluation function and genetic algorithms [48] as a search method.

The above algorithms are all filters. The following are examples of wrappers. Aha and Bankert [1] used a wrapper approach for instance-based learning algorithms (instance-based algorithms [2] are extensions of 1-Nearest Neighbor [32, 22]) combined with a beam search strategy, using a kind of backward elimination. Instead of starting with an empty or a full set, they select a fixed number of subsets (with a bias toward small sets) and start with the best one among them. Then they apply backward elimination while maintaining a queue of features which is ordered according to the contribution of each feature the last time it was evaluated. Skalak [106] used a wrapper for the same kind of classifiers, but with random permutation instead of deterministic search. Caruana and Freitag [17] developed a wrapper for decision trees [93, 94] with a search which is based on adding and removing features randomly, but also removing in each step all the features which were not included in the induced tree. Bala et al. [8] and Cherkauer and Shavlik [19] used a wrapper for decision trees together with genetic algorithms as search strategies. Weston et al. [120] developed a sophisticated wrapper for SVMs. Their algorithm minimizes a bound on the error of SVM and uses gradient descent to speed up the search, but it still has to solve the SVM quadratic optimization problem in each step, and in this sense it is a wrapper.

Individual feature ranking

Other feature selection methods simply rank individual features by assigning a score to each feature independently. These methods are usually very fast, but inevitably fail in situations where only a combined set of features is predictive of the target function. Two of the most common rankers are listed below (a code sketch of both follows the list):

1. Infogain [93], which assigns to each feature the mutual information between the feature value and the label, considering both the feature and the label as random variables and using any method to estimate their joint probability. Formally, let P be an estimate of the joint probability for a feature f, i.e., P_ij is the probability that f equals its i-th possible value and the label is j. Then the infogain of the feature f is

   IG(f) = \sum_{i,j} P_{ij} \log \frac{P_{ij}}{\left(\sum_{i'} P_{i'j}\right)\left(\sum_{j'} P_{ij'}\right)}

   The most common way to estimate P is by simply counting the fraction of the instances that exhibit each value pair (the empirical distribution), with some kind of zero correction (e.g., adding a small constant to each count and re-normalizing).

2. Correlation Coefficients (corrcoef), which assign to each feature the Pearson correlation between the vector of values the feature takes in the training set (v) and the vector of labels (y) [92], i.e.,

   CC(f) = \frac{E\left[(v - E(v))(y - E(y))\right]}{\sqrt{Var(v)\,Var(y)}}

   where the expectation and the variance are, of course, estimated empirically.
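The following is a minimal sketch of the two rankers for discrete feature values and numeric labels. The function names and the small zero-correction constant are illustrative choices and are not part of the original text.

import numpy as np

def infogain(feature_values, labels, eps=1e-3):
    """Mutual information between a discrete feature and the label,
    estimated from the empirical joint distribution with a small zero correction."""
    f_vals, f_idx = np.unique(feature_values, return_inverse=True)
    y_vals, y_idx = np.unique(labels, return_inverse=True)
    counts = np.zeros((len(f_vals), len(y_vals)))
    np.add.at(counts, (f_idx, y_idx), 1)
    P = (counts + eps) / (counts + eps).sum()        # joint distribution estimate
    Pf = P.sum(axis=1, keepdims=True)                # marginal over feature values
    Py = P.sum(axis=0, keepdims=True)                # marginal over labels
    return float((P * np.log2(P / (Pf * Py))).sum())

def corrcoef_score(feature_values, labels):
    """Pearson correlation between the feature values and the (numeric) labels."""
    v = np.asarray(feature_values, dtype=float)
    y = np.asarray(labels, dtype=float)
    return float(np.cov(v, y, bias=True)[0, 1] / np.sqrt(v.var() * y.var()))

# Ranking: score every column of a data matrix X and sort in decreasing order, e.g.
# ranking = np.argsort([-infogain(X[:, j], y) for j in range(X.shape[1])])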

Corrcoef has the advantage (over infogain) that it can be used with continuous values (of the features or the label) without any need for quantization. On the other hand, infogain has the advantage that it does not assume any geometric meaning of the values, and thus can work for any kind of features and labels, even if they are not numerical.

Many other individual feature rankers have been proposed. Among them are methods which use direct calculations on the true-positive, true-negative, false-negative and false-positive rates obtained using the given feature. Other methods use different kinds of distance measures between distributions to calculate the distance between the empirical joint distribution and the one expected under independence of the feature and the label. Infogain is an example of such a method, with the Kullback-Leibler divergence (D_KL) as the distance measure. Other methods use statistical tests, including the chi-square test ([73, 74]), the t-test ([107]) or the ratio of between-class variance to within-class variance, known as the Fisher criterion ([29]). See [45], chapter 3, for a detailed overview of individual feature ranking methods.

Embedded approach

The term embedded methods is usually used to describe selection which is done automatically by the learning algorithm. Note that almost any wrapper can be considered an embedded method, because selection and learning are interleaved. Decision tree [93, 94] learning can also be considered an embedded method, as the construction of the tree and the selection of the features are interleaved, but the selection of the feature in each step is usually done by a simple filter or ranker. Other authors use the term embedded method to refer only to methods where the selection is done by the learning algorithm implicitly. A prominent example of such a method is the L1-SVM [86].
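As an illustration of implicit, embedded selection, the following sketch fits an L1-regularized linear SVM using scikit-learn (an assumed, illustrative library choice, not referenced in the text) and reads off the features that receive non-zero weights. The toy data and the value of C are arbitrary.

import numpy as np
from sklearn.svm import LinearSVC

# Toy data: 100 instances, 50 features, only the first 5 carry label information.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = np.sign(X[:, :5].sum(axis=1))

# The L1 penalty drives most weights to exactly zero, selecting features implicitly.
clf = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(clf.coef_).ravel() > 1e-8)
print("selected features:", selected)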



1.2.2 The Biological/Neuroscience Rationale

Feature selection is carried out by many biological systems. A sensory system is exposed to an infinite number of possible features, but has to select only some of them. For example, the number of potential details in a visual scene is infinite and only some (albeit a large number) of them are perceived, even at the low level of the retina and V1. Moreover, many sensory systems perform feature selection as part of the processing of the data. For example, higher levels in the visual pathway have to choose from among the huge number of features that the receptive fields in V1 produce. Ullman et al. [114] state that the problem of feature selection is a fundamental question in the study of visual processing. They show that intermediate complexity (IC) features are the optimal choice for classification. Serre et al. [101] show that selecting object class-specific features improves object recognition ability in a model of biological vision. This suggests that feature selection in biological systems is not something that is done only once, but rather is a dynamic process which is a crucial part of the never-ending learning process of a living creature.

Another example is the selection of relevant spectral cues in the auditory system. When a sound is perceived by the ear, the level of modulation of each frequency depends on the direction the sound came from. These changes are called spectral cues, and the human auditory system uses them to determine the elevation of the sound source [47]. Feature selection occurs in the olfactory pathway as well. As described by Mori et al. [83], the mammalian olfactory sensory neurons detect many different odor molecules and send information (via their axons) to the olfactory bulb (OB). Each glomerulus in the OB is connected to sensory neurons of the same type and can be considered as representing a feature. Lateral inhibition among glomerular modules is a kind of feature selection mechanism. Perera et al. [90] developed a feature selection method which is inspired by the olfactory system. This is an example of something that happens frequently, where researchers adopt ideas from biological systems when developing computerized algorithms.



Feature Selection in the service of biological research

Nowadays biological experiments produce huge amounts of data that cannot be analyzed without the aid of computerized tools. Feature selection is a useful tool for such data analysis. For example, assume that we measure the activity of many neurons during a given task and then produce a set of features that represents many different kinds of measurements of the neurons' activity. We can then use feature selection to find which kind of neuron and which kind of activity characteristic are most informative for predicting the behavior. We demonstrate such a use of feature selection in chapter 4, where we apply feature selection to data recorded from the motor cortex of a behaving monkey.

1.3 Notation

Although specific notation is introduced in the chapters as needed, the following general notation is used throughout this thesis. Vectors in R^N are denoted by boldface small letters (e.g. x, w). Scalars are denoted by small letters (e.g. x, y). The i-th element of a vector x is denoted by x_i. log is the base 2 logarithm and ln is the natural logarithm. m is the number of training instances. N denotes the number of all potential features, while n denotes the number of selected features. Subsets of features are denoted by F and a specific feature is usually denoted by f. A labeled training set of size m is denoted by S_m (or by S when m is clear from the context). The table below summarizes the main notation used throughout the thesis for quick reference.



Notation       Short description
m              number of training instances
N              total number of features
n              number of selected features
F              a set of features
f              a feature
w              a weight vector over the features
S_m            training set of size m
c              the target classification rule
h              a hypothesis
D              a distribution over the instance (and label) space
êr_S^γ(h)      γ-sensitive training error of hypothesis h
er_D(h)        generalization error of hypothesis h w.r.t. D

Chapter 2

Is Feature Selection Still Necessary?¹

As someone who has researched feature selection for the last several years, I have often heard people doubt its relevance with the following argument: a good learning algorithm should know how to handle irrelevant or redundant features correctly, and thus an artificial split of the learning process into two stages imposes excessive limits and can only impede performance. There are several good rebuttals to this claim. First, as explained in section 1.2, improved classification accuracy is not the only rationale for feature selection. Feature selection is also motivated by improved computational complexity, economy and problem understanding. Second, research on feature selection is part of the broader issue of data representation. However, to provide a complete answer, in this chapter we directly discuss the issue of whether there is a need to do feature selection solely for improving classification accuracy (ignoring other considerations), even when using current state-of-the-art learning algorithms.

Obviously, if the true statistical model is known, or if the sample is unlimited (and the number of features is finite), any additional feature can only improve accuracy. However, when the training set is finite, additional features can degrade the performance of many

¹ The results presented in this chapter were first presented as a chapter in the book Subspace, Latent Structure and Feature Selection, edited by Saunders, C., Grobelnik, M., Gunn, S. and Shawe-Taylor, J. [84]


classifiers, even when all the features are statistically independent and carry information on the label. This phenomenon is sometimes called the peaking phenomenon and was already demonstrated more than three decades ago by [54, 113, 96] and in other works (see references there) on the classification problem of two Gaussian-distributed classes with equal covariance matrices (LDA). Recently, [49] analyzed this phenomenon for the case where the covariance matrices are different (QDA); however, this analysis is limited to the case where all the features have equal contributions. On the other hand, [53] showed that, in the Bayesian setting, the optimal Bayes classifier can only benefit from using additional features. However, using the optimal Bayes classifier is usually not practical, due to its computational cost and the fact that the true prior over the classifiers is not known. In their discussion, [53] raised the problem of designing classification algorithms which are computationally efficient and robust with respect to the feature space. Now, three decades later, it is worth inquiring whether today's state-of-the-art classifiers, such as the Support Vector Machine (SVM) [14], achieve this goal.

Here we re-visit the two-Gaussian classification problem, and concentrate on a simple setting of two spherical Gaussians. We present a new, simple analysis of the optimal number of features as a function of the training set size. We consider maximum likelihood estimation as the underlying classification rule. We analyze its error as a function of the number of features and the number of training instances, and show that while the error may be as bad as chance when using too many features, it approaches the optimal error if we choose the number of features wisely. We also explicitly find the optimal number of features as a function of the training set size for a few specific examples. We test SVM empirically in this setting and show that its performance matches the predictions of the analysis. This suggests that feature selection is still a crucial component in designing an accurate classifier, even when modern discriminative classifiers are used, and even if computational constraints or measuring costs are not an issue.


2.1 Problem Setting and Notation

First, let us introduce some notation. Vectors in R^N are denoted by boldface lower case letters (e.g. x, µ) and the j-th coordinate of a vector x is denoted by x_j. We denote the restriction of a vector x to the first n coordinates by x_n.

Assume that we have two classes in R^N, labeled +1 and −1. The distribution of the points in the positive class is Normal(µ, Σ = I) and the distribution of the points in the negative class is Normal(−µ, Σ = I), where µ ∈ R^N and I is the N × N identity matrix. To simplify notation we assume, without loss of generality, that the coordinates of µ are ordered in descending order of their absolute value and that µ_1 ≠ 0; thus if we choose to use only n < N features, the best choice is the first n coordinates. The optimal classifier, i.e., the one that achieves the maximal accuracy, is h(x) = sign(µ · x). If we are restricted to using only the first n features, the optimal classifier is h(x, n) = sign(µ_n · x_n).

We assume that the model is known to the learner (i.e. that there are two antipodal spherical Gaussian classes), but the model parameter, µ, is not known. Thus, in order to analyze this setting we have to consider a specific way to estimate µ from a training sample S_m = {(x^i, y^i)}_{i=1}^m, where x^i ∈ R^N and y^i ∈ {+1, −1} is the label of x^i. We consider the maximum likelihood estimator of µ:

\hat{\mu} = \hat{\mu}(S_m) = \frac{1}{m} \sum_{i=1}^{m} y^i x^i

Thus the estimated classifier is ĥ(x) = sign(µ̂ · x). For a given µ̂ and number of features n, we look at the generalization error of this classifier:

error(\hat{\mu}, n) = P\left(sign(\hat{\mu}_n \cdot x_n) \neq y\right)

where y is the true label of x. This error depends on the training set. Thus, for a given training set size m, we are interested in the average error over all possible choices of a sample of size m:

error(m, n) = E_{S_m}\, error(\hat{\mu}(S_m), n) \qquad (2.1)

We look for the optimal number of features, i.e. the value of n that minimizes this error:

n_{opt} = \arg\min_n\, error(m, n)
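The following is a minimal simulation sketch of this setting (the choice of µ and the sample sizes are illustrative assumptions): it draws a training sample, forms the maximum likelihood estimate µ̂, and computes the generalization error of sign(µ̂_n · x_n) in closed form using the Gaussian CDF, as in the analysis of the next section.

import numpy as np
from scipy.stats import norm

def ml_error(mu, m, n, rng):
    """Generalization error of sign(mu_hat_n . x_n) for one random training set."""
    N = len(mu)
    y = rng.choice([-1.0, 1.0], size=m)
    X = y[:, None] * mu[None, :] + rng.normal(size=(m, N))   # x ~ Normal(y*mu, I)
    mu_hat = (y[:, None] * X).mean(axis=0)                    # ML estimate of mu
    num = mu_hat[:n] @ mu[:n]                                 # E_x(mu_hat_n . x_n)
    den = np.sqrt(mu_hat[:n] @ mu_hat[:n])                    # sqrt(V_x(mu_hat_n . x_n))
    return norm.cdf(-num / den)

rng = np.random.default_rng(0)
mu = 1.0 / np.sqrt(np.arange(1, 101))      # e.g. mu_j = 1/sqrt(j), one of the choices studied below
for n in (1, 5, 20, 100):
    errs = [ml_error(mu, m=16, n=n, rng=rng) for _ in range(200)]
    print(f"n = {n:3d}: average error ~ {np.mean(errs):.3f}")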

2.2 Analysis

For a given µ̂ and n, the dot product µ̂_n · x_n is itself a Normal random variable, and therefore the generalization error can be written explicitly as (using the symmetry of this setting):

error(\hat{\mu}, n) = P\left(\hat{\mu}_n \cdot x_n < 0 \mid +1\right) = \Phi\left(-\frac{E_x(\hat{\mu}_n \cdot x_n)}{\sqrt{V_x(\hat{\mu}_n \cdot x_n)}}\right) \qquad (2.2)

where Φ is the Gaussian cumulative density function, \Phi(a) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-\frac{1}{2}z^2}\, dz. We denote by E_x and V_x the expectation and variance with respect to the true distribution of x.

For a given number of features n and a given sample S, we have

E_x(\hat{\mu}_n \cdot x_n) = \hat{\mu}_n \cdot \mu_n = \sum_{j=1}^{n} \hat{\mu}_j \mu_j

and

V_x(\hat{\mu}_n \cdot x_n) = \sum_{j=1}^{n} \hat{\mu}_j^2 V_x(x_j) = \sum_{j=1}^{n} \hat{\mu}_j^2

Substituting in equation (2.2) we get:

error(\hat{\mu}, n) = \Phi\left(-\frac{\sum_{j=1}^{n} \hat{\mu}_j \mu_j}{\sqrt{\sum_{j=1}^{n} \hat{\mu}_j^2}}\right) \qquad (2.3)

Now, for a given training set size m, we want to find the n that minimizes the average error term E_{S_m}(error(µ̂, n)); instead, we look for the n that minimizes an approximation of the average error:

n_{opt} = \arg\min_n \Phi\left(-\frac{E_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j \mu_j\right)}{\sqrt{E_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j^2\right)}}\right) \qquad (2.4)

We first have to justify why the above term approximates the average error. We look at the variance of the relevant terms (the numerator and the term under the square root in the denominator). µ̂_j is a Normal random variable with expectation µ_j and variance 1/m, thus

V_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j \mu_j\right) = \sum_{j=1}^{n} \mu_j^2 V_{S_m}(\hat{\mu}_j) = \frac{1}{m} \sum_{j=1}^{n} \mu_j^2 \xrightarrow{m \to \infty} 0 \qquad (2.5)

V_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j^2\right) = \frac{2n}{m^2} + \frac{4}{m} \sum_{j=1}^{n} \mu_j^2 \xrightarrow{m \to \infty} 0 \qquad (2.6)
where in the last equality we used the fact that if Z ∼ Normal(µ, σ²), then V(Z²) = 2σ⁴ + 4σ²µ². We also note that:

E_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j^2\right) = \sum_{j=1}^{n} \left( V_{S_m}(\hat{\mu}_j) + E(\hat{\mu}_j)^2 \right) = \sum_{j=1}^{n} \left( \mu_j^2 + \frac{1}{m} \right) \xrightarrow{m \to \infty} \sum_{j=1}^{n} \mu_j^2 > 0

therefore the denominator is not zero for any value of m (including the limit m → ∞).

Combining this with (2.5) and (2.6), and recalling that the derivative of Φ is bounded by 1, we conclude that, at least for a large enough m, it is a good approximation to move the expectation inside the error term. However, we need to be careful when taking m to ∞, as we know that the problem becomes trivial when the model is known, i.e., when n is finite and m → ∞. Additionally, in our analysis we also want to take the limit n → ∞, and thus we should be careful with the order in which the limits are taken. In the next section we see that the case of bounded Σ_{j=1}^n µ_j² is of special interest. In this situation, for any ε > 0, there are finite m₀ and n₀ such that the gap between the true error and the approximation is less than ε with high probability for any (m, n) that satisfies m > m₀ and n > n₀ (including the limit n → ∞). Thus, for any m > m₀, the analysis of the optimal number of features n_opt (under the constraint n > n₀) as a function of m using the approximation is valid. When we are interested in showing that too many features can hurt the performance, the constraint n > n₀ is not problematic. Moreover, figure 2.1 shows numerically that for various choices of µ (including choices such that Σ_{j=1}^n µ_j² → ∞ as n → ∞), moving the expectation inside the error term is indeed justified, even for small values of m.

Now we turn to finding the n that minimizes (2.4) as a function of the training set size m. This is equivalent to finding the n that maximizes

f(n, m) = \frac{E_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j \mu_j\right)}{\sqrt{E_{S_m}\left(\sum_{j=1}^{n} \hat{\mu}_j^2\right)}} = \frac{\sum_{j=1}^{n} \mu_j^2}{\sqrt{\sum_{j=1}^{n} \left(\mu_j^2 + \frac{1}{m}\right)}} = \frac{\sum_{j=1}^{n} \mu_j^2}{\sqrt{\frac{n}{m} + \sum_{j=1}^{n} \mu_j^2}} \qquad (2.7)

2.2.1 Observations on the Optimal Number of Features

First, we can see that, for any finite n, when m → ∞, f(n, m) reduces to \sqrt{\sum_{j=1}^{n} \mu_j^2}, and thus, as expected, using all the features maximizes it. It is also clear that adding a completely non-informative feature (µ_j = 0) will decrease f(n, m).

Figure 2.1: Numerical justification for using equation (2.4). The true and approximate error ("E inside", expectation moved inside the error term) as a function of the number of features used, for different choices of µ: (a) µ_j = 1/√(2^j), (b) µ_j = 1/j, (c) µ_j = 1/√j, (d) µ_j = 1, (e) µ_j = rand (sorted) and (f) µ_j = rand/j (sorted). The training set size is m = 16. The true error was estimated by averaging over 200 repeats. The approximation with the expectation inside the error term is very close to the actual error term (2.1), even for a small training set (m = 16).

We can also formulate a sufficient condition for the situation where using too many features is harmful, and thus feature selection can improve the accuracy dramatically:

Statement 1 For the above setting, if the partial sum series s_n = \sum_{j=1}^{n} \mu_j^2 is bounded (s_n < ∞), then for any finite m the error of the ML classifier approaches 1/2 as n → ∞, and there is n₀ = n₀(m) < ∞ such that selecting the first n₀ features is superior to selecting k features for any k > n₀.

Proof. Denote lim_{n→∞} s_n = s < ∞. Then the numerator of (2.7) approaches s while the denominator approaches ∞, thus f(n, m) → 0 as n → ∞ and the error Φ(−f(n, m)) → 1/2. On the other hand, f(n, m) > 0 for any finite n, thus there exists n₀ such that f(n₀, m) > f(k, m) for any k > n₀.



Note that it is not possible to replace the condition in the above statement by µ_j → 0 as j → ∞ (see example 2.4). The following consistency statement gives a sufficient condition on the number of features that ensures asymptotically optimal error:

Statement 2 For the above setting, if we use a number of features n = n(m) that satisfies (1) n → ∞ as m → ∞ and (2) n/m → 0 as m → ∞, then the error of the ML estimator approaches the optimal possible error (i.e. the error when µ is known and we use all the features) when m → ∞. Additionally, if \sum_{j=1}^{n} \mu_j^2 → ∞ as n → ∞, condition (2) above can be replaced with n/m → c, where c is any finite constant.

Proof. Recalling that the optimal possible error is given by \Phi\left(-\sqrt{\sum_j \mu_j^2}\right), the statement follows directly from equation (2.7).

Corollary 2.1 Using the optimal number of features ensures consistency (in the sense of the above statement).

Note as well that the effect of adding a feature depends not only on its value, but also on the current values of the numerator and the denominator. In other words, the decision whether to add a feature depends on the properties of the features we have added so far. This may be surprising, as the features here are statistically independent. Intuitively, this apparent dependency comes from the signal-to-noise ratio of the new feature relative to the existing ones.

Another observation is that if all the features are equal, i.e. µ_j = c where c is a constant, then

f(n, m) = \frac{c^2 \sqrt{n}}{\sqrt{\frac{1}{m} + c^2}}

and thus using all the features is always optimal. In this respect our situation is different from the one analyzed by [54]. They considered Anderson's W classification statistic [5] for the setting of two Gaussians with the same covariance matrix, where both the mean and the covariance are unknown. For this setting they show that when all the features have equal contributions, it is optimal to use m − 1 features (where m is the number of training examples).

2.3 Specific Choices of µ

Now we find the optimal number of features for a few specific choices of µ.

Example 2.1 Let µ_j = 1/√(2^j), i.e., µ = (1/√2, 1/√4, 1/√8, ..., 1/√(2^N)), so that ‖µ‖ → 1 as n → ∞. An illustration of the density functions for this case is given in figure 2.2(a). Substituting this µ in (2.7), we obtain:

f(n, m) = \frac{1 - \frac{1}{2}2^{-n}}{\sqrt{\frac{n}{m} + 1 - \frac{1}{2}2^{-n}}}

Taking the derivative with respect to n and equating to zero we get²:

-\frac{1}{4}2^{-2n}\ln 2 + 2^{-n}\ln 2\left(1 + \frac{n}{m} + \frac{1}{2m}\right) - \frac{1}{m} = 0

Assuming n is large, we ignore the first term and get:

m = \frac{1}{\ln 2}2^{n} - n - \frac{1}{2\ln 2}

Ignoring the last two lower order terms, we have:

\frac{1}{\sqrt{m}} \cong \frac{\ln 2}{\sqrt{2^n}} = (\ln 2)\, \mu_n

This makes sense, as 1/√m is the standard deviation of the estimate µ̂_j, so the above equation says that we only want to take features whose mean is larger than this standard deviation. However, we should note that this is true only in order of magnitude: no matter how small the first feature is, it is worth taking it. Thus, we cannot hope to find an optimal criterion of the form: take feature j only if µ_j > f(m).

Example 2.2 Let µ_j = 1/j. An illustration of the density functions for this case is given in figure 2.2(b). Since \sum_{j=1}^{\infty} (1/j)^2 = \pi^2/6,

\sum_{j=1}^{n} \left(\frac{1}{j}\right)^2 > \frac{\pi^2}{6} - \int_{n}^{\infty} \frac{1}{x^2}\, dx = \frac{\pi^2}{6} - \frac{1}{n}

\sum_{j=1}^{n} \left(\frac{1}{j}\right)^2 < \frac{\pi^2}{6} - \int_{n+1}^{\infty} \frac{1}{x^2}\, dx = \frac{\pi^2}{6} - \frac{1}{n+1}

² The variable n is an integer, but we can consider f(n, m) to be defined for any real value.

thus the lower bound \frac{\pi^2}{6} - \frac{1}{n} approximates the finite sum up to \frac{1}{n} - \frac{1}{n+1} = \frac{1}{n(n+1)}. Substituting this lower bound in (2.7), we obtain:

f(n, m) \cong \frac{\frac{\pi^2}{6} - \frac{1}{n}}{\sqrt{\frac{n}{m} + \frac{\pi^2}{6} - \frac{1}{n}}}

Taking the derivative with respect to n and equating to zero yields m \cong \frac{n^3\pi^2 - 18n^2}{n\pi^2 - 6}, and thus for a large n we obtain the power law m \cong n^2, or n \cong \sqrt{m} = m^{1/2}, and again we have \frac{1}{\sqrt{m}} \cong \frac{1}{n} = \mu_n.
( )
Example 2.3 Let µ_j = 1/√j, i.e. µ = (1, 1/√2, ..., 1/√N). An illustration of the separation between the classes for a few choices of n is given in figure 2.2(c). Substituting \sum_{j=1}^{n} \mu_j^2 \cong \log n in (2.7) we get:

f(n, m) = \frac{\log n}{\sqrt{\frac{n}{m} + \log n}} \qquad (2.8)

Taking the derivative with respect to n and equating to zero, we obtain:

m \cong \frac{n(\log n - 2)}{\log n}

thus for a large n we have m \cong n, and once again we have \frac{1}{\sqrt{m}} \cong \frac{1}{\sqrt{n}} = \mu_n.

This example was already analyzed by Trunk [113], who showed that for any finite number of training examples the error approaches one half when the number of features approaches ∞. Here we get this result easily from equation (2.8): for any finite m, f(n, m) → 0 as n → ∞, thus the error term Φ(−f(n, m)) approaches 1/2. On the other hand, when µ is known the error approaches zero as n increases, since ‖µ‖ → ∞ while the variance is fixed. Thus, from corollary 2.1 we know that by using n = m the error approaches zero as m grows, and our experiments show that it drops very fast (for m ∼ 20 it is already below 5% and for m ∼ 300 below 1%).

Example 2.4 In this example we show that the property µ_j → 0 as j → ∞ does not guarantee that feature selection can improve classification accuracy. Define s_n = \sum_{j=1}^{n} \mu_j^2, and let µ_j be such that s_n = n^α with 1/2 < α < 1. Since α < 1, it follows that indeed µ_j → 0. On the other hand, by substituting in (2.7) we get:

f(n, m) = \frac{n^{\alpha}}{\sqrt{\frac{n}{m} + n^{\alpha}}} = \frac{n^{\alpha/2}}{\sqrt{\frac{n^{1-\alpha}}{m} + 1}}

Figure 2.2: Illustration of the separation between the classes for different numbers of features (n), for different choices of µ. The projection onto the prefix of µ of the density function of each class, and the combined density (in gray), are shown: (a) example 2.1 (n = 1, 3, →∞), (b) example 2.2 (n = 1, 7, →∞), (c) example 2.3 (n = 1, 30, 10⁶).

thus for n large enough that n^{1-\alpha} > m,

f(n, m) \geq \frac{n^{\alpha/2}}{\sqrt{\frac{2n^{1-\alpha}}{m}}} = \sqrt{\frac{m}{2}}\, n^{\alpha - 1/2}

and since α > 1/2, we have that f(n, m) → ∞ as n → ∞, and thus using all the features is optimal. We can see that the intuitive relation between 1/√m and the value of the smallest feature worth taking, which arose in the previous examples, does not hold here. This demonstrates that thinking of this value as proportional to 1/√m may be misleading.
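To connect the examples, the following sketch evaluates f(n, m) of equation (2.7) numerically and reports the n that maximizes it (i.e. n_opt) for the three choices of µ analyzed above; the grid size and the training set sizes are illustrative assumptions.

import numpy as np

def n_opt(mu, m):
    """The n that maximizes f(n, m) of equation (2.7); the error is Phi(-f(n, m))."""
    s = np.cumsum(mu ** 2)                  # partial sums of mu_j^2
    n = np.arange(1, len(mu) + 1)
    f = s / np.sqrt(n / m + s)
    return int(np.argmax(f)) + 1

N = 10_000
j = np.arange(1, N + 1)
examples = {
    "mu_j = 1/sqrt(2^j)": np.exp2(-0.5 * j),    # example 2.1
    "mu_j = 1/j":         1.0 / j,              # example 2.2
    "mu_j = 1/sqrt(j)":   1.0 / np.sqrt(j),     # example 2.3
}
for name, mu in examples.items():
    print(name, {m: n_opt(mu, m) for m in (16, 100, 225)})

In line with the analysis, n_opt should grow roughly logarithmically in m for the first choice, roughly like m^{1/2} for the second, and roughly linearly in m for the third.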

2.4 SVM Performance

So far we have seen that the naive maximum likelihood classifier is impeded by using too many weak features, even when all the features are relevant and independent. However, it is worth testing whether a modern and sophisticated classifier such as SVM, which was designed to work in very high dimensional spaces, can overcome the peaking phenomenon. For this purpose we tested SVM on the above two-Gaussian setting in the following way.

Figure 2.3: The error of SVM as a function of the number of features used. The top, middle and bottom rows correspond to µ_j equal to 1/√(2^j), 1/j and 1/√j respectively. The columns correspond to training set sizes of 16, 100 and 225. The SVM error was estimated by averaging over 200 repeats. The graphs show that SVM exhibits the same qualitative behavior as the ML classifier used in the analysis, and that pre-scaling of the features strengthens the effect of too many features on SVM.

We generated a training set with 1000 features and trained a linear³ SVM on this training set 1000 times, each time using a different number of features. We then calculated the generalization error of each returned classifier analytically. The performance associated with a given number of features n is the generalization error achieved using n features, averaged over 200 repeats. We used the SVM toolbox by Gavin Cawley [18]. The parameter C was tuned manually to C = 0.0001, the value which is favorable to SVM when all (1000) features are used. The results for the examples described in section 2.3, for three different choices of training set size, are presented in figure 2.3.
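A reduced version of this experiment can be reproduced along the following lines, using scikit-learn's LinearSVC instead of the Matlab toolbox used in the thesis (an assumed substitution) and evaluating the error of the learned weight vector analytically, as described above. The repeat count and the list of feature counts are scaled down for brevity.

import numpy as np
from scipy.stats import norm
from sklearn.svm import LinearSVC

def svm_error_curve(mu, m, feature_counts, repeats=20, C=1e-4, seed=0):
    """Average analytic generalization error of a linear SVM trained on the
    first n features, for each n in feature_counts."""
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(feature_counts))
    for _ in range(repeats):
        y = rng.choice([-1.0, 1.0], size=m)
        X = y[:, None] * mu[None, :] + rng.normal(size=(m, len(mu)))
        for k, n in enumerate(feature_counts):
            clf = LinearSVC(C=C, loss="hinge", fit_intercept=False,
                            max_iter=20000).fit(X[:, :n], y)
            w = clf.coef_.ravel()
            # error of sign(w.x) when x ~ Normal(+/- mu_n, I), by symmetry
            errors[k] += norm.cdf(-(w @ mu[:n]) / np.linalg.norm(w))
    return errors / repeats

mu = 1.0 / np.sqrt(np.arange(1, 1001))            # mu_j = 1/sqrt(j), example 2.3
counts = [1, 10, 50, 100, 500, 1000]
print(dict(zip(counts, np.round(svm_error_curve(mu, m=16, feature_counts=counts), 3))))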

We can see that in this setting SVM suffers from using too many features just like the

³ The best classifier here is linear, thus a linear kernel is expected to give the best results.

maximum likelihood classifier⁴. On the other hand, it is clear that in other situations SVM does handle a huge number of features well, otherwise it could not be used together with kernels. Therefore, in order to understand why SVM fails here, we need to determine in what way our high dimensional scenario is different from the one caused by using kernels.

The assumption which underlies the usage of the large margin principle, namely that the density around the true separator is low, is violated in our first example (example 2.1), but not in the other two examples (see figure 2.2). Moreover, if we multiply the µ of example 2.1 by 2, then the assumption holds and the qualitative behavior of SVM does not change. Hence this cannot be a major factor.

One might suggest that a simple pre-scaling of the features, such as dividing each feature by an approximation of its standard deviation⁵, might help SVM. Such normalization is useful in many problems, especially when different features are measured on different scales. However, in our specific setting it is not likely that such normalization can improve the accuracy, as it simply suppresses the more useful features. Indeed, our experiments show (see figure 2.3, dotted lines) that the pre-scaling strengthens the effect of too many features on SVM.

One significant difference is that in our setting the features are statistically independent, whereas when the dimension is high due to the usage of kernels, the features are highly correlated. In other words, the use of kernels is equivalent to a deterministic mapping of a low dimensional space to a high dimensional space. Thus there are many features, but the actual dimension of the embedded manifold is low, whereas in our setting the dimension is indeed high. We ran one initial experiment which supported the assumption that this difference is significant in causing the SVM behavior. We used SVM with a polynomial kernel of increasing degree in the above setting with µ = (1, 0), i.e. one relevant feature and one irrelevant feature. The results are shown in figure 2.4. The accuracy declines as the degree increases, as expected (since the best model here is linear). However, the effect of the kernel degree on the difference in accuracy using one or both features is not significant, despite the

⁴ This strengthens the observation in [46] (page 384) that additional noise features hurt SVM performance.
⁵ Given the class, the standard deviation is the same for all the features (equal to 1), but the overall standard deviation is larger for the first features, as in the first coordinates the means of the two classes are further apart.

Figure 2.4: The effect of the degree of a polynomial kernel on SVM error, using the first feature only versus using both features. Here µ = (1, 0) and the training set size is m = 20. The results were averaged over 200 repeats. The effect of the kernel degree on the difference in accuracy using one or both features is not significant.

fact that the number of irrelevant features grows exponentially with the kernel degree⁶.

2.5 Summary

We started this chapter by asking whether feature selection is still needed even when we are only interested in classification accuracy, or whether modern classifiers can handle the presence of a huge number of features well and only lose from the split of the learning process into two stages (feature selection and learning of a classifier). Using a simple setting of two spherical Gaussians, it was shown that in some situations using too many features results in an error as bad as chance, while a wise choice of the number of features results in an almost optimal error. We showed that the most prominent modern classifier, SVM, does not handle large numbers of weakly relevant features correctly and achieves suboptimal accuracy, much like a naive classifier. We suggest that the ability of SVM to work well in the presence of a huge number of features may be restricted to cases where the underlying distribution is concentrated around a low dimensional manifold, which is the case when kernels are used. However, this issue should be investigated further. Thus we conclude that feature selection is indeed needed, even if nothing but classification accuracy interests us.

This chapter focused on feature selection, but the fundamental question addressed here

⁶ When a polynomial kernel of degree k is used, only k features are relevant whereas 2^k − k are irrelevant.

is relevant to the broader question of general dimensionality reduction. In that setting one looks for any transformation of the data into a low dimensional subspace that preserves the relevant properties. Thus one should ask in what way the optimal dimension depends on the number of training instances. We expect that a similar trade-off can be found.


Chapter 3

Margin Based Feature Selection¹

¹ The results presented in this chapter were first presented in our NIPS02 paper titled Margin Analysis of the LVQ Algorithm [24], our ICML04 paper titled Margin Based Feature Selection [40] and a chapter titled Large Margin Principles for Feature Selection in the book Feature Extraction, Foundations and Applications, edited by Guyon, I., Gunn, S., Nikravesh, M. and Zadeh, L. [41]


After describing the main concepts in feature selection and the prime rationales (see section 1.2 and chapter 2), this chapter discusses the ways in which feature selection is carried out. New methods of feature selection for classification, based on the maximum margin principle, are introduced. A margin [14, 100] is a geometric measure for evaluating the confidence of a classifier with respect to its decision. Margins already play a crucial role in current machine learning research. For instance, SVM [14] is a prominent large margin algorithm. The main novelty of the results presented in this chapter is the use of large margin principles for feature selection. Along the way we also present the new observation that there are two types of margins (sample margin and hypothesis margin), and define these two types of margins for the One Nearest Neighbor (1-NN) [32] algorithm. We also develop a theoretical justification for the 20-year-old LVQ [63] algorithm (see section 3.8).

Throughout this chapter 1-NN is used as the study-case predictor, but most of the results are relevant to other distance based classifiers (e.g. LVQ [63], SVM-RBF [14]) as well. To demonstrate this, we compare our algorithms to the R2W2 algorithm [120], which was specifically designed as a feature selection scheme for SVM. We show that even in this setting our algorithms compare favorably to all the contenders in the running.

The use of margins allows us to devise new feature selection algorithms as well as prove a PAC (Probably Approximately Correct) style generalization bound. The bound is on the generalization accuracy of 1-NN on a selected set of features, and guarantees good performance for any feature selection scheme which selects a small set of features while keeping the margin large. On the algorithmic side, we use a margin based criterion to measure the quality of sets of features. We present two new feature selection algorithms, G-flip and Simba, based on this criterion. The merits of these algorithms are demonstrated on various datasets. Finally, we study the Relief feature selection algorithm [61] in the large margin context. While Relief does not explicitly maximize any evaluation function, it is shown here that implicitly it maximizes the margin based evaluation function.

Before we present the main results, we briefly review the key concepts in 1-NN and margins.

3.1 Nearest Neighbor classifiers

Though fifty years have passed since the introduction of One Nearest Neighbor (1-NN) [32], it is still a popular algorithm. 1-NN is a simple and intuitive algorithm, yet at the same time it may achieve state of the art results [105]. During the training process 1-NN simply stores the given training instances (also referred to as prototypes) without any processing. When it is required to classify a new instance, it returns the label of the closest instance in the training set. Thus, 1-NN assumes that there is a way to measure the distance between instances. If the instances are given as vectors in R^n, we can simply use a standard norm (e.g. the l2 norm). However, in many cases using more sophisticated distance measures that capture our knowledge of the problem may improve the results dramatically. It is also possible to learn a good distance measure from the training set itself. Many distance measures have been suggested for different problems in the past. One example of the usage of such a sophisticated distance measure is the tangent distance that was employed by [104] for digit recognition. This distance measure takes into account the prior knowledge that the digit identity is invariant to small transformations such as rotation, translation, etc. Using 1-NN together with this distance measure, [104] achieved state-of-the-art results on the USPS data set [69].

However, on large, high dimensional data sets 1-NN often becomes unworkable. One approach to cope with this computational problem is to approximate the nearest neighbor [52] using various techniques. An alternative approach is to choose a small data-set (aka prototypes) which represents the original training sample, and apply the nearest neighbor rule only with respect to this small data-set. This solution maintains the spirit of the original algorithm, while making it feasible. Moreover, it might improve the accuracy by reducing noise overfitting. We elaborate on these methods in section 3.8.

Another possible solution for noise overfitting is to use k-NN, which considers the k nearest neighbors and takes the majority (for classification) or the mean (for regression) over the labels of the k nearest neighbors. Here it is also possible to weigh the effect of each of the k neighbors according to its distance. 1-NN ensures an error which is no more than twice the optimal possible error in the limit of an infinite number of training instances

[22]. k-NN, when used with k = k(m) (where m is the number of training instances) that satisfies k → ∞ and k/m → 0 as m → ∞, ensures the optimal error in the limit m → ∞ [28]. In this sense k-NN is consistent. See [112] for a recent review of Nearest Neighbor methods and ways to approximate them.

In the following we will use 1-NN as the study-case predictor in developing our feature selection methods, and will show how feature selection can improve the performance of 1-NN.
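For concreteness, a minimal 1-NN classifier over a weighted Euclidean distance (the weighting will be used by the margin-based algorithms later in this chapter) can be written as follows; the function names are illustrative.

import numpy as np

def weighted_dist(a, b, w):
    """Weighted Euclidean distance ||a - b||_w = sqrt(sum_i w_i^2 (a_i - b_i)^2)."""
    return np.sqrt(np.sum((w ** 2) * (a - b) ** 2))

def one_nn_predict(x, prototypes, labels, w=None):
    """Return the label of the prototype closest to x."""
    if w is None:
        w = np.ones(len(x))
    dists = [weighted_dist(x, p, w) for p in prototypes]
    return labels[int(np.argmin(dists))]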

3.2 Margins

The concept of margins plays an important role in current machine learning research. A margin measures the confidence of a classifier when making its predictions. Margins are used both for theoretical generalization bounds and as guidelines for algorithm design. Theoretically speaking, margins are a main component in data dependent bounds [103, 9] on the generalization error of a classifier. Data dependent bounds use the properties of the specific training data to give bounds which are tighter than the standard bounds that hold for any training set and depend only on its size (and the classifier properties). Since in many cases the margins give a better prediction of the generalization error than the training error itself, they can be used by learning algorithms as a better measure of a classifier's quality. We can consider margins as a stronger and more delicate version of the training error. Two of the most prominent algorithms in the field, Support Vector Machines (SVM) [116] and AdaBoost [33], are motivated and analyzed by margins. Since the introduction of these algorithms, dozens of papers have been published on different aspects of margins in supervised learning [100, 81, 15].

The most common way to define the margin of an instance with respect to a classification rule is as the distance between the instance and the decision boundary induced by the classification rule. SVM uses this kind of margin. We refer to this kind of margin as the sample margin. However, as we show in [24], an alternative definition, the hypothesis margin, exists. In this definition the margin is the distance that the classifier can travel without changing the way it labels a sample instance. Note that this definition requires a distance measure between classifiers. This type of margin is used in AdaBoost [33]. The difference between

Figure 3.1: Illustration of the two types of margins for homogeneous linear classifiers, with the angle as the distance measure between classifiers. (a) The sample margin measures how much an instance can travel before it hits the decision boundary. (b) The hypothesis margin measures how much the hypothesis can travel before it hits an instance.

these two definitions of margin is illustrated in figure 3.1.

While the margins themselves are non-negative, the standard convention is to add a sign to them that indicates whether the instance was correctly classified, where a negative sign indicates a wrong classification. We use the terms signed margin and unsigned margin to indicate whether the sign is considered or not, but this prefix is omitted in the following whenever it is clear from the context. Note that the signed margin depends on the true label and therefore is only available during the training stage.

3.3 Margins for 1-NN

Throughout this chapter we will be interested in margins for 1-NN. As described in section 3.1, a 1-NN classifier is defined by a set of training points (prototypes), and the decision boundary is the Voronoi tessellation. The sample margin in this case is the distance between the instance and the Voronoi tessellation, and therefore it measures the sensitivity to small

changes of the instance position. We define the hypothesis margin for 1-NN as follows:

Definition 3.1 The (unsigned) hypothesis margin for 1-NN with respect to an instance x is the maximal distance θ such that the following condition holds: if we draw a ball with radius θ around each prototype, any change of the location of the prototypes inside their θ-balls will not change the assigned label of the instance x.

Therefore, the hypothesis margin measures stability with respect to small changes in the prototype locations. See figure 3.2 for an illustration. The sample margin for 1-NN can be unstable, since small relocations of the prototypes might lead to a dramatic change in the sample margin. Thus the hypothesis margin is preferable in this case. Furthermore, the hypothesis margin is easy to compute (lemma 3.1) and lower bounds the sample margin (lemma 3.2).

Lemma 3.1 Let ν = (ν₁, ..., ν_k) be a set of prototypes and let x be an instance. Then the (unsigned) hypothesis margin of ν with respect to x is θ = ½(‖ν_j − x‖ − ‖ν_i − x‖), where ν_i is the closest prototype to x with the same label as x and ν_j is the closest prototype to x with an alternative label.

The proof of this lemma follows directly from the definition.

Lemma 3.2 Let x be an instance and ν = (ν₁, ..., ν_k) be a set of prototypes. Then the (unsigned) sample margin of x with respect to ν is greater than or equal to the (unsigned) hypothesis margin of ν with respect to x.

Proof. Let θ be larger than the sample margin, i.e. there exists x̂ such that ‖x − x̂‖ ≤ θ but x and x̂ are labeled differently by ν. Since they are labeled differently, the closest prototype to x is different from the closest prototype to x̂. W.l.o.g. let ν₁ be the closest prototype to x and ν₂ be the closest prototype to x̂.

Let w = x − x̂, so ‖w‖ ≤ θ. Define ν̂_j = ν_j + w for all j. We claim that ν̂₂ is the closest prototype to x in ν̂. This follows since

‖x − ν̂_j‖ = ‖x − (ν_j + w)‖ = ‖x̂ − ν_j‖



Figure 3.2: The two types of margins for the Nearest Neighbor rule. We consider a set of prototypes (the circles) and measure the margin with respect to an instance (the square). (a) The (unsigned) sample margin is the distance between the instance and the decision boundary (the Voronoi tessellation). (b) The (unsigned) hypothesis margin is the largest distance the prototypes can travel without altering the label of the new instance. In this case it is half the difference between the distance to the nearmiss and the distance to the nearhit of the instance. Note that in this example the prototype that governs the margin is not the same for the two types of margins: for the hypothesis margin it is the one closest to the new instance, whereas for the sample margin it is the one that defines the relevant segment of the Voronoi tessellation.

Hence ν̂ assigns x a different label, and no prototype has traveled a distance larger than θ; thus the hypothesis margin is less than θ.

Since this result holds for any θ greater than the sample margin of ν, we conclude that the hypothesis margin of ν is less than or equal to its sample margin.

Lemma 3.2 shows that if we find a set of prototypes with a large hypothesis margin, then it has a large sample margin as well.

3.4 Margin Based Evaluation Function

Our selection algorithms work within the most common selection paradigm, which consists of an evaluation function that assigns a score to a given subset of features and a search method that looks for a set with a high score (see section 1.2.1 for more details on this paradigm). In this section we introduce our margin based evaluation function.

A good generalization can be guaranteed if many instances have a large margin (see

section 3.6). We introduce an evaluation function which assigns a score to sets of features according to the margin they induce. First we formulate the margin as a function of the selected set of features.

Definition 3.2 Let P be a set of prototypes and x an instance. Let w be a weight vector over the feature set. Then the margin of x is

\theta_P^w(x) = \frac{1}{2}\left(\|x - nearmiss(x)\|_w - \|x - nearhit(x)\|_w\right) \qquad (3.1)

where \|z\|_w = \sqrt{\sum_i w_i^2 z_i^2}, nearhit(x) is the closest prototype to x with the same label as x, and nearmiss(x) is the closest prototype to x with an alternative label.

Definition 3.2 extends beyond feature selection and allows weights over the features. When selecting a set of features F we can use the same definition by identifying F with its indicator vector. Therefore, we denote θ_P^F(x) := θ_P^{I_F}(x), where I_F is one for any feature in F and zero otherwise.

Since θ^{λw}(x) = |λ| θ^w(x) for any scalar λ, it is natural to introduce some normalization factor. The natural normalization is to require max_i w_i² = 1, since it guarantees that ‖z‖_w ≤ ‖z‖, where the right hand side is the Euclidean norm of z.

Now we turn to defining the evaluation function. The building blocks of this function are the margins of all the instances. The margin of each instance x is calculated with respect to the sample excluding x (the leave-one-out margin).

Definition 3.3 Let u(·) be a utility function. Given a training set S and a weight vector w, the evaluation function is:

e(w) = \sum_{x \in S} u\left(\theta_{S \setminus x}^{w}(x)\right) \qquad (3.2)

The utility function controls the contribution of each margin term to the overall score. It is natural to require the utility function to be non-decreasing; thus a larger margin introduces a larger utility. We consider three utility functions: linear, zero-one and sigmoid. The linear utility function is defined as u(θ) = θ. When the linear utility function is used, the evaluation function is simply the sum of the margins. The zero-one utility equals 1 when the margin is positive and 0 otherwise. When this utility function is used, the evaluation function is proportional to the leave-one-out accuracy.

The sigmoid utility is u(θ) = 1/(1 + exp(−βθ)). The sigmoid utility function is less sensitive to outliers than the linear utility, but does not ignore the magnitude of the margin completely as the zero-one utility does. Note also that for β → ∞ the sigmoid utility approaches the zero-one utility, while for β → 0 it behaves (up to an affine transformation) like the linear utility. In the Simba algorithm we assume that the utility function is differentiable, and therefore the zero-one utility cannot be used there.

It is natural to look at the evaluation function only for weight vectors w such that max_i w_i² = 1. However, formally, the evaluation function is well defined for any w, a fact which we make use of in the Simba algorithm. We also use the notation e(F), where F is a set of features, to denote e(I_F).
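The weighted margin of definition 3.2 and the evaluation function of definition 3.3 can be computed directly. The following sketch, with an illustrative sigmoid utility and hypothetical helper names, is the building block reused by the algorithm sketches in the next section.

import numpy as np

def margin(x, y, P_X, P_y, w):
    """Hypothesis margin of instance (x, y) w.r.t. prototypes (P_X, P_y), definition 3.2."""
    d = np.sqrt(((w ** 2) * (P_X - x) ** 2).sum(axis=1))   # ||x - p||_w for every prototype p
    nearhit = d[P_y == y].min()
    nearmiss = d[P_y != y].min()
    return 0.5 * (nearmiss - nearhit)

def evaluation(S_X, S_y, w, beta=1.0):
    """e(w) of definition 3.3 with a sigmoid utility, using leave-one-out margins."""
    e = 0.0
    for i in range(len(S_X)):
        rest = np.arange(len(S_X)) != i
        theta = margin(S_X[i], S_y[i], S_X[rest], S_y[rest], w)
        e += 1.0 / (1.0 + np.exp(-beta * theta))            # sigmoid utility u(theta)
    return e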

3.5 Algorithms

In this section we present two algorithms which attempt to maximize the margin based evaluation function. Both algorithms can cope with multi-class problems. Our algorithms can be considered as filter methods for general classifiers, but they also have much in common with wrappers for 1-NN. A Matlab implementation of these algorithms is available at http://www.cs.huji.ac.il/labs/learning/code/feature_selection/. I also wrote a Matlab GUI application that was designed to make it easy to perform feature selection and assess the effect of the selection on classification accuracy. The tool is called Feature Selection Tool (FST) and can be downloaded from http://www.cs.huji.ac.il/~anavot/feature_selection_tool/fst.htm. The tool is mainly aimed at making it easier for researchers who are not programmers to use and evaluate feature selection on their data. It supports the selection algorithms presented in this chapter and some other related or classic selection algorithms. The classification algorithms currently supported are 1-Nearest-Neighbor, Naive Bayes and SVM.

3.5.1 Greedy Feature Flip Algorithm (G-flip)

G-flip (algorithm 1) is a greedy search algorithm for maximizing e(F), where F is a set of features. The algorithm repeatedly iterates over the feature set and updates the set of chosen

Algorithm 1 Greedy Feature Flip (G-flip)

1. Initialize the set of chosen features to the empty set: F = ∅
2. for t = 1, 2, ...
   (a) pick a random permutation s of {1, ..., N}
   (b) for i = 1 to N:
       i. evaluate e₁ = e(F ∪ {s(i)}) and e₂ = e(F \ {s(i)})
       ii. if e₁ > e₂, set F = F ∪ {s(i)}; else if e₂ > e₁, set F = F \ {s(i)}
   (c) if no change was made in step (b), then break

features. In each iteration it decides whether to remove or add the current feature to the selected set by evaluating the margin term (3.2) with and without this feature. This algorithm is similar to the zero-temperature Monte-Carlo (Metropolis) method. It converges to a local maximum of the evaluation function, as each step increases its value and the number of possible feature sets is finite. The computational complexity of one pass over all features in a naive implementation of G-flip is Θ(N²m²), where N is the number of features and m is the number of instances. However, the complexity can be reduced to Θ(Nm²), since the distance matrix can be updated efficiently after each addition or deletion of a feature from the current active set. Empirically, G-flip converges in a few iterations: in all our experiments it converged after less than 20 epochs, in most cases in less than 10. A nice property of this algorithm is that once the utility function is chosen, it is parameter free: there is no need to tune the number of features or any kind of threshold.
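A direct (naive, Θ(N²m²)-per-pass) transcription of algorithm 1 into Python might look as follows; `evaluation` is the margin-based score e(·) sketched in section 3.4, applied to the indicator vector of the candidate set.

import numpy as np

def g_flip(S_X, S_y, evaluation, max_epochs=20, seed=0):
    """Greedy Feature Flip: greedily add/remove features to maximize e(F)."""
    rng = np.random.default_rng(seed)
    N = S_X.shape[1]
    F = set()
    def score(features):
        w = np.zeros(N)
        w[list(features)] = 1.0                  # indicator vector I_F
        return evaluation(S_X, S_y, w)
    for _ in range(max_epochs):
        changed = False
        for i in rng.permutation(N):
            e_with = score(F | {i})
            e_without = score(F - {i})
            if e_with > e_without and i not in F:
                F.add(i); changed = True
            elif e_without > e_with and i in F:
                F.discard(i); changed = True
        if not changed:
            break
    return sorted(F)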

3.5.2 Iterative Search Margin Based Algorithm (Simba)

The G-flip algorithm presented in section 3.5.1 tries to find the feature set that maximizes the margin directly. Here we take another approach: we first find the weight vector w that maximizes e(w) as defined in (3.2), and then use a threshold in order to obtain a feature set. Of course, it is also possible to use the weights directly, via the induced distance measure. Since e(w) is smooth almost everywhere whenever the utility function is smooth, we use gradient ascent in order to maximize it. The gradient of e(w) when evaluated on a

Algorithm 2 Simba

1. initialize w = (1, 1, ..., 1)
2. for t = 1, ..., T:
   (a) pick an instance x at random from S
   (b) calculate nearmiss(x) and nearhit(x) with respect to S \ {x} and the weight vector w
   (c) for i = 1, ..., N, calculate

       \Delta_i = \frac{1}{2} \frac{\partial u(\theta(x))}{\partial \theta(x)} \left( \frac{(x_i - nearmiss(x)_i)^2}{\|x - nearmiss(x)\|_w} - \frac{(x_i - nearhit(x)_i)^2}{\|x - nearhit(x)\|_w} \right) w_i

   (d) w = w + Δ
3. w ← w² / ‖w²‖_∞ where (w²)_i := (w_i)²

sample S is:

(\nabla e(w))_i = \frac{\partial e(w)}{\partial w_i} = \sum_{x \in S} \frac{\partial u(\theta(x))}{\partial \theta(x)} \frac{\partial \theta(x)}{\partial w_i} = \frac{1}{2} \sum_{x \in S} \frac{\partial u(\theta(x))}{\partial \theta(x)} \left( \frac{(x_i - nearmiss(x)_i)^2}{\|x - nearmiss(x)\|_w} - \frac{(x_i - nearhit(x)_i)^2}{\|x - nearhit(x)\|_w} \right) w_i \qquad (3.3)

In Simba (algorithm 2) we use stochastic gradient ascent over e(w), while ignoring the constraint ‖w²‖_∞ = 1. In each step we evaluate only one term of the sum in (3.3) and add it to the weight vector w. The projection onto the constraint is done only at the end (step 3). The computational complexity of Simba is Θ(TNm), where T is the number of iterations, N is the number of features and m is the size of the sample S. Note that when iterating over all training instances, i.e. when T = m, the complexity is Θ(Nm²).
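A compact version of Simba with the linear utility (so that ∂u/∂θ = 1) is sketched below; it is an illustration under the assumption that every class has at least two instances, not the reference implementation distributed with the thesis.

import numpy as np

def simba(S_X, S_y, T=None, seed=0):
    """Simba with linear utility: stochastic gradient ascent on e(w), then project."""
    rng = np.random.default_rng(seed)
    m, N = S_X.shape
    T = m if T is None else T
    w = np.ones(N)
    for _ in range(T):
        i = rng.integers(m)
        x, y = S_X[i], S_y[i]
        d = np.sqrt(((w ** 2) * (S_X - x) ** 2).sum(axis=1))
        d[i] = np.inf                                   # exclude x itself (S \ {x})
        hit = np.argmin(np.where(S_y == y, d, np.inf))  # nearhit(x)
        miss = np.argmin(np.where(S_y != y, d, np.inf)) # nearmiss(x)
        delta = 0.5 * ((x - S_X[miss]) ** 2 / d[miss]
                       - (x - S_X[hit]) ** 2 / d[hit]) * w
        w = w + delta
    w2 = w ** 2
    return w2 / np.abs(w2).max()                        # w <- w^2 / ||w^2||_inf

The resulting weights can then be thresholded, or the top-n features taken, to obtain a feature set.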

3.5.3 Comparison to Relief

Relief [61] (algorithm 3) is a feature selection algorithm which was shown to be very efficient at estimating feature quality. The algorithm holds a weight vector over all features and updates this vector according to the instances presented. [61] proved that, under some assumptions, the expected weight is large for relevant features and small for irrelevant ones. They also explain how to choose the relevance threshold τ in a way that ensures that the prob-

Algorithm 3 RELIEF [61]

1. initialize the weight vector to zero: w = 0
2. for t = 1, ..., T:
   (a) pick an instance x at random from S
   (b) for i = 1, ..., N:
       i. w_i = w_i + (x_i − nearmiss(x)_i)² − (x_i − nearhit(x)_i)²
3. the chosen feature set is {i | w_i > τ}, where τ is a threshold

ability that a given irrelevant feature is chosen is small. Relief was extended to deal with multi-class problems, noise and missing data by [65]. For multi-class problems, [65] also presents a version called Relief-F which, instead of using the distance to the nearest point with an alternative label, looks at the distances to the nearest instance of each alternative class and takes their average. In our experiments Relief-F was inferior to the standard Relief.

Note that the update rule in a single step of Relief is similar to the one performed by Simba when the utility function is linear, i.e. u(θ) = θ and thus ∂u(θ)/∂θ = 1. Indeed, empirical evidence shows that Relief does increase the margin (see section 3.7). However, there is a major difference between Relief and Simba: Relief does not re-evaluate the distances according to the weight vector w, and thus it is inferior to Simba. In particular, Relief has no mechanism for eliminating redundant features. Simba may also choose correlated features, but only if this contributes to the overall performance. In terms of computational complexity, Relief and Simba are equivalent.
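For comparison with the Simba sketch above, the core Relief update (binary-class version of algorithm 3) differs only in that the distances used to pick nearhit and nearmiss are unweighted and the update has no w_i factor. A minimal sketch, under the same assumption that every class has at least two instances:

import numpy as np

def relief(S_X, S_y, T=None, seed=0):
    """RELIEF weights (algorithm 3); features with w_i above a threshold are kept."""
    rng = np.random.default_rng(seed)
    m, N = S_X.shape
    T = m if T is None else T
    w = np.zeros(N)
    for _ in range(T):
        i = rng.integers(m)
        x, y = S_X[i], S_y[i]
        d = np.sqrt(((S_X - x) ** 2).sum(axis=1))       # plain Euclidean distances
        d[i] = np.inf
        hit = np.argmin(np.where(S_y == y, d, np.inf))
        miss = np.argmin(np.where(S_y != y, d, np.inf))
        w += (x - S_X[miss]) ** 2 - (x - S_X[hit]) ** 2
    return w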

3.5.4 Comparison to R2W2

R2W2 [120] is the state-of-the-art feature selection algorithm for the Support Vector Machine (SVM) classifier. This algorithm is a sophisticated wrapper for SVM and therefore uses the maximal margin principle for feature selection indirectly. The goal of the algorithm is to find a weight vector over the features which minimizes the objective function of the SVM optimization problem. This objective function can be written as R²W², where

R is the radius of a ball containing all the training data and W is the norm of the linear separator. The optimization is done using gradient descent. After each gradient step a new SVM optimization problem is constructed and solved; thus the algorithm becomes cumbersome for large scale data.

The derivation of the R2W2 algorithm assumes that the data are linearly separable. Since this cannot be guaranteed in the general case, we use the ridge trick of adding a constant value to the diagonal of the kernel matrix. Note also that R2W2 is designed for binary classification tasks only. There are several ways in which it can be extended to multi-class problems; however, these extensions would make the algorithm even more demanding than its original version.

As in SVM, R2W2 can be used together with a kernel function. We chose to use the Radial Basis Function (RBF) kernel, defined as

K(x_1, x_2) = e^{-\frac{\|x_1 - x_2\|^2}{2\sigma^2}}

where σ is a predefined parameter. The choice of the RBF kernel is due to the similarity between SVM with an RBF kernel and the nearest-neighbor rule. Our implementation is based on the one in the Spider package [119].

3.6 Theoretical Analysis

In this section we use feature selection and large margin principles to prove finite sample generalization bounds for One Nearest Neighbor (1-NN) combined with a preceding feature selection step. [22] showed that asymptotically the generalization error of 1-NN can exceed the generalization error of the Bayes optimal classification rule by at most a factor of 2. However, on finite samples, nearest neighbor can overfit and exhibit poor performance.

1-NN gives zero training error on almost any sample. Thus the training error is too rough to provide information on the generalization performance of 1-NN. We therefore need a more refined measure in order to obtain meaningful generalization bounds, and this is where margins become useful. It turns out that, in a sense, 1-NN is a maximum margin algorithm: once the proper definition of margin is used, i.e. the sample margin, it is easy to verify that
3.6. Theoretical Analysis 51

1-NN generates the classication rule with the largest possible margin.

The combination of a large margin and a small number of features provides enough

evidence to obtain a useful bound on the generalization error. The bound we provide here is

data-dependent [103, 9]. Therefore, the value of the bound depends on the specic sample.

The bound holds simultaneously for any possible method to select a set of features. Thus, if

a selection algorithm selects a small set of features with a large margin, the bound guarantees

that 1-NN that is based on this set of features will generalize well. This is the theoretical

rationale behind Simba and G-ip.


We use the following notation in our theoretical results:

Definition 3.4 Let D be a distribution over X × {±1} and h : X → {±1} a classification function. We denote by er_D(h) the generalization error of h with respect to D:

    er_D(h) = Pr_{(x,y)∼D} [h(x) ≠ y]

For a sample S = {(xk, yk)}_{k=1}^m ∈ (X × {±1})^m and a constant γ > 0 we define the γ-sensitive training error to be

    êr^γ_S(h) = (1/m) |{k : (h(xk) ≠ yk) or (xk has sample-margin < γ)}|

Our main result is the following theorem:


Theorem 3.1² Let D be a distribution over R^N × {±1} which is supported on a ball of radius R in R^N. Let δ > 0 and let S be a sample of size m such that S ∼ D^m. With probability 1 − δ over the random choice of S, for any set of features F and any γ ∈ (0, 1]

    er_D(h) ≤ êr^γ_S(h) + √( (2/m) [ d ln(34em/d) log₂(578m) + ln(8/(γδ)) + (|F| + 1) ln N ] )

where h is the nearest neighbor classification rule when distance is measured only on the features in F, and d = (64R/γ)^|F|.

A few notes about this bound. First, the size of the feature space, N, appears only logarithmically in the bound; hence it has only a minor effect on the generalization error of 1-NN. On the other hand, the number of selected features, |F|, appears in the exponent. This is another realization of the curse of dimensionality [12]. See appendix A for the proof of theorem 3.1.

² Note that the theorem holds when sample-margin is replaced by hypothesis-margin, since the latter lower bounds the former.

A large margin for many instances will make the first term of the bound small, while using a small set of features will make the second term of the bound small. This gives us the motivation to look for small sets of features that induce a large margin, and that is what G-flip and Simba do. As this is a worst case bound, like all PAC-style bounds, it is very loose in most cases, and the empirical results are expected to be much better.

3.7 Empirical Assessment


We first demonstrate the behavior of Simba on a small synthetic problem. Then we compare the different algorithms on image and text classification tasks. The first task is pixel (feature) selection for discriminating between male and female face images. The second task is word (feature) selection for multi-class document categorization. In order to demonstrate the ability of our algorithms to work with other classifiers (besides Nearest Neighbor) we also report results with SVM with an RBF kernel (see section 3.7.4). We also report the results obtained on some of the datasets of the NIPS-2003 feature selection challenge [44]. For these comparisons we used the following feature selection algorithms: Simba with both linear and sigmoid utility functions (referred to as Simba(lin) and Simba(sig) respectively), G-flip with linear, zero-one and sigmoid utility functions (referred to as G-flip(lin), G-flip(zero-one) and G-flip(sig) respectively), Relief, R2W2 and Infogain³.

3.7.1 The Xor Problem


To demonstrate the quality of the margin-based evaluation function and the ability of the Simba algorithm⁴ to deal with dependent features we use a synthetic problem. The problem consists of 1000 instances with 10 real-valued features. The target concept is a xor function over the first 3 features. Hence, the first 3 features are relevant while the other features are irrelevant. Note that this task is a special case of parity function learning and is considered hard for many feature selection algorithms [43]. Thus, for example, any algorithm which does not consider functional dependencies between features fails on this task. The simplicity (some might say over-simplicity) of this problem allows us to demonstrate some of the interesting properties of the algorithms studied.

³ Recall that Infogain ranks features according to the mutual information between each feature and the labels (see chapter 1, section 1.2.1).
⁴ The linear utility function was used in this experiment.

Figure 3.3: The results of applying Simba (solid) and Relief (dotted) on the xor synthetic problem. Top: the margin value, e(w), at each iteration. The dashed line is the margin of the correct weight vector. Bottom: the angle (in radians) between the weight vector and the correct feature vector at each iteration.
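As an aside, a minimal sketch of how a synthetic problem of this kind might be generated; the sampling distribution and the way the real-valued features are turned into a xor label are illustrative assumptions, not the exact construction used here.

    import numpy as np

    def make_xor_data(m=1000, n_features=10, n_relevant=3, seed=0):
        # real-valued features; the label is the parity (xor) of the signs of
        # the first n_relevant features, so only those features are relevant
        rng = np.random.default_rng(seed)
        X = rng.uniform(-1.0, 1.0, size=(m, n_features))
        parity = (X[:, :n_relevant] > 0).sum(axis=1) % 2
        y = np.where(parity == 1, 1, -1)
        return X, y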

Figure 3.3 presents the results we obtained on this problem. A few phenomena are apparent in these results. The value of the margin evaluation function is highly correlated with the angle between the weight vector and the correct feature vector (see figures 3.3 and 3.4). This correlation demonstrates that the margin correctly characterizes the quality of the weight vector. This is quite remarkable since our margin evaluation function can be measured empirically on the training data, whereas the angle to the correct feature vector is unknown during learning.

As suggested in section 3.5.3, Relief does increase the margin as well. However, Simba, which maximizes the margin directly, outperforms Relief quite significantly, as shown in figure 3.3.

Figure 3.4: The scatter plot shows the angle to the correct feature vector as a function of the value of the margin evaluation function. The values were calculated for the xor problem using Simba during iterations 150 to 1000. Note the linear relation between the two quantities.

Figure 3.5: Excerpts from the face images dataset.

3.7.2 Face Images


We applied the algorithms to the AR face database [80], which is a collection of digital images of males and females with various facial expressions, illumination conditions, and occlusions. We selected 1456 images and converted them to gray-scale images of 85 × 60 pixels, which are taken as our initial 5100 features. Examples of the images are shown in figure 3.5. The task we tested is classifying the male vs. the female faces.

In order to improve the statistical significance of the results, the dataset was partitioned independently 20 times into training data of 1000 images and test data of 456 images. For each such partitioning (split), Simba⁵, G-flip, Relief, R2W2 and Infogain were applied to select optimal features and the 1-NN algorithm was used to classify the test data points. We used 10 random starting points for Simba (i.e. random permutations of the training data) and selected the result of the single run which reached the highest value of the evaluation function.

⁵ Simba was applied with both linear and sigmoid utility functions. We used β = 0.01 for the sigmoid utility.

The average accuracy versus the number of features chosen is presented in figure 3.6. G-flip gives only one point on this plot, as it chooses the number of features automatically; although this number changes between different splits, it does not change significantly. The features G-flip(zero-one) selected enabled 1-NN to achieve an accuracy of 92.2% using only about 60 features, which is better than the accuracy obtained with the whole feature set (91.5%). G-flip(zero-one) outperformed any other alternative when only a few dozen features were used. Simba(lin) significantly outperformed Relief, R2W2 and Infogain, especially in the small number of features regime. If we define a difference as significant when one algorithm is better than the other in more than 90% of the partitions, we can see that Simba is significantly better than Relief, Infogain and R2W2 when fewer than a couple of hundred features are used (see figure 3.7). Moreover, the 1000 features that Simba selected enabled 1-NN to achieve an accuracy of 92.8%, which is better than the accuracy obtained with the whole feature set (91.5%).

A closer look at the features selected by Simba and Relief (figure 3.8) reveals the clear difference between the two algorithms. Relief focused on the hair-line, especially around the neck, and on other contour areas in a left-right symmetric fashion. This choice is suboptimal, as those features are highly correlated with each other and therefore a smaller subset would suffice. Simba, on the other hand, selected features in other informative facial locations, but mostly on one side (left) of the face, as the other side is clearly highly correlated with it and does not contribute new information to this task. Moreover, this dataset is biased in the sense that more faces are illuminated from the right. Many of them are saturated, and thus Simba preferred the left side over the less informative right side.

3.7.3 Reuters
We applied the different algorithms to a multi-class text categorization task. For this purpose we used a subset of the Reuters-21578 dataset⁶. We used the documents which are classified into exactly one of the following 4 topics: interest, trade, crude and grain. The obtained dataset contains 2066 documents, which are approximately equally distributed between the four classes. Each document was represented as the vector of counts of the different words.

⁶ The dataset can be found at http://www.daviddlewis.com/resources/



Figure 3.6: Results for the AR faces dataset. The accuracy achieved on the AR faces dataset when using the features chosen by the different algorithms. The results were averaged over the 20 splits of the dataset. For the sake of visual clarity, error bars are presented separately in figure 3.7.

Figure 3.7: Error intervals for the AR faces dataset. The accuracy achieved on the AR faces dataset when using the features chosen by the different algorithms. The error intervals show the area where 90% of the results (of the 20 repeats) fell, i.e., the range of the results after eliminating the best and the worst repeats out of the 20.

Figure 3.8: The features selected (in black) by Simba(lin) and Relief for the face recognition task. Panels 3.8(a), 3.8(b) and 3.8(c) show the 100, 500 and 1000 features selected by Simba; panels 3.8(d), 3.8(e) and 3.8(f) show the 100, 500 and 1000 features selected by Relief.

Stop-words were omitted and numbers were converted to a predefined special character as a preprocessing step.

To improve the statistical significance of our results, the corpus was partitioned 20 times into a training set of 1000 documents and a test set of 1066 documents. For each such partitioning, the words that appear fewer than 3 times in the training set were eliminated, which left ∼4000 words (features). For each partitioning G-flip, Simba, Relief, Relief-F and Infogain were applied to select an optimal set of features⁷ and 1-NN was used to classify the test documents. Simba and G-flip were applied with both linear and sigmoid utility functions⁸. G-flip was also applied with the zero-one utility function. We used 10 random starting points for Simba and selected the results of the single run that achieved the highest value of the evaluation function. The average (over the 20 splits) accuracy versus the number of chosen features is presented in figure 3.9. G-flip gives only one point on this plot, as it chooses the number of features automatically; although this number changes between different splits, it does not change significantly. Another look at the results is given in figure 3.10. This figure shows error intervals around the average, which allow appreciating the statistical significance of the differences in accuracy. The top ranked twenty features are presented in table 3.1.

The best overall accuracy (i.e. ignoring the number of features used) was achieved by G-flip(sig), which obtained 94.09% generalization accuracy using ∼350 features. Infogain and Simba(sig) are just a little behind with 92.86% and 92.41%, achieved using only ∼40 and ∼30 features respectively. Relief is far behind with 87.92%, achieved using 250 features.

The advantage of Simba(sig) is very clear when looking at the accuracy versus the number of features used. Among the algorithms tested, Simba(sig) is the only one that achieved (almost) the best accuracy over the whole range. Indeed, when fewer than a few dozen features are used, Infogain achieved similar accuracy, but when more features are used its accuracy drops dramatically. While Simba(sig) achieved accuracy above 90% for any number of features between ∼20 and ∼3000, the accuracy achieved by Infogain dropped below 90% when more than ∼400 features are used, and below 85% when more than ∼1500 are used.

⁷ R2W2 was not applied to this problem as it is not defined for multi-class problems.
⁸ The sigmoid utility function was used with β = 1 for both G-flip and Simba.

    Simba(sig)          Simba(lin)             Relief               Infogain
    oil      rate       oil      corn          ##       s           oil        trade
    wheat    bpd        ##       brazil        #        bank        tonnes     wheat
    trade    days       tonnes   wish          ###      rate        crude      bank
    tonnes   crude      trade    rebate        oil      dlrs        rate       agriculture
    corn     japanese   wheat    water         trade    barrels     grain      barrels
    bank     deficit    bank     stocks        ####     rates       petroleum  rates
    rice     surplus    rice     certificates  billion  bpd         pct        corn
    grain    indonesia  rates    recession     opec     pct         tariffs    japan
    rates    interest   billion  got           tonnes   barrel      mt         deficit
    ec       s          pct      five          wheat    cts         energy     bpd

Table 3.1: The first 20 words (features) selected by the different algorithms for the Reuters dataset. Note that Simba(sig) is the only algorithm which selected the titles of all four classes (interest, trade, crude, grain) among the first twenty features.

The advantage of Simba(sig) over Relief is also very clear. The accuracy achieved using the features chosen by Simba(sig) is notably and significantly better than the one achieved using the features chosen by Relief, for any number of selected features. Using only the top 10 features of Simba(sig) yields an accuracy of 89.21%, which is about the same as the accuracy that can be achieved using any number of features chosen by Relief.

3.7.4 Face Images with Support Vector Machines


In this section we show that our algorithms also work well when using another distance-based classifier instead of the Nearest Neighbor classifier. We test the different algorithms together with an SVM with RBF kernel classifier and show that our algorithms work as well as, and even better than, the R2W2 algorithm, which was tailored specifically for this setting and is much more computationally demanding.

We used the AR face database [80] and repeated the same experiment as described in section 3.7.2, the only difference being that once the features were chosen, we used SVM-RBF to classify the test data points. The σ parameter of the RBF kernel was set to 3500, the same value used in the R2W2 feature selection algorithm; it was tuned using cross-validation. See figure 3.11 for a summary of the results.
[Figure 3.9 plot: accuracy (%) as a function of the number of selected features (log scale) for G-flip(lin), G-flip(zero-one), G-flip(sig), Simba(lin), Simba(sig), Relief, Relief-F and Infogain; see the caption below.]

Figure 3.9: Results for the Reuters dataset. The accuracy achieved on the Reuters dataset when using the features chosen by the different algorithms. The results were averaged over the 20 splits of the dataset. For the sake of visual clarity, error bars are presented separately in figure 3.10.

[Figure 3.10 plot: accuracy (%) as a function of the number of selected features for Relief, Simba(lin), Simba(sig) and Infogain; see the caption below.]

Figure 3.10: Error intervals for the Reuters dataset. The accuracy achieved on the Reuters dataset when using the features chosen by the different algorithms. The error intervals show the area where 90% of the results (of the 20 repeats) fell, i.e., the range of the results after eliminating the best and the worst repeats out of the 20.

Figure 3.11: Results for the AR faces dataset with the SVM-RBF classifier. The accuracy achieved on the AR faces dataset when using the features chosen by the different algorithms and the SVM-RBF classifier. The results were averaged over the 20 splits of the dataset.

Both Simba and G-flip perform well, especially in the small number of features regime. The results in the graph are the average over the 20 partitions of the data. Note that the only wins that are 90% significant are those of Simba(lin) and G-flip(zero-one), when only a few dozen features are used.

3.7.5 The NIPS-03 Feature Selection Challenge


The NIPS-03 feature selection challenge [44] was the prime motivator for developing the selection algorithms presented in this chapter. The first versions of G-flip were developed during the challenge, and Simba was developed following the challenge. We applied G-flip(lin) as part of our experiments in the challenge. The two datasets on which we used G-flip are ARCENE and MADELON.

In ARCENE we first used Principal Component Analysis (PCA) as a preprocessing step and then applied G-flip to select the principal components to be used for classification. G-flip selected only 76 features. These features were fed to a Support Vector Machine (SVM) with an RBF kernel and yielded a 12.66% balanced error (the best result on this dataset was a 10.76% balanced error).

In MADELON we did not apply any preprocessing. G-flip selected only 18 of the 500 features in this dataset. Feeding these features to SVM-RBF resulted in a 7.61% balanced error, while the best result on this dataset was a 6.22% error.

Although G-flip is a very simple and naive algorithm, it ranks as one of the leading feature selection methods on both ARCENE and MADELON. It is interesting to note that when 1-NN is used as the classification rule instead of SVM-RBF, the error degrades only by ∼1% on both datasets. However, on the other datasets of the feature selection challenge we did not use G-flip, either due to its computational requirements or due to poor performance. Note that we tried only the linear utility function for G-flip and did not try Simba on any of the challenge datasets, since it was developed after the challenge ended.

3.8 Relation to Learning Vector Quantization


As already mentioned in section 3.1, a possible way to improve the computational complexity of 1-NN and to avoid overfitting noise in large high-dimensional datasets is to replace the training set with a small number of prototypes. The goal of the training stage in this context is to find a good small set of prototypes, i.e. a set that induces a small generalization error. The most prominent algorithm for this task is the 20-year-old Learning Vector Quantization (LVQ) algorithm [63]. Several variants of LVQ have been presented in the past, but all of them share the following common scheme. The algorithm maintains a set of prototypes, ν = (ν1, . . . , νk), each assigned a predefined label which is kept constant during the learning process. It cycles through the training data S^m = {(xl, yl)}_{l=1}^m and in each iteration modifies the set of prototypes in accordance with one instance (xt, yt). If the prototype νj has the same label as yt it is attracted to xt, but if the label of νj is different it is repelled from it. Hence LVQ updates the closest prototypes to xt according to the rule:

νj ← νj ± αt (xt − νj ) , (3.4)

where the sign is positive if the labels of xt and νj agree, and negative otherwise. The parameter αt is updated using a predefined scheme and controls the rate of convergence of the algorithm. The variants of LVQ differ as to which prototypes they choose to update in each iteration and as regards the specific scheme used to modify αt.
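A minimal Python sketch of one such update, assuming the simplest variant in which only the single closest prototype is moved; the learning-rate schedule for αt is left to the caller and is not specified here.

    import numpy as np

    def lvq_step(prototypes, proto_labels, x_t, y_t, alpha_t):
        # find the closest prototype and apply the update rule (3.4)
        d = ((prototypes - x_t) ** 2).sum(axis=1)
        j = int(np.argmin(d))
        sign = 1.0 if proto_labels[j] == y_t else -1.0   # attract if labels agree, repel otherwise
        prototypes[j] += sign * alpha_t * (x_t - prototypes[j])
        return prototypes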

LVQ was presented as a heuristic algorithm. However, in [24] we show that it emerges naturally from the margin-based evaluation function presented in section 3.4. Up to very minor differences, LVQ is a stochastic gradient ascent over this evaluation function, and the different variants of LVQ correspond to different choices of utility function. In this sense LVQ is dual to our margin-based selection algorithms: both optimize the same target function, but with respect to different parameters. In the selection algorithms the positions of the prototypes are fixed and the free parameter is the set of selected features, whereas in LVQ the set of features is fixed and the free parameters are the positions of the prototypes. This new point of view on the veteran LVQ as a maximum margin algorithm enables us to derive margin-based bounds on the generalization error of LVQ. This bound is similar in form to the bound presented in section 3.6. We do not go into detail here, as it is not directly related to feature selection and thus goes beyond the scope of this dissertation. The details can be found in [24]. However, another nice observation is the following: SVM looks for the linear classifier with the maximal minimum margin over the training set. On the other hand, among all possible classifiers, 1-NN has the maximal minimum margin over the training set, but it has a very complicated structure. We show that LVQ looks for the classifier with the maximal minimum margin for a given number of prototypes k. Recalling that for k = 2 we get a linear classifier, and for k equal to the number of instances we get 1-NN, we can consider LVQ as a family of large margin classifiers with a parameter k that controls the complexity of the hypothesis class.



3.9 Summary and Discussion


In this chapter, a margin-based criterion for measuring the quality of a set of features has been presented. Using this criterion we derived algorithms that perform feature selection by searching for the set that maximizes it. We suggested two new methods for maximizing the margin-based measure: G-flip, which performs a naive local search, and Simba, which performs a gradient ascent. These are just two representatives of the variety of optimization techniques (search methods) which can be used. We have also shown that the well-known Relief algorithm [61] approximates a gradient ascent algorithm that maximizes this measure. The nature of the different algorithms presented here was demonstrated on various feature selection tasks. It was shown that our new algorithm Simba, which is a gradient ascent on our margin-based measure, outperforms Relief on all these tasks. One of the main advantages of the margin-based criterion is the high correlation it exhibits with feature quality. This was demonstrated in figures 3.3 and 3.4.

The margin-based criterion was developed using the 1-Nearest-Neighbor classifier, but we expect it to work well for any distance-based classifier. In addition to the tests we made with 1-NN, we also tested our algorithms with an SVM-RBF classifier and showed that they compete successfully with the state-of-the-art algorithm that was designed specifically for SVM and is much more computationally demanding.

Our main theoretical result in this chapter is a new rigorous bound on the finite sample generalization error of the 1-Nearest-Neighbor algorithm. This bound depends on the margin obtained following the feature selection.

In the experiments we conducted, the merits of the new algorithms were demonstrated. However, our algorithms use the Euclidean norm and assume that it is meaningful as a measure of similarity in the data. When this assumption fails, our algorithms might not work. Coping with other similarity measures would be an interesting extension of the work presented here.

The user of G-flip or Simba should choose a utility function to work with. Here we have demonstrated three such functions: the linear, zero-one and sigmoid utility functions. The linear utility function gives equal weight to all points and thus might be sensitive to outliers. The sigmoid utility function suppresses the influence of such outliers. We have also observed in our experiments that G-flip with the zero-one utility uses fewer features than G-flip with the linear utility, while the sigmoid utility lies in between. How to adapt the right utility to the data being studied is still an open problem. Nevertheless, much like the choice of kernel for SVM, a reasonable candidate can be found using a validation set. A reasonable initial value for the parameter β of the sigmoid utility is on the same order of magnitude as one over the average distance between training instances. As for the choice between G-flip and Simba: G-flip is adequate when the goal is to choose the best feature subset, without a need to control its precise size. Simba is more adequate when a ranking of the features is required.
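As a rough illustration of this rule of thumb, the following sketch estimates an initial β from the data; sampling a subset of pairs instead of all pairs is an implementation shortcut, not something prescribed in the text.

    import numpy as np

    def initial_sigmoid_beta(X, n_pairs=1000, seed=0):
        # beta ~ 1 / (average distance between training instances)
        rng = np.random.default_rng(seed)
        i = rng.integers(len(X), size=n_pairs)
        j = rng.integers(len(X), size=n_pairs)
        avg_dist = np.mean(np.linalg.norm(X[i] - X[j], axis=1))
        return 1.0 / avg_dist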

A Complementary Proofs for Chapter 3


We begin by proving a simple lemma which shows that the class of nearest neighbor classifiers is a subset of the class of 1-Lipschitz functions. Let nn^S_F(·) be a function such that the sign of nn^S_F(x) is the label that the nearest neighbor rule assigns to x, while the magnitude is the sample-margin, i.e. the distance between x and the decision boundary.

Lemma 3.3 Let F be a set of features and let S be a labeled sample. Then for any x1, x2 ∈ R^N:

    |nn^S_F(x1) − nn^S_F(x2)| ≤ ‖F(x1) − F(x2)‖

where F(x) is the projection of x on the features in F.

Proof. Let x1, x2 ∈ X. We split our argument into two cases. First assume that nn^S_F(x1) and nn^S_F(x2) have the same sign. Let z1, z2 ∈ R^|F| be the points on the decision boundary of the 1-NN rule which are closest to F(x1) and F(x2) respectively. From the definition of z1, z2 it follows that nn^S_F(x1) = ‖F(x1) − z1‖ and nn^S_F(x2) = ‖F(x2) − z2‖, and thus

    nn^S_F(x2) ≤ ‖F(x2) − z1‖
              ≤ ‖F(x2) − F(x1)‖ + ‖F(x1) − z1‖
              = ‖F(x2) − F(x1)‖ + nn^S_F(x1)                        (3.5)

By repeating the above argument while reversing the roles of x1 and x2 we get

    nn^S_F(x1) ≤ ‖F(x2) − F(x1)‖ + nn^S_F(x2)                        (3.6)

Combining (3.5) and (3.6) we obtain

    |nn^S_F(x2) − nn^S_F(x1)| ≤ ‖F(x2) − F(x1)‖

The second case is when nn^S_F(x1) and nn^S_F(x2) have alternating signs. Since nn^S_F(·) is continuous, there is a point z on the line connecting F(x1) and F(x2) such that z is on the decision boundary. Hence,

    |nn^S_F(x1)| ≤ ‖F(x1) − z‖
    |nn^S_F(x2)| ≤ ‖F(x2) − z‖

and so we obtain

    |nn^S_F(x2) − nn^S_F(x1)| = |nn^S_F(x2)| + |nn^S_F(x1)|
                             ≤ ‖F(x2) − z‖ + ‖F(x1) − z‖
                             = ‖F(x2) − F(x1)‖

The main tool for proving theorem 3.1 is the following:

Theorem 3.2 [9] Let H be a class of real-valued functions. Let S be a sample of size m generated i.i.d. from a distribution D over X × {±1}. Then with probability 1 − δ over the choices of S, for every h ∈ H and every γ ∈ (0, 1], with d = fat_H(γ/32):

    er_D(h) ≤ êr^γ_S(h) + √( (2/m) [ d ln(34em/d) log₂(578m) + ln(8/(γδ)) ] )

We now turn to prove theorem 3.1:

Proof (of theorem 3.1): Let F be a set of features such that |F| = n and let γ > 0. In order to use theorem 3.2 we need to compute the fat-shattering dimension of the class of nearest neighbor classification rules which use the set of features F. As we saw in lemma 3.3, this class is a subset of the class of 1-Lipschitz functions on these features. Hence we can bound the fat-shattering dimension of the class of NN rules by the dimension of Lipschitz functions.

Since D is supported in a ball of radius R and ‖x‖ ≥ ‖F(x)‖, we need to calculate the fat-shattering dimension of Lipschitz functions acting on points in R^n with norm bounded by R. The fat_γ-dimension of the 1-NN functions on the features F is thus bounded by the largest γ-packing of a ball of radius R in R^n, which in turn is bounded by (2R/γ)^|F|. Therefore, for a fixed set of features F we can apply theorem 3.2 and use the bound on the fat-shattering dimension just calculated. Let δ_F > 0; then, according to theorem 3.2, with probability 1 − δ_F over a sample S of size m, for any γ ∈ (0, 1]

    er_D(nearest-neighbor) ≤ êr^γ_S(nearest-neighbor) + √( (2/m) [ d ln(34em/d) log₂(578m) + ln(8/(γδ_F)) ] )        (3.7)

for d = (64R/γ)^|F|. By choosing δ_F = δ / (N (N choose |F|)) we have that ∑_{F⊆[1...N]} δ_F = δ, and so we can apply the union bound to (3.7) and obtain the stated result.
Chapter 4

Feature Selection For Regression and its Application to Neural Activity¹

In this chapter we discuss feature selection for regression (also known as function estimation). Once again we use the Nearest Neighbor algorithm and an evaluation function which is similar in nature to the one used for classification in chapter 3. In this way we develop a non-linear, simple yet effective feature subset selection method for regression. Our algorithm is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in the motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.

The selection paradigm presented in section 1.2.1 of the introduction, which involves an evaluation function and a search method, is adequate for regression as well, but the evaluation function should be suitable for regression; i.e., it should consider the continuous properties of the target function. A possible choice of such an evaluation function is the leave-one-out (LOO) mean square error (MSE) of the k-Nearest-Neighbor (kNN) estimator ([27, 7]). This evaluation function has the advantage that it both gives a good approximation of the expected generalization error and can be computed quickly. [79] used this criterion on small synthetic problems (up to 12 features). They searched for good subsets using forward selection, backward elimination and an algorithm (called schemata) that races feature sets against each other (eliminating poor sets, keeping the fittest) in order to find a subset with a good score. All these algorithms perform a local search by flipping one or more features at a time. Since the space is discrete, the direction of improvement is found by trial and error, which slows the search and makes it impractical for large scale real world problems involving many features.

¹ The results presented in this chapter were first presented in our NIPS05 paper titled Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity [85].

We extend the LOO-kNN-MSE evaluation function to assign scores to weight vectors over the features, instead of just to feature subsets. This results in a smooth (almost everywhere) function over a continuous domain, which allows us to compute the gradient analytically and to employ a stochastic gradient ascent to find a locally optimal weight vector. The resulting weights provide a ranking of the features, which we can then threshold in order to produce a subset. In this way we can apply an easy-to-compute, gradient-directed search, without relearning a regression model at each step, while still employing a strong non-linear function estimator (kNN) that can capture complex dependency of the function on its features.

Our original motivation for developing this method was to address a major computational neuroscience question: which features of the neural code are relevant to the observed behavior? This is an important key to the interpretability of neural activity. Feature selection is a promising tool for this task. Here, we apply our feature selection method to the task of reconstructing hand movements from neural activity, which is one of the main challenges in implementing brain computer interfaces [109]. We look at neural population spike counts, recorded in the motor cortex of a monkey while it performed hand movements, and locate the most informative subset of neural features. We show that it is possible to improve prediction results by wisely selecting a subset of cortical units and their time lags relative to the movement. Our algorithm, which considers feature subsets, outperforms methods that consider features on an individual basis, suggesting that complex dependency on a set of features exists in the code.

4.1 Preliminaries
Let g(x), g : R^N → R, be a function that we wish to estimate. Given a set S ⊂ R^N, the empirical mean square error (MSE) of an estimator ĝ of g is defined as

    MSE_S(ĝ) = (1/|S|) ∑_{x∈S} (g(x) − ĝ(x))².

kNN Regression   As already explained in the previous chapter, k-Nearest-Neighbor (kNN) is a simple, intuitive and efficient way to estimate the value of an unknown function at a given point using its values at other (training) points. In section 3.1 we described the classification version of kNN. Here we use kNN for a regression problem, so instead of taking a majority vote over the labels of the k nearest neighbors, we take the average of their function values. Formally, let S = {x1, . . . , xm} be a set of training points. The kNN estimator is defined as the mean function value of the nearest neighbors: ĝ(x) = (1/k) ∑_{x'∈N(x)} g(x'), where N(x) ⊂ S is the set of the k nearest points to x in S and k is a parameter ([27, 7]). A softer version takes a weighted average, where the weight of each neighbor is proportional to its proximity. One specific way of doing this is

    ĝ(x) = (1/Z) ∑_{x'∈N(x)} g(x') e^(−d(x,x')/β)                    (4.1)

where d(x, x') = ‖x − x'‖²₂ is the squared ℓ₂ distance, Z = ∑_{x'∈N(x)} e^(−d(x,x')/β) is a normalization factor and β is a parameter. This soft kNN version will be used in the remainder of this chapter. This regression method is a special form of locally weighted regression (see [7] for an overview of the literature on this subject). It has the desirable property that no learning (other than storage of the training set) is required for the regression. Also note that the Gaussian Radial Basis Function has the form of a kernel ([116]) and can be replaced with any operator on two data points that decays as a function of the difference between them (e.g. kernel-induced distances). As will be seen in the next section, we use the MSE of a modified kNN regressor to guide the search for a set of features F ⊂ {1, . . . , n} that achieves a low MSE. However, the MSE and the Gaussian kernel can be replaced by other loss measures and kernels (respectively) as long as they are differentiable almost everywhere.
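For concreteness, a minimal Python sketch of the soft kNN estimator of equation 4.1 under these definitions (squared Euclidean distance and Gaussian weights with parameter β):

    import numpy as np

    def soft_knn_estimate(x, X_train, g_train, k, beta):
        # squared l2 distances to all training points
        d = ((X_train - x) ** 2).sum(axis=1)
        nn = np.argsort(d)[:k]                    # the k nearest neighbors N(x)
        a = np.exp(-d[nn] / beta)                 # Gaussian weights
        return np.dot(a, g_train[nn]) / a.sum()   # weighted average; a.sum() plays the role of Z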

4.2 The Feature Selection Algorithm


In this section we present our selection algorithm, called RGS (Regression, Gradient guided, feature Selection). It can be seen as a filter method for general regression algorithms or as a wrapper for estimation by the kNN algorithm.

Our goal is to find subsets of features that induce a small estimation error. As in most supervised learning problems, we wish to find subsets that induce a small generalization error, but since it is not known, we use an evaluation function on the training set. This evaluation function is defined not only for subsets but for any weight vector over the features. This is more general because a feature subset can be represented by a binary weight vector that assigns a value of one to features in the set and zero to the rest of the features.

For a given weight vector over the features, w ∈ R^n, we consider the weighted squared ℓ₂ norm induced by w, defined as ‖z‖²_w = ∑_i z_i² w_i², and use the k nearest neighbors according to the distance induced by this norm. We use N_w(x) to denote the set of these k neighbors. Given a training set S, we denote by ĝ_w(x) the value assigned to x by a weighted kNN estimator as in equation 4.1, using the weighted squared ℓ₂ norm as the distance d(x, x'); the nearest neighbors N_w(x) are found among the points of S excluding x:

    ĝ_w(x) = (1/Z) ∑_{x'∈N_w(x)} g(x') e^(−∑_i w_i²(x_i − x'_i)² / β),    N_w(x) ⊂ S \ {x}

The evaluation function is defined as the negative MSE of the weighted kNN estimator:

    e(w) = −(1/2) ∑_{x∈S} (g(x) − ĝ_w(x))²                    (4.2)

This evaluation function scores weight vectors w. A change of weights will cause a change in the distances and, possibly, in the identity of each point's nearest neighbors, which will change the function estimates. A weight vector that induces a distance measure in which neighbors have similar labels receives a high score. Note that there is no explicit regularization term in e(w). This is justified by the fact that for each point, the estimate of its function value does not include that point as part of the training set.

Algorithm 4 RGS (S, k, β, T)

  1. initialize w = (1, 1, . . . , 1)

  2. for t = 1 . . . T

     (a) pick randomly an instance x from S

     (b) evaluate the gradient of e(w) on x:

         ∇e(w) = −(g(x) − ĝ_w(x)) ∇_w ĝ_w(x)

         ∇_w ĝ_w(x) = [ −(4/β) ∑_{x',x''∈N_w(x)} g(x'') a(x', x'') u(x', x'') ] / [ ∑_{x',x''∈N_w(x)} a(x', x'') ]

         where a(x', x'') = e^(−(‖x−x'‖²_w + ‖x−x''‖²_w)/β)
         and u(x', x'') ∈ R^N is a vector with u_i = w_i [(x_i − x'_i)² + (x_i − x''_i)²]

     (c) w = w + η_t ∇e(w), where η_t is a decay factor

Thus, equation 4.2 is a leave-one-out cross validation error. Clearly, it is impossible to go over all the weight vectors (or even over all the feature subsets), and therefore some search technique is required.

Our method finds a weight vector w that locally maximizes e(w) as defined in (4.2) and then uses a threshold in order to obtain a feature subset. The threshold can be set either by cross validation or by finding a natural cutoff in the weight values. However, we later show that using the distance measure induced by w in the regression stage compensates for taking too many features. Since e(w) is defined over a continuous domain and is smooth almost everywhere, we can use gradient ascent in order to maximize it. RGS (algorithm 4) is a stochastic gradient ascent over e(w). In each step the gradient is evaluated using one sample point and is added to the current weight vector. RGS considers the weights of all the features at the same time and thus it can handle dependency on a group of features. This is demonstrated in section 4.3. In this respect, it is superior to selection algorithms that score each feature independently. It is also faster than methods that try to find a good subset directly by trial and error. Note, however, that convergence to a global optimum is not guaranteed, and standard techniques to avoid local optima can be used.
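To make the evaluation function concrete, the following is a minimal sketch of e(w) as defined in equation 4.2, computed in leave-one-out mode with the weighted soft kNN estimator; the stochastic gradient step of algorithm 4 (or a numerical gradient of this function) can then be used to increase it. The loop structure is illustrative and not optimized.

    import numpy as np

    def evaluate_e(w, X, g, k, beta):
        # e(w) of equation 4.2: negative half-sum of squared leave-one-out errors
        m = X.shape[0]
        total = 0.0
        for i in range(m):
            d = (((X - X[i]) * w) ** 2).sum(axis=1)   # weighted squared distances
            d[i] = np.inf                             # exclude the point itself (leave-one-out)
            nn = np.argsort(d)[:k]                    # N_w(x)
            a = np.exp(-d[nn] / beta)
            g_hat = np.dot(a, g[nn]) / a.sum()
            total += (g[i] - g_hat) ** 2
        return -0.5 * total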

The parameters of the algorithm are k (the number of neighbors), β (the Gaussian decay factor), T (the number of iterations) and {η_t}_{t=1}^T (the step size decay scheme). The value of k can be tuned by cross validation; however, a proper choice of β can compensate for a k that is too large. It makes sense to tune β to a value that places most neighbors in an active zone of the Gaussian. In our experiments, we set β to half of the mean distance between points and their k neighbors. It usually makes sense to use an η_t that decays over time to ensure convergence; however, on our data, convergence was also achieved with η_t = 1.

The computational complexity of RGS is Θ(TNm), where T is the number of iterations, N is the number of features and m is the size of the training set S. This holds for a naive implementation which finds the nearest neighbors and their distances from scratch at each step by measuring the distances between the current point and all the other points. RGS is basically an on-line method which can be used in batch mode by running it in epochs over the training set. When it is run for only one epoch, T = m and the complexity is Θ(m²N). Matlab code for this algorithm (and the ones we compare it with) is available at www.cs.huji.ac.il/labs/learning/code/fsr/

4.3 Testing on synthetic data


The use of synthetic data, where we can control the importance of each feature, allows us to illustrate the properties of our algorithm. We compare our algorithm to other common selection methods: infoGain [93], correlation coefficients (corrcoef) and forward selection (see [43]). infoGain and corrcoef simply rank features according to the mutual information or the correlation coefficient (respectively) between each feature and the labels² (i.e. the target function value). Forward selection (fwdSel) is a greedy method in which features are iteratively added into a growing subset. In each step, the feature showing the greatest improvement (given the previously selected subset) is added. This is a search method that can be applied to any evaluation function, and we use it with our criterion (equation 4.2) on feature subsets. This well-known method has the advantages that it considers feature subsets and that it can be used with non-linear predictors. Another algorithm we compare with scores each feature independently using our evaluation function (4.2). This helps us in analyzing RGS, as it may help single out the respective contributions to performance of the properties of the evaluation function and the search method. We refer to this algorithm as SKS (Single feature, kNN regression, feature Selection).

² Feature and function values were binarized by comparing them to the median value.

Figure 4.1: (a)-(d): Illustration of the four synthetic target functions. The plots show the function's value as a function of the first two features. (e),(f): demonstration of the effect of feature selection on estimating the second function using kNN regression (k = 5, β = 0.05): (e) using both features (mse = 0.03), (f) using the relevant feature only (mse = 0.004).
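As a reference for the simplest baseline, a sketch of correlation-coefficient feature ranking (corrcoef); the infoGain baseline is analogous but scores each feature by the mutual information with the target after binarizing both at their medians, as noted in footnote 2.

    import numpy as np

    def corrcoef_ranking(X, g):
        # rank features by the absolute correlation between each feature and the target
        scores = np.array([abs(np.corrcoef(X[:, i], g)[0, 1]) for i in range(X.shape[1])])
        return np.argsort(scores)[::-1]   # feature indices, best first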


We look at four dierent target functions over R50 . The training sets include 20 to 100

points that were chosen randomly from the [−1, 1]50 cube. The target functions are given

in the top row of gure 4.2 and are illustrated in gure 4.1(a-d). A random Gaussian noise

with zero mean and a variance of 1/7 was added to the function value of the training points.

Clearly, only the rst feature is relevant for the rst two target functions, and only the rst

two features are relevant for the last two target functions. Note also that the last function

is a smoothed version of parity function learning and is considered hard for many feature

selection algorithms [43].

First, to illustrate the importance of feature selection on regression quality we use kNN to

estimate the second target function. Figure 4.1(e-f ) shows the regression results for target

(b), using either only the relevant feature or both the relevant and an irrelevant feature.

The addition of one irrelevant feature degrades the MSE ten fold. Next, to demonstrate

the capabilities of the various algorithms, we run them on each of the above problems with

varying training set size. We measure their success by counting the number of times that

the relevant features were assigned the highest rank (repeating the experiment 250 times by

re-sampling the training set). Figure 4.2 presents the success rate as function of training set

size.

Figure 4.2: Success rate of the different algorithms (corrcoef, infoGain, SKS, fwdSel and RGS) on the 4 synthetic regression tasks, (a) x₁², (b) sin(2πx₁ + π/2), (c) sin(2πx₁ + π/2) + x₂ and (d) sin(2πx₁)sin(2πx₂), averaged over 250 repetitions, as a function of the number of training examples. Success is measured by the percentage of the repetitions in which the relevant feature(s) received first place(s).

We can see that all the algorithms succeeded on the first function, which is monotonic and depends on one feature alone. infoGain and corrcoef fail on the second, non-monotonic function. The three kNN-based algorithms succeed because they depend only on local properties of the target function. We see, however, that RGS needs a larger training set to achieve a high success rate. The third target function depends on two features, but the dependency is simple, as each of them alone is highly correlated with the function value. The fourth, XOR-like function exhibits a complicated dependency that requires consideration of the two relevant features simultaneously. SKS, which considers features separately, sees the effect of all other features as noise and therefore has only marginal success on the third function and fails on the fourth altogether. RGS and fwdSel apply different search methods: fwdSel considers subsets but can evaluate only one additional feature in each step, giving it some advantage over RGS on the third function but causing it to fail on the fourth. RGS takes a step in all features simultaneously; only such an approach can succeed on the fourth function.

4.4 Hand Movement Reconstruction from Neural Activity

To suggest an interpretation of neural coding we apply RGS and compare it with the alternatives presented in the previous section³ on the hand movement reconstruction task. The data sets were collected while a monkey performed a planar center-out reaching task with one or both hands [88]. Sixteen electrodes, inserted daily into novel positions in primary motor cortex, were used to detect and sort spikes in up to 64 channels (4 per electrode). Most of the channels detected isolated neuronal spikes by template matching. Some, however, had templates that were not tuned, producing spikes during only a fraction of the session. Others (about 25%) contained unused templates (resulting in a constant zero-producing channel or, possibly, a few random spikes). The rest of the channels (one per electrode) produced spikes by threshold passing. We construct a labeled regression data set as follows. Each example corresponds to one time point in a trial. It consists of the spike counts that occurred in the 10 previous consecutive 100ms-long time bins from all 64 channels (64 × 10 = 640 features), and the label is the X or Y component of the instantaneous hand velocity. We analyze data collected over 8 days. Each data set has an average of 5050 examples collected during the movement periods of the successful trials.

³ fwdSel was not applied due to its intractably high run time complexity. Note that its run time is at least r times that of RGS, where r is the size of the optimal set, and is longer in practice.

Figure 4.3: MSE results for the different feature selection methods (RGS, SKS, infoGain, corrcoef) on the neural activity data sets. Each sub-figure is a different recording day. MSEs are presented as a function of the number of features used. Each point is a mean over all 5 cross validation folds, 5 permutations on the data and the two velocity component targets. Note that some of the data sets are harder than others.
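A minimal sketch of the feature construction just described, assuming the spike counts have already been binned into 100ms bins per channel; the exact alignment of each row with its velocity label is an assumption of this illustration.

    import numpy as np

    def build_lagged_features(counts, n_lags=10):
        # counts: (T, n_channels) spike counts per 100ms bin
        # returns (T - n_lags, n_channels * n_lags): each row stacks the n_lags
        # previous bins of all channels, e.g. 64 channels x 10 lags = 640 features
        T, n_channels = counts.shape
        rows = [counts[t - n_lags:t].ravel() for t in range(n_lags, T)]
        return np.asarray(rows)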

In order to evaluate the different feature selection methods we separate the data into training and test sets. Each selection method is used to produce a ranking of the features. We then apply kNN (based on the training set) to the test set, using different-sized groups of top-ranking features. We use the resulting MSE (or the correlation coefficient between the true and estimated movement) as our measure of quality. To test the significance of the results we apply 5-fold cross validation and repeat the process 5 times on different permutations of the trial ordering. Figure 4.3 shows the average (over permutations, folds and velocity components) MSE as a function of the number of selected features on four of the different data sets (results on the rest are similar and omitted due to lack of space)⁴. It is clear that RGS achieves better results than the other methods throughout the range of feature numbers.

⁴ We use k = 50 (approximately 1% of the data points). β is set automatically as described in section 4.2. These parameters were manually tuned for good kNN results and were not optimized for any of the feature selection algorithms. The number of epochs for RGS was set to 1 (i.e. T = m).

To test whether the performance of RGS was consistently better than that of the other methods, we counted winning percentages (the percentage of times in which RGS achieved a lower MSE than another algorithm) over all folds of all data sets and as a function of the number of features used. Figure 4.4 shows the winning percentages of RGS versus the other methods. For a very low number of features, while the error is still high, RGS's winning scores are only slightly better than chance, but once there are enough features for good predictions the winning percentages are higher than 90%. In figure 4.3 we see that the MSE achieved when using only approximately 100 features selected by RGS is better than when using all the features. This difference is indeed statistically significant (a win score of 92%). If the MSE is replaced by the correlation coefficient as the measure of quality, the average results (not shown due to lack of space) are qualitatively unchanged.

RGS not only ranks the features but also gives them weights that achieve locally optimal results when using kNN regression. It therefore makes sense not only to select the features but also to weigh them accordingly. Figure 4.5 shows the winning percentages of RGS using the weighted features versus RGS using uniformly weighted features. The corresponding MSEs (with and without weights) on the first data set are also displayed. It is clear that using the weights improves the results in a manner that becomes increasingly significant as the number of features grows, especially when the number of features is greater than the optimal number. Thus, using weighted features can compensate for choosing too many by diminishing the effect of the surplus features.

To take a closer look at what features are selected, figure 4.6 shows the 100 highest ranking features for all algorithms on one data set. Similar selection results were obtained in the rest of the folds. One would expect to find that well isolated cells (template matching) are more informative than threshold-based spikes. Indeed, all the algorithms select isolated cells more frequently within the top 100 features (RGS does so 95% of the time and the rest 70%-80%). A human selection of channels, based only on looking at raster plots and selecting channels with stable firing rates, was also available to us. This selection was independent of the template/threshold categorization. Once again, the algorithms selected the humanly preferred channels more frequently than the other channels. Another and more interesting observation that can also be seen in the figure is that while corrcoef, SKS and infoGain tend to select all time lags of a channel, RGS's selections are more scattered (more channels and only a few time bins per channel). Since RGS achieves the best results, we conclude that this selection pattern is useful. Apparently RGS found these patterns thanks to its ability to evaluate complex dependency on feature subsets. This suggests that such dependency of the behavior on the neural activity does exist.

Figure 4.6: The 100 highest ranking features (grayed out) selected by the algorithms (RGS, SKS, corrCoef, infoGain). Results are for one fold of one data set. In each sub-figure the bottom row is the (100ms) time bin with the least delay and the higher rows correspond to longer delays. Each column is a channel (silent channels omitted).

Figure 4.4: Winning percentages of RGS over the other algorithms (SKS, infoGain, corrcoef) as a function of the number of features. RGS achieves better MSEs consistently.

Figure 4.5: Winning percentages of RGS with and without weighting of features (black), as a function of the number of features. Gray lines are the corresponding MSEs of these methods on the first data set.

4.5 Summary

In this chapter we presented a new method for selecting features for function estimation and used it to analyze neural activity during a motor control task. We used the leave-one-out mean squared error of the kNN estimator and minimized it using gradient ascent on an almost everywhere smooth function. This yields a selection method which can handle a complicated dependency of the target function on groups of features, yet can be applied to large scale problems. This is valuable since many common selection methods lack one of these properties. By comparing the results of our method to other selection methods on the motor control task, we showed that consideration of complex dependency helps to achieve better performance. These results suggest that this is an important property of the code.
Chapter 5

Learning to Select Features¹

In the standard framework of feature selection discussed in the previous chapters, the task is to find a (small) subset of features, out of a given set of features, that is sufficient to predict the target labels well. Thus, the feature selection methods discussed so far tell us which features are better. However, they do not tell us what characterizes these features or how to judge new features which were not evaluated using the labeled data. In this chapter we extend the standard framework and present a novel approach to the task of feature selection that attempts to learn properties of good features. We claim that in many cases it is natural to represent each feature by a set of properties, which we call meta-features. As a simple example, in image related tasks where the features are gray-levels of pixels, two meta-features can be the (x, y) position of each pixel. We use the training set in order to learn the relation between the meta-feature values and feature usefulness. This in turn enables us to predict the quality of unseen features. This may be useful in many applications, e.g., for predicting the future usefulness of a new word (that did not appear in the training set) for text classification. Another possible application is predicting the worthiness of a candidate medical examination based on the properties of this examination and the properties and worthiness of previous examinations. It is also a very useful tool for feature extraction. As described in section 1.1.2, in feature extraction the original features are used to generate new, more complex features. One example of such extracted features in image related tasks is the set of all products of 3 pixels. Any function of the original features can be used as an extracted feature, and there is a virtually infinite number of such functions. Thus the learner has to decide which potential complex features have a good chance of being the most useful. For this task we derive a selection algorithm (called Mufasa) that uses meta-features to explore a huge number of candidate features efficiently. We also derive generalization bounds for the joint problem of feature selection (or extraction) and classification when the selection is made using meta-features. These bounds are better than the bounds obtained for direct selection.

¹ The results presented in this chapter were first presented in a paper titled Learning to Select Features using their Properties, submitted to JMLR on 29 August 2006.

We also show how our concept can be applied in the context of inductive transfer ([10, 110, 16]). As described in section 1.1.5, in inductive transfer one tries to use knowledge acquired in previous tasks to enhance performance on the current task. The rationale lies in the observation that a human does not learn isolated tasks, but rather learns many tasks in parallel or sequentially. In this context, one key question is what kind of knowledge we can transfer between tasks. One option, which is very popular, is to share knowledge about the representation, or more specifically, knowledge about feature usefulness. Here we suggest and show that it might be better to share knowledge about the properties of good features instead of knowledge about which features are good. The problem of handwritten digit recognition is used throughout this chapter to illustrate the various applications of our novel approach.

Related work: [108] used meta-features of words for text classification when there are features (words) that are unseen in the training set but appear in the test set. In their work the features are words and the meta-features of a word are the words in its neighborhood. They used the meta-features to predict the role of words that are unseen in the training set. Generalization from observed (training) features to unobserved features is discussed in [66]. Their approach involves clustering the instances based on the observed features. What these works and ours have in common is that they all extend learning from the standard instance-label framework to learning in the feature space. Our formulation here, however, is different and allows a mapping of the feature learning problem onto the standard supervised learning framework (see Table 5.1). Another related model is Budget Learning ([75, 42]), which explores the issue of deciding which is the most valuable feature to measure next under a limited budget. Other ideas that use feature properties to produce or select good features can be found in the literature and have been used in various applications. For instance, [71] used this rationale in the context of inductive transfer for object recognition. Very recently, [95] also used this approach in the same context for text classification. They use a property of pairs of words, which indicates whether or not they are synonyms, for the task of estimating the words' covariance matrix. [58] used property-based clustering of features for handwritten Chinese recognition and other applications. Our formulation encompasses a more general framework and suggests a systematic way to use the properties, as well as to derive algorithms and generalization bounds for the combined process of feature selection and classification.

5.1 Formal Framework


Recall that in the standard supervised learning framework (see section 1.1.1) it is assumed that each instance is a vector x ∈ R^N and the N coordinates are the features. However, we can also consider the instances as abstract entities in a space S and think of the features as measurements on the instances. Thus each feature f can be considered as a function from S to R, i.e., f : S → R. We denote the set of all the features by {f_j}_{j=1}^N. In the following we use the term feature to describe both raw input variables (e.g., pixels in an image) and variables constructed from the original input variables using some function (e.g., the product of 3 pixels in the image).

Also, recall that the standard task of feature selection is to select a subset of the given N features that enables good prediction of the label (see section 1.2). As described in the previous chapters, this is done by looking for features which are more useful than the others. However, as already mentioned, in this chapter we want to extend this framework and find what characterizes the better features. Thus we further assume that each feature is described by a set of properties u(·) = {u_r(·)}_{r=1}^k which we call meta-features. Formally, each u_r(·) is a function from the space of possible measurements to R. Thus each feature f is described by a vector u(f) = (u_1(f), ..., u_k(f)) ∈ R^k. Note that the meta-features do not depend on the instances, a fact which is important for understanding our analysis in what follows. We also denote a general point in the image of u(·) by u. The new framework we introduce in this chapter forces us to introduce some non-standard notation. In order to make it easier for the reader to become familiarized with this notation, we provide a summary table in appendix A at the end of this chapter.

Table 5.1: Feature learning by meta-features as a standard supervised learning problem
    Training set                         ->  Features described by meta-features
    Test set                             ->  Unobserved features
    Labels                               ->  Feature quality
    Hypothesis class                     ->  Class of mappings from meta-features to quality
    Generalization in feature selection  ->  Predicting the quality of new features
    Generalization in the joint problem  ->  Low classification error
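To make this formal setup concrete, here is a minimal Python sketch (ours, not part of the thesis; the class name Feature and the helper pixel_feature are purely illustrative) of a feature as a measurement function on abstract instances together with its meta-feature vector u(f):

from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class Feature:
    """A feature f: S -> R together with its meta-feature vector u(f) in R^k."""
    measure: Callable[[Any], float]   # f(s): maps an abstract instance s in S to a real value
    meta: Sequence[float]             # u(f) = (u_1(f), ..., u_k(f))

# Example: pixel features of a 28x28 image, with the (x, y) position as meta-features (k = 2).
def pixel_feature(x: int, y: int) -> Feature:
    return Feature(measure=lambda img: float(img[y][x]), meta=(x, y))

features = [pixel_feature(x, y) for y in range(28) for x in range(28)]   # N = 784 features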

5.2 Predicting the Quality of Features


Before we describe the more useful applications, we start by describing the most obvious use of the framework presented in this chapter. Here we assume that we observe only a subset of the N features; i.e., in the training set we see the values of only some of the features. We can directly measure the quality (i.e. usefulness) of these features using the training set, but we also want to be able to predict the quality of the unseen features. Thus we want to think of the training set not only as a training set of instances, but also as a training set of features.

Thus we want to predict the quality of a new unseen feature. At this stage we evaluate the quality of each feature alone.² More formally, our goal is to use the training set S^m and the set of meta-features for learning a mapping Q̂ : R^k → R that predicts the quality of a feature using the values of its meta-features. For this we assume that we have a way to measure the quality of each feature that does appear in the training set. This measure can be any kind of standard evaluation function that uses the labeled training set to evaluate features (e.g., Infogain or wrapper based). Y_MF denotes the vector of measured qualities, i.e., Y_MF(j) is the measured quality of the j-th feature in the training set. Now we have a new supervised learning problem, with the original features as instances, the meta-features as features and Y_MF as the (continuous) target label. The analogy to the standard supervised problem is summarized in Table 5.1. Thus we can use any standard regression learning algorithm to find the required mapping from meta-features to quality. The above procedure is summarized in algorithm 5. Note that this procedure uses a standard regression learning procedure; that is, the generalization ability to new features can be derived using standard generalization bounds for regression learning.

² We discuss the evaluation of subsets in sections 5.3 and 5.6.

Algorithm 5 Q̂ = quality_map(S^m, featquality, regalg)

1. Measure the feature quality vector: Y_MF = featquality(S^m).
2. Calculate the N × k meta-feature matrix X_MF.
3. Use the regression algorithm to learn a mapping from meta-feature values to quality: Q̂ = regalg(X_MF, Y_MF).
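A minimal Python sketch of this procedure (ours, not the thesis implementation): it assumes binary feature values and integer class labels so that Infogain can be computed directly, and it uses plain least squares in place of the generic regression algorithm regalg; the helper names infogain and quality_map are our own.

import numpy as np

def infogain(x, y):
    """Information gain (mutual information, in bits) between a binary feature x and integer labels y."""
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    h_y = entropy(np.bincount(y) / len(y))
    h_y_given_x = 0.0
    for v in (0, 1):
        mask = (x == v)
        if mask.any():
            h_y_given_x += mask.mean() * entropy(np.bincount(y[mask]) / mask.sum())
    return h_y - h_y_given_x

def quality_map(X_train, y_train, X_mf):
    """Algorithm 5: learn a mapping Q_hat from meta-feature values to feature quality.
    X_train: m x N matrix of observed binary features, y_train: m integer labels,
    X_mf:    N x k matrix of the meta-feature values of the observed features."""
    # 1. measure the quality of every observed feature
    Y_mf = np.array([infogain(X_train[:, j], y_train) for j in range(X_train.shape[1])])
    # 2-3. regress the measured quality on the meta-features (least squares with a bias term)
    A = np.hstack([X_mf, np.ones((X_mf.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, Y_mf, rcond=None)
    return lambda U: np.hstack([U, np.ones((U.shape[0], 1))]) @ w   # rows of U are meta-feature vectors

The returned mapping can then be applied to the meta-feature vectors of features that were never observed in the training set.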

We now present a toy illustration of the ability to predict the quality of unseen features by the above procedure, using a handwritten digit recognition task (OCR). For this purpose we used the MNIST ([70]) dataset, which contains 28 × 28 pixel images of centered digits (0 . . . 9). We converted the pixels from gray-scale to binary by thresholding. Here we use the 784 original pixels as the input features and the (x, y) location of each pixel as its meta-features. Recall that we do not try to generalize along the instance dimension, but instead try to generalize along the feature dimension, using the meta-features. Thus we first choose a fixed set of 2000 images from the dataset. Then we put aside a random subset of 392 features as a test set of features. Then we use a growing number of training features, which were chosen randomly out of the remaining 392 features. Our measure of quality here is Infogain. Now, given a training set of features, we use linear regression³ to learn a mapping from meta-features to quality. Using this mapping we predict the quality of the 392 testing features.

³ The linear regression is done over an RBF representation of the (x, y) location. We used 49 Gaussians (with std = 3 pixels) located on a grid over the image and represent the location by a 49-dimensional vector of the responses of the Gaussians at this location.

We check the accuracy of the prediction by comparing it to the quality that was measured by directly applying Infogain on the testing features. We make the comparison in two different ways: (1) by the correlation coefficient between the two; (2) by checking the accuracy of the 1-Nearest-Neighbor (1-NN) classifier on a test set of 2000 instances, using a growing number of selected features out of the 392 testing features, where the features are sorted by either the predicted quality or the direct Infogain. In order to check the statistical significance, we repeat the experiment 20 times by re-splitting into train and test sets. The results are presented in Figure 5.1 and Figure 5.2 respectively. It is clear from the graphs that if we have enough training features, we can predict the quality of new features with high accuracy. In Figure 5.2 we see that the classification error obtained using the predicted quality is similar to the error obtained by direct use of Infogain. This occurs even though our prediction was done using only ∼1/8 of the features.

Figure 5.1: The correlation coefficient between the predicted quality and the real quality of testing features as a function of the number of training features. Error bars show the standard deviation.

Figure 5.2: The classification error of 1-NN on a testing set of instances as a function of the number of top features, ranked either by the prediction obtained using 100 training features or by direct Infogain on the test features. The result of random ranking is also presented. Error bars show the standard deviation.
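For completeness, here is a sketch of the RBF representation described in the footnote (our own code; the 7 × 7 grid spacing is our assumption, since the text only states that the 49 Gaussians lie on a grid over the image):

import numpy as np

def rbf_encode(xy, grid_size=7, img_size=28, sigma=3.0):
    """Map an (x, y) pixel location to the 49-dimensional vector of Gaussian responses."""
    ticks = np.linspace(0, img_size - 1, grid_size)
    cx, cy = np.meshgrid(ticks, ticks)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)        # 49 x 2 grid of centers
    d2 = np.sum((centers - np.asarray(xy, dtype=float)) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Feeding these 49-dimensional vectors (one per pixel) into the quality_map sketch above
# reproduces the linear-regression-over-RBF setup of this experiment.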

5.3 Guided Feature Extraction


Here we show how the low-dimensional representation of features by a relatively small number of meta-features enables efficient selection even when the number of potential features is very large or even infinite. This is highly relevant to the feature extraction scenario. We tackle the problem of choosing which features to generate out of the huge number of potential features in the following way. Instead of evaluating all features, we evaluate a relatively small number of features and learn to predict which other features are better. This way we can guide our search for good features.

5.3.1 Meta-features Based Search

Assume that we want to select (or extract) a set of n features out of the large number of N potential features. We define a stochastic mapping from values of meta-features to the selection (or extraction) of features. More formally, let V be a random variable that indicates which feature is selected. We assume that each point u in the meta-feature space induces a density p(v|u) over the features. Our goal is to find a point u in the meta-feature space such that drawing n features (independently) according to p(v|u) has a high probability of giving us a good set of n features. For this purpose we suggest Mufasa (for Meta-Features Aided Search Algorithm, algorithm 6), which uses a stochastic local search in the meta-feature space.

Algorithm 6 F_best = Mufasa(n, J)

1. Initialization: q_best ← −∞, u ← u_0, where u_0 is the initial guess of u.
2. For j = 1 . . . J
   (a) Select (or generate) a new set F_j of n random features according to p(v|u).
   (b) q_j = quality(F_j). (Any measure of quality, e.g. cross-validation classification accuracy.)
   (c) If q_j ≥ q_best: F_best = F_j, u_best = u, q_best = q_j.
   (d) Randomly select a new u which is near u_best. For example: u ← u_best + noise.
3. Return F_best.

Note that Mufasa does not use an explicit prediction of the quality of unseen features as we did in section 5.2, but it is clear that it cannot work unless the meta-features are informative about quality. Namely, Mufasa can only work if the chance of drawing a good set of features from p(v|u) is some continuous function of u, i.e., a small change in u results in a small change in the chance of drawing a good set of features. If, in addition, the meta-feature space is simple⁴, we expect it to find a good point in a small number of steps. The number of steps plays an important role in the generalization bounds we present in section 5.4.1. Like any local search over a non-convex target function, convergence to a global optimum is not guaranteed, but any standard technique for avoiding local maxima can be used. In section 5.3.2 we demonstrate the ability of Mufasa to efficiently select good features in the presence of a huge number of candidate (extracted) features on the handwritten digit recognition problem. In section 5.4.1 we present a theoretical analysis of Mufasa.

⁴ We elaborate on the meaning of "simple" in sections 5.4 and 5.6.
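A minimal Python sketch of Mufasa (ours): the caller supplies sample_features(u, n) implementing p(v|u), a quality function (e.g. cross-validation accuracy of some classifier on the selected feature set), and a perturb step for the local move; all three names are our own, and q_best is initialized to minus infinity so that the first candidate set is always accepted.

import numpy as np

def mufasa(u0, n, J, sample_features, quality, perturb, rng=None):
    """Meta-Features Aided Search Algorithm: stochastic local search in the meta-feature space."""
    rng = rng if rng is not None else np.random.default_rng()
    u_best = np.asarray(u0, dtype=float)      # initial guess u_0
    q_best, F_best = -np.inf, None
    u = u_best.copy()
    for _ in range(J):
        F = sample_features(u, n)             # step 2a: draw n features according to p(v | u)
        q = quality(F)                        # step 2b: any quality measure, e.g. cross-validation accuracy
        if q >= q_best:                       # step 2c: remember the best feature set seen so far
            F_best, u_best, q_best = F, u.copy(), q
        u = perturb(u_best, rng)              # step 2d: random move near the best point, e.g. u_best + noise
    return F_best

In the experiment below, quality would be the 2-fold cross-validation accuracy of a linear SVM, and sample_features would draw random features satisfying u until a feature budget is exhausted.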

5.3.2 Illustration on Digit Recognition Task


In this section we revisit the handwritten digit recognition problem described in section 5.2 and use it to demonstrate how Mufasa works for feature extraction. Once again we use the MNIST dataset, but this time we deal with more complicated extracted features. These extracted features have the following form: a logical AND of 1 to 8 pixels (or their negations), which are referred to as inputs. In addition, a feature can be shift invariant for shifts of up to shiftInvLen pixels; that is, we take a logical OR of the outputs at all the different locations we can reach by shifting. Thus a feature is defined by specifying the set of inputs, which of the inputs are negated, and the value of shiftInvLen. Similar features were already used by [30] on the MNIST dataset, but with a fixed number of inputs and without shift invariance. The idea of using shift invariance for digit recognition is also not new, and was used for example by [104]. It is clear that there is a huge number of such features; thus we have no practical way to measure or use all of them. Therefore we need some guidance for the extraction process, and this is the point where the meta-features framework comes in.
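To make the form of these extracted features concrete, the following sketch (ours, not code from the thesis) evaluates one such feature on a binary 28 × 28 image; taking the shifts over a shiftInvLen × shiftInvLen window and clipping at the image border are our own assumptions.

import numpy as np

def eval_feature(img, inputs, negated, shift_inv_len):
    """Logical AND of the given pixels (or their negations), OR-ed over all shifts of up to
    shift_inv_len pixels. img: 28x28 binary array; inputs: list of (x, y) pixel coordinates;
    negated: list of booleans, one per input."""
    h, w = img.shape
    for dy in range(shift_inv_len):
        for dx in range(shift_inv_len):
            val = True
            for (x, y), neg in zip(inputs, negated):
                xx, yy = min(x + dx, w - 1), min(y + dy, h - 1)   # crude border handling
                bit = bool(img[yy, xx])
                val = val and (not bit if neg else bit)
            if val:
                return 1.0    # the OR over shifts: one satisfied placement is enough
    return 0.0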

We use the following four meta-features:

1. numInputs: the number of inputs (1-8).

2. percentPos: the percent of positive (non-negated) inputs (0-100, rounded).

3. shiftInvLen: the maximum allowed shift value (1-8).

4. scatter: the average distance of the inputs from their center of gravity (COG) (1-3.5).

In order to find a good value of the meta-features we use Mufasa (algorithm 6) with different values of the allowed budget. We use a budget (and not just the number of features) since features with a large shiftInvLen are more computationally expensive. We defined the cost of measuring (calculating) a feature as 0.5(1 + a²), where a is the shiftInvLen of the feature; this way the cost is proportional to the number of locations at which we measure the feature. We use 2000 images as a training set, and the number of steps, J, is 50. We choose specific features, given a value of the meta-features, by re-drawing features randomly from a uniform distribution over the features that satisfy the given value of the meta-features until the full allowed budget is used up. We use 2-fold cross validation of a linear Support Vector Machine (SVM) ([14]) to check the quality of the set of selected features in each step. We use the multi-class SVM toolbox developed by [23]. Finally, for each value of the allowed budget we check the results obtained by the linear SVM (that uses the selected features) on a test set of another 2000 images.

We compare the results with those obtained using the features selected by Infogain as follows. We first draw features randomly using a budget which is 50 times larger, then we sort them by Infogain normalized by the cost⁵ and take a prefix that uses up the allowed budget (referred to as MF+Norm Infogain). As a sanity check, we also compare the results to those obtained by doing 50 steps of choosing features of the allowed budget randomly, i.e. over all possible values of the meta-features, and then using the set with the lowest 2-fold cross-validation error (referred to as MF+Rand). We also compare our results with SVM with a polynomial kernel of degree 1-4, which uses the original pixels as input features. This comparison is relevant since an SVM with a polynomial kernel of degree k implicitly uses ALL the products of up to k pixels, and the product is equal to AND for binary pixels. To evaluate the statistical significance, we repeat each experiment 10 times, with different partitions into train and test sets. The results are presented in Figure 5.3. It is clear that Mufasa outperforms the budget-dependent alternatives, and outperforms SVM for budgets larger than 3000 (i.e. about 600 features). It is worth mentioning that our goal here is not to compete with the state-of-the-art results on MNIST, but to illustrate our concept and to compare the results of the same kind of classifier with and without our meta-features guided search. Note that our concept can be combined with most kinds of classification, feature selection, and feature extraction algorithms to improve them, as discussed in section 5.8.

⁵ Infogain without normalization gives worse results.

Figure 5.3: Guided feature extraction for digit recognition. The generalization error rate as a function of the available budget for features, using the different selection methods (Mufasa, MF+Rand, MF+Norm. Infogain and polynomial SVM; x-axis: total budget ×10^4, y-axis: test error). The number of training instances is 2000. Error bars show a one standard deviation confidence interval. SVM is not limited by the budget, and always implicitly uses all the products of features. We only present the results of SVM with a polynomial kernel of degree 2, the value that gave the best results in this case.
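The budget mechanism described above can be sketched as follows (our own illustration; draw_feature(u) stands for drawing one random feature that satisfies the meta-feature values u, and the attribute shift_inv_len is hypothetical):

def fill_budget(u, budget, draw_feature):
    """Step 2a of Mufasa under a budget: draw random features satisfying the meta-feature
    values u until the allowed budget is used up. A feature with shiftInvLen = a costs 0.5 * (1 + a**2)."""
    features, spent = [], 0.0
    while True:
        f = draw_feature(u)                          # a random feature with the given meta-feature values
        cost = 0.5 * (1 + f.shift_inv_len ** 2)
        if spent + cost > budget:
            return features
        features.append(f)
        spent += cost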
Figure 5.4: Optimal value of the different meta-features as a function of the budget. Panels: (a) number of inputs, (b) shift invariance, (c) percent of positive inputs, (d) scatter; x-axis: log10(total budget). Error bars indicate the range within which the values fall in 90% of the runs. We can see that the optimal amount of shift invariance, the optimal number of inputs and the optimal percent of positive inputs grow with the budget.

Another benefit of the meta-features guided search is that it helps in understanding the problem. To see this we need to take a closer look at the chosen values of the meta-features (u_best) as a function of the available budget. Figure 5.4 presents the average chosen value of each meta-feature as a function of the budget. We can see (in Figure 5.4b) that when the budget is very limited, it is better to take a larger number of cheap features rather than fewer, more expensive shift-invariant features. On the other hand, when we increase the budget, adding these expensive complex features becomes worthwhile. We can also see that when the budget grows, the optimal number of inputs grows, as does the optimal percent of positive inputs. This occurs because for a small budget, we prefer features that are less specific and have relatively high entropy, at the expense of in-class variance. For a large budget, we can permit ourselves to use sparse features (with a low probability of being 1), but gain specificity. For the scatter meta-feature, there is apparently no correlation between the budget and the optimal value.

The vertical lines (error bars) represent the range of selected values in the different runs. They give us a sense of the importance of each meta-feature. A smaller error bar indicates a higher sensitivity of the classifier performance to the value of the meta-feature. For example, we can see that performance is sensitive to shiftInvLen and relatively indifferent to percentPos.

5.4 Theoretical Analysis


In this section we derive generalization bounds for the combined process of selection and classification, when the selection process is based on meta-features. We show that in some cases these bounds are far better than the bounds obtained when each feature can be selected directly. This is because we can significantly narrow the number of possible selections and still find a good set of features. In section 5.4.1 we perform the analysis for the case where the selection is made using Mufasa (algorithm 6). In section 5.4.2 we present a more general analysis, which is independent of the selection algorithm, but instead assumes that we have a given class of mappings from meta-features to a selection decision.

5.4.1 Generalization Bounds for the Mufasa Algorithm

The bounds presented in this section assume that the selection is made using Mufasa (algorithm 6), but they could be adapted to other meta-feature based selection algorithms. Before presenting the bounds, we need some additional notation. We assume that the classifier that is going to use the selected features is chosen from a hypothesis class H_c of real valued functions, and the classification is done by taking the sign.

We also assume that we have a hypothesis class H_fs, where each hypothesis is one possible way to select n out of the N features. Using the training set, our feature selection is limited to selecting one of the hypotheses that is included in H_fs. As we show later, if H_fs contains all the possible ways of choosing n out of N features, then we get an unattractive generalization bound for large values of n and N. Thus we use meta-features to further restrict the cardinality (or complexity) of H_fs. We have a combined learning scheme of choosing both h_c ∈ H_c and h_fs ∈ H_fs. We can view it as choosing a single classifier from H_fs × H_c. In the following paragraphs we analyze the equivalent size of the hypothesis space H_fs of Mufasa as a function of the number of steps of the algorithm.

For the theoretical analysis, we can reformulate Mufasa into an equivalent algorithm that is split into two stages, called Mufasa1 and Mufasa2. The first stage (Mufasa1) is randomized, but does not use the training set. The second stage (Mufasa2) is deterministic, and performs the described search. Since Mufasa2 is deterministic, we must generate all potentially required hypotheses in Mufasa1, using the same probability distribution over feature generation that may be required by step 2a of the algorithm. Then, during the search, the second stage can deterministically select hypotheses from the hypotheses created in the first stage, rather than generate them randomly. Eventually, the second stage selects a single good hypothesis out of the hypotheses created by Mufasa1 using the search. Therefore, the size of H_fs, denoted by |H_fs|, is the number of hypotheses created in Mufasa1.

The following two lemmas upper bound |H_fs|, which is a dominant quantity in the generalization bound. The first one handles the case where the meta-features have discrete values, and there is a relatively small number of possible values of the meta-features. This number is denoted by |MF|.

Lemma 5.1 Any run of Mufasa can be duplicated by first generating J|MF| hypotheses by Mufasa1 and then running Mufasa2 using these hypotheses only, i.e. using |H_fs| ≤ J|MF|, where J is the number of iterations made by Mufasa and |MF| is the number of different values the meta-features can be assigned.

Proof. Let Mufasa1 generate J random feature sets for each of the |MF| possible values of the meta-features. The total number of sets we get is J|MF|. We have only J iterations in the algorithm, and we generated J feature sets for each possible value of the meta-features. Therefore, it is guaranteed that all hypotheses required by Mufasa2 are available.

Note that in order to use the generalization bound of the algorithm, we cannot consider only the subset of J hypotheses that was tested by the algorithm. This is because this subset of hypotheses is affected by the training set (just as one cannot choose a single hypothesis using the training set, and then claim that the hypothesis space of the classifier includes only one hypothesis). However, from lemma 5.1 we get that the algorithm searches within no more than J|MF| feature selection hypotheses, which were determined without using the training set.

The next lemma handles the case where the cardinality of all possible values of the meta-features is large relative to 2^J, or even infinite. In this case we can get a tighter bound that depends on J but not on |MF|.

Lemma 5.2 Any run of Mufasa can be duplicated by first generating 2^J hypotheses by Mufasa1 and then running Mufasa2 using these hypotheses only, i.e. using |H_fs| ≤ 2^J, where J is the number of iterations Mufasa performs.

Proof. In Mufasa1, we build the entire tree of possible random updates of the meta-features (step 2d of algorithm 6) and generate the features. Since we do not know the label, in the j-th iteration we have 2^{j-1} hypotheses we need to generate. The total number of hypotheses is then (2^J − 1).

To state our theorem we also need the following standard definitions:

Definition 5.1 Let D be a distribution over S × {±1} and h : S → {±1} a classification function. We denote by er_D(h) the generalization error of h w.r.t. D:

    er_D(h) = Pr_{s,y∼D}[h(s) ≠ y]

For a sample S^m = {(s_k, y_k)}_{k=1}^m ∈ (S × {±1})^m and a constant γ > 0, the γ-sensitive training error is:

    êr_S^γ(h) = (1/m) |{i : y_i h(s_i) < γ}|

Now we are ready to present our main theoretical results:

Theorem 5.1 Let H_c be a class of real valued functions. Let S be a sample of size m generated i.i.d. from a distribution D over S × {±1}. If we choose a set of features using Mufasa, then with probability 1 − δ over the choices of S, for every h_c ∈ H_c and every γ ∈ (0, 1]:

    er_D(h_c) ≤ êr_S^γ(h_c) + √( (2/m) ( d ln(34em/d) log(578m) + ln(8/(γδ)) + g(J) ) )

where d = fat_{H_c}(γ/32) and g(J) = min(J ln 2, ln(J|MF|)), with J the number of steps Mufasa performs and |MF| the number of different values the meta-features can be assigned, if this value is finite, and ∞ otherwise.

Our main tool in proving the above theorem is the following theorem:

Theorem 5.2 (Bartlett, 1998) Let H be a class of real valued functions. Let S be a sample of size m generated i.i.d. from a distribution D over S × {±1}; then with probability 1 − δ over the choices of S, for every h ∈ H and every γ ∈ (0, 1]:

    er_D(h) ≤ êr_S^γ(h) + √( (2/m) ( d ln(34em/d) log(578m) + ln(8/(γδ)) ) )

where d = fat_H(γ/32).

Proof. (of theorem 5.1) Let {F_1, ..., F_|H_fs|} be all possible subsets of selected features. From theorem 5.2 we know that

    er_D(h_c, F_i) ≤ êr_S^γ(h_c, F_i) + √( (2/m) ( d ln(34em/d) log(578m) + ln(8/(γ δ_{F_i})) ) )

where er_D(h_c, F_i) denotes the generalization error of the selected hypothesis for the fixed set of features F_i.

By choosing δ_F = δ/|H_fs| and using the union bound, we get that the probability that there exists an F_i (1 ≤ i ≤ |H_fs|) for which the equation below does not hold is less than δ:

    er_D(h_c) ≤ êr_S^γ(h_c) + √( (2/m) ( d ln(34em/d) log(578m) + ln(8/(γδ)) + ln|H_fs| ) )

Therefore, with probability 1 − δ the above equation holds for any algorithm that selects one of the feature sets out of {F_1, ..., F_|H_fs|}. Substituting the bounds on |H_fs| from lemma 5.1 and lemma 5.2 completes the proof.

An interesting point about this bound is that it is independent of the number of selected features and of the total number of possible features (which may be infinite in the case of feature generation). Nevertheless, it allows selecting a good set of features out of O(2^J) candidate sets. These sets may be non-overlapping, so the potential number of features that are candidates is O(n2^J). For comparison, in chapter 3 we presented the same kind of bound but for direct feature selection. The bound we presented there has the same form as the bound we present here, but with g(J) replaced by a term of O(ln N), which is typically much larger than J ln 2. If we substitute N = n2^J, then for the experiment described in section 5.3.2, n ln N = Jn ln(2n) ≈ 375000 while ln(J|MF|) ≈ 11.

5.4.2 VC-dimension of Joint Feature Selection and Classification

In the previous section we presented an analysis which assumes that the selection of features is made using Mufasa. In this section we turn to a more general analysis, which is independent of the specific selection algorithm, and instead assumes that we have a given class H_s of mappings from meta-features to a selection decision. Formally, H_s is a class of mappings from a meta-feature value to {0, 1}; that is, for each h_s ∈ H_s, h_s : R^k → {0, 1}. h_s defines which features are selected as follows:

    f is selected ⟺ h_s(u(f)) = 1

where, as usual, u(f) is the value of the meta-features for feature f. Given the values of the meta-features of all the features, together with h_s, we get a single feature selection hypothesis. Therefore, H_s and the set of possible values of the meta-features indirectly define our feature selection hypothesis class, H_fs. Since we are interested in selecting exactly n features (n is predefined), we use only the subset of H_s that includes only functions that imply selection of n features.⁶ For simplicity, in the analysis we use the VC-dim of H_s without this restriction, which is an upper bound on the VC-dim of the restricted class.

⁶ Note that one valid way to define H_s is by applying a threshold to a class of mappings from meta-feature values to feature quality, Q̂ : R^k → R. See, e.g., example 2 at the end of this section.

Our goal is to calculate an upper bound on the VC-dimension ([117]) of the joint problem of feature selection and classification. To achieve this, we first derive an upper bound on |H_fs| as a function of VC-dim(H_s) and the number of features N.

Lemma 5.3 Let H_s be a class of mappings from the meta-feature space (R^k) to {0, 1}, and let H_fs be the induced class of feature selection schemes; then the following inequality holds:

    |H_fs| ≤ ( eN / VC-dim(H_s) )^{VC-dim(H_s)}

Proof. The above inequality follows directly from the well known fact that a class with VC-dim d cannot produce more than (em/d)^d different partitions of a sample of size m (see, for example, [60] pp. 57).

The next lemma relates the VC-dimension of the classification concept class (d_c), the cardinality of the selection class (|H_fs|), and the VC-dim of the joint learning problem.

Lemma 5.4 Let H_fs be a class of possible selection schemes for selecting n features out of N and let H_c be a class of classifiers over R^n. Let d_c = d_c(n) be the VC-dim of H_c. If d_c ≥ 11 then the VC-dim of the combined problem (i.e. choosing (h_fs, h_c) ∈ H_fs × H_c) is bounded by (d_c + log|H_fs| + 1) log d_c.

Proof. For a given set of selected features, the possible number of classifications of m instances is upper bounded by (em/d_c)^{d_c} (see [60] pp. 57). Thus, for the combined learning problem, the total number of possible classifications of m instances is upper bounded by |H_fs|(em/d_c)^{d_c}. The following chain of inequalities shows that if m = (d_c + log|H_fs| + 1) log d_c then |H_fs|(em/d_c)^{d_c} < 2^m:

    |H_fs| ( e(d_c + log|H_fs| + 1) log d_c / d_c )^{d_c}
        = |H_fs| (e log d_c)^{d_c} ( 1 + (log|H_fs| + 1)/d_c )^{d_c}
        ≤ e (|H_fs|)^{1+log e} (e log d_c)^{d_c}          (5.1)
        ≤ (|H_fs|)^{2+log e} (e log d_c)^{d_c}            (5.2)
        ≤ (|H_fs|)^{2+log e} d_c^{d_c+1}                  (5.3)
        ≤ d_c^{d_c+1} (|H_fs|)^{log d_c}                  (5.4)
        = d_c^{d_c+1} d_c^{log|H_fs|}                     (5.5)
        = 2^{(d_c+1+log|H_fs|) log d_c}

where we used the following equations / inequalities:

    (5.1) (1 + a/d)^d ≤ e^a for all a, d > 0
    (5.2) here we assume |H_fs| > e, otherwise the lemma is trivial
    (5.3) (e log d)^d ≤ d^{d+1} for all d ≥ 1
    (5.4) log d_c > 2 (since d_c ≥ 11)
    (5.5) a^{log b} = b^{log a} for all a, b > 1

Therefore, (d_c + log|H_fs| + 1) log d_c is an upper bound on the VC-dim of the combined learning problem.

Now we are ready to state the main theorem of this section.

Theorem 5.3 Let H_s be a class of mappings from the meta-feature space (R^k) to {0, 1}, let H_fs be the induced class of feature selection schemes for selecting n out of N features and let H_c be a class of classifiers over R^n. Let d_c = d_c(n) be the VC-dim of H_c. If d_c ≥ 11, then the VC-dim of the joint class H_fs × H_c is upper bounded as follows:

    VC-dim(H_fs × H_c) ≤ ( d_c + d_s log(eN/d_s) + 1 ) log d_c

where d_s is the VC-dim of H_s.

The above theorem follows directly by substituting lemma 5.3 into lemma 5.4.

To illustrate the gain from the above theorem we calculate the bound for a few specific choices of H_s and H_c:

1. First, we should note that if we do not use meta-features, but consider all the possible ways to select n out of N features, the above bound is replaced by

       ( d_c + log (N choose n) + 1 ) log d_c        (5.6)

   which is very large for reasonable values of N and n.

2. Assuming that both H_s and H_c are classes of linear classifiers, on R^k and R^n respectively, then d_s = k + 1 and d_c = n + 1, and we get that the VC-dim of the combined problem of selection and classification is upper bounded by

       O( (n + k log N) log n )

   If H_c is a class of linear classifiers, but we allow any selection of n features, the bound is (by substituting into 5.6):

       O( (n + n log N) log n )

   which is much larger if k ≪ n. Thus in the typical case, where the number of meta-features is much smaller than the number of selected features (e.g. in section 5.3.2), the bound for meta-feature based selection is much smaller.

3. Assuming that the meta-features are binary and H_s is the class of all possible functions from the meta-features to {0, 1}, then d_s = 2^k and the bound is

       O( (d_c + 2^k log N) log d_c )

   which might still be much better than the bound in equation 5.6 if k ≪ log n.
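To get a feeling for the gap in example 2, the following snippet compares the log(N choose n) term of equation 5.6 with the d_s log(eN/d_s) term of theorem 5.3 for linear H_s, using purely hypothetical sizes (N = 10^6 candidate features, n = 600 selected features, k = 4 meta-features) chosen only for illustration; base-2 logarithms are assumed.

import math

N, n, k = 10**6, 600, 4                   # hypothetical sizes, for illustration only
d_s = k + 1                               # VC-dim of linear selectors over R^k

log2_choose = (math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)) / math.log(2)
meta_term = d_s * math.log2(math.e * N / d_s)

print(f"log2(N choose n)     ~ {log2_choose:,.0f}")   # the direct-selection term in (5.6)
print(f"d_s * log2(eN / d_s) ~ {meta_term:,.1f}")     # the meta-feature term in theorem 5.3

For these hypothetical values the first term is several thousand while the second is below a hundred, which is the source of the improvement when k is small.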

5.5 Inductive Transfer


Here we show how meta-features can be employed for inductive transfer ([10, 110, 16]). As explained in section 1.1.5 of the introduction, inductive transfer refers to the problem of applying the knowledge learned in one or more tasks to the learning of a new task (the target task). Here, in the context of feature selection, we are interested in better prediction of feature quality for a new task using its quality in other tasks. We assume that the different tasks use the same set of features. We also assume that at the stage where we are required to estimate feature quality, we have not yet found any instances of the new task. A straightforward prediction is to use the average quality of the feature over the previous tasks. However, we suggest using the quality, in previous tasks, of other features with similar properties, i.e. with similar values of the meta-features. This makes sense since usually it is not the exact same features that are good for different tasks; rather, good features of different tasks share similar properties. In practice this is done by first using the training tasks to learn a regression from meta-features to feature quality (similar to the procedure described in algorithm 5), and then using the regression to predict the quality of the features for the new task.
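The two prediction schemes compared below can be sketched as follows (our own code, not the thesis implementation; feat_quality stands for any per-feature quality measure such as Infogain):

import numpy as np

def transfer_quality(train_tasks, X_mf, feat_quality):
    """train_tasks: list of (X, y) pairs for previous tasks that share the same N features.
    X_mf: N x k meta-feature matrix. feat_quality(x, y): per-feature quality measure (e.g. Infogain).
    Returns two predictions of the feature qualities for a new, as yet unseen task."""
    # quality of every feature on every previous task, then averaged over the tasks
    Q = np.array([[feat_quality(X[:, j], y) for j in range(X.shape[1])] for X, y in train_tasks])
    avg_quality = Q.mean(axis=0)

    # (1) direct transfer: a feature is predicted to be as good as it was, on average, before
    direct = avg_quality

    # (2) indirect transfer: regress the average quality on the meta-features and predict from u(f)
    A = np.hstack([X_mf, np.ones((X_mf.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, avg_quality, rcond=None)
    indirect = A @ w
    return direct, indirect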

The notion of related tasks arises in many works on inductive transfer. A task is usually defined as related to the task of interest if using the knowledge gathered in that task improves performance on the current task. However, this notion is tricky, as it may depend on a variety of parameters such as the choice of the learning algorithm ([16]). Here we show that the level of improvement depends highly on the kind of knowledge we choose to transfer. It is much better to transfer which kind of features are good rather than which specific features are good. This is achieved by using meta-features.

5.5.1 Demonstration on Handwritten Digit Recognition


To demonstrate the use of meta-features for inductive transfer we consider the following collection of tasks: a task is defined by a pair of digits (e.g., 3 and 7), and the task is to classify images of these two digits. Thus we have (10 choose 2) = 45 different tasks to choose from. To avoid any possible overlap between the training tasks and the testing tasks, we first randomly divide the digits 0 . . . 9 into two disjoint sets of five digits. Then, we choose pairs of digits for the training tasks from the first set, and for the testing tasks from the second set. For each training task we calculate the quality of each feature (pixel) by Infogain, and take the average over all the training tasks. Then, we predict the quality of each feature for the test task in two different ways: (1) directly, as equal to the average quality on the training tasks; (2) indirectly, by learning a regression from the meta-features to the quality. Finally, we check the correlation coefficient between each of the two predicted qualities and the real quality, i.e., the Infogain calculated using many examples of the test task. To evaluate the statistical significance of our results, we repeat this experiment for all possible choices of training tasks and test task. We try two different kinds of meta-features: the exact (x, y) coordinates, and only the distance from the center of the image, r. The results are presented in Figure 5.5. We can see that the indirect prediction that uses the meta-features achieves much better prediction, and that using r as a meta-feature outperforms using (x, y) as meta-features. This also suggests that the relatedness of tasks depends on the kind of knowledge we choose to transfer and on the way we transfer it.

5.6 Choosing Good Meta-features


Choosing good meta-features for a given problem is obviously crucial. It is also worth inquiring why the problem of finding good meta-features is easier than the problem of finding good features. As we show in section 5.3.2, in the presence of a huge number of candidate features it is easier to guess which properties might be indicative of feature quality than to guess which exact features are good. In this section we give general guidelines on good choices of meta-features and of mappings from meta-feature values to the selection of features.

First, note that if there is no correlation between the meta-features and quality, the meta-features are useless. In addition, if any two features with the same value of the meta-features are redundant (highly correlated), we gain almost nothing from using a large set of them. In general, there is a trade-off between two desired properties:

1. Features with the same value of the meta-features have similar quality.

2. Low redundancy between features with the same value of the meta-features.

Figure 5.5: Inductive transfer by meta-features for OCR. The correlation coefficient between the predicted and real quality of features as a function of the number of training tasks, for direct estimation and for regression from meta-features (either the (x, y)-location or the radius). The results are averaged over all the possible choices of training tasks and test task repetitions. It is clear that indirect prediction (using meta-features) significantly outperforms the direct prediction.

When the number of features we select is small, we should not be overly concerned about redundancy, and should rather focus on choosing meta-features that are informative about quality. On the other hand, if we want to select many features, redundancy may be dominant, and this requires our attention. Redundancy can also be tackled by using a distribution over meta-features instead of a single point.

In order to demonstrate the above trade-off we carried out one more experiment using the MNIST dataset. We used the same kind of features as in section 5.3.2, but this time without shift invariance and with a fixed scatter. The task was to discriminate between 9 and 4, and we used 200 images as the training set and another 200 as the test set. Then we used Mufasa to select features, where the meta-features were either the (x, y)-location or the number of inputs. When the meta-feature was the (x, y)-location, the distribution over selected features, p(v|u), was uniform over a 4×4 window around the chosen location (step 2a in Mufasa). We then checked the classification error on the test set of a linear SVM (which uses the selected features). We repeated this experiment for different numbers of features.⁷ The results are presented in Figure 5.6. When we use a small number of features, it is better to use the (x, y)-location as a meta-feature, whereas when using many features it is better to use the number of inputs as a meta-feature. This supports our prediction about the redundancy-homogeneity trade-off. The (x, y)-locations of features are good indicators of their quality, but features from similar positions tend to be redundant. On the other hand, constraints on the number of inputs are less predictive of feature quality but do not cause redundancy.

⁷ We do not use shift invariance here, thus all the features have the same cost.

Figure 5.6: Different choices of meta-features. The generalization error as a function of the number of selected features. The two lines correspond to using different meta-features: (x, y)-location or number of inputs. The results of random selection of features are also presented.

5.7 Improving Selection of Training Features

The applications of meta-features presented so far involve predicting the quality of unseen features. However, the meta-features framework can also be used to improve the estimation of the quality of features that we do see in the training set. We suggest that instead of using the direct quality estimate, we use a regression function over the meta-feature space (as in algorithm 5). When we have only a few training instances, the direct approximation of feature quality is noisy; thus we expect that smoothing the direct measure by using a regression function of the meta-features may improve the approximation. One initial experiment we conducted in this way shows the potential of this method, but also raises some difficulties.

Once again we used 2000 images of the MNIST dataset with the 784 pixels as input features, and the (x, y) location as meta-features. We used algorithm 5 with Infogain and linear regression over an RBF representation (see section 5.2 for details) to produce a prediction of the feature quality, and compared it to the direct measure of the Infogain in the following way: we calculated the Infogain of each feature using all 2000 instances as a reference quality measure, and compared its correlation coefficient with the two alternatives. The results are presented in Figure 5.7. It is clear that when the number of training instances is small enough (below 100), the indirect prediction that uses meta-features outperforms the direct measure in predicting the real Infogain. However, there was no significant improvement in the classification error rate. We believe that this was due to the limitations of Infogain, which ignores the redundancy between features, and to the fact that with this choice of meta-features, features with similar values of the meta-features are usually redundant. It might be possible to solve this problem by a different choice of meta-features. Another, more fundamental, problem is that if, by chance, a good feature is constant on the training set, our indirect method may correctly predict that it is a good-quality feature, even though this feature is useless for classification using the current training set. The question of how to overcome this problem is still open.
Figure 5.7: Improving quality estimation from a few training instances for the OCR problem. The correlation coefficient of the direct Infogain and of the prediction using meta-features with the true Infogain, as a function of the number of training instances. It is clear that when there are less than 100 training instances, the indirect measure is better.

5.8 Summary
In this chapter we presented a novel approach to feature selection. Instead of just selecting a set of better features out of a given set, we suggest learning the properties of good features. We demonstrated how this technique may help in the task of feature extraction, or feature selection in the presence of a huge number of features, and in the setting of inductive transfer. We showed that when the selection of features is based on meta-features it is possible to derive better generalization bounds for the combined problem of selection and classification. We illustrated our approach on a digit recognition problem, but we also expect our methods to be very useful in many other domains. For example, in the problem of tissue classification according to a gene expression array, where each gene is one feature, ontology-based properties may serve as meta-features. In most cases in this domain there are many genes and very few training instances; therefore standard feature selection methods tend to over-fit and thus yield meaningless results ([31]). A meta-feature based selection can help, as it may reduce the complexity of the class of possible selections.

In section 5.3 we used meta-features to guide feature extraction. Our search for good features is computationally efficient and has good generalization properties because we do not examine each individual feature. However, avoiding the examination of individual features may also be considered a disadvantage, since we may include some useless individual features. This can be solved by using the meta-features guided search as a fast but rough filter for good features, and then applying more computationally demanding selection methods that examine each feature individually.

A Notation Table for Chapter 5

The following table summarizes the notation and definitions introduced in chapter 5 for quick reference.

Notation       Short description                                                        Sections
meta-feature   a property that describes a feature
k              number of meta-features
S              (abstract) instance space                                                5.1, 5.4.1
f              a feature; formally, f : S → R
c              a classification rule
S^m            a labeled set of m instances (a training set)                            5.1, 5.2, 5.4.1
u_i(f)         the value of the i-th meta-feature on feature f                          5.1, 5.3
u(f)           u(f) = (u_1(f), ..., u_k(f)), a vector that describes the feature f      5.1, 5.3, 5.4.1
u              a point (vector) in the meta-feature space                               5.1, 5.3
Q̂              a mapping from meta-feature values to feature quality                    5.2, 5.4.2
Y_MF           measured quality of the features (e.g. Infogain)                         5.2
X_MF           X_MF(i, j) = the value of the j-th meta-feature on the i-th feature      5.2
F, F_j         a set of features                                                        5.3
V              a random variable indicating which features are selected                 5.3
p(v|u)         the conditional distribution of V given meta-feature values (u)          5.3, 5.6
h_c            a classification hypothesis                                              5.4.1
H_c            the classification hypothesis class                                      5.4.1
d_c            the VC-dimension of H_c                                                  5.4.2
h_fs           a feature selection hypothesis: says which n features are selected       5.4.1
H_fs           the feature selection hypothesis class                                   5.4.1
h_s            a mapping from the meta-feature space to {0, 1}                          5.4.2
H_s            class of mappings from the meta-feature space to {0, 1}                  5.4.2
d_s            the VC-dimension of H_s
J              number of iterations Mufasa (algorithm 6) makes                          5.3, 5.4.1
|MF|           number of possible different values of the meta-features                 5.4.1


Chapter 6

Epilog

Learning is one of the most fascinating aspects of intelligence. The amazing learning ability of humans is probably the main thing that distinguishes humans from other animals. Machine learning tries to theoretically analyze learning processes and to mimic human-like learning abilities using computers. Many different methods have been suggested for learning, and it is widely agreed that given a "good" representation of the data, many of them can give good results. This observation is consistent with the popular saying that asking the question in the right way is half way to the solution. However, it is less clear how to build a good representation of the data, i.e., a representation which is both concise and reveals the structure relevant to the problem. One valid way to do so is first to measure (or produce) a large number of candidate features which have a high probability of containing "good" features but also contain many weakly relevant or noisy features that mask the better ones, and then to use feature selection techniques to select only some of them. Thus feature selection, the task of selecting a small number of features that enable efficient learning, deals with the fundamental question of finding good representations of data.

One thing to keep in mind is that feature selection is also a learning process in itself, and is therefore exposed to over-fitting. Thus, care should always be taken in analyzing the results, especially when the number of candidate features is huge and we only have a small number of training instances. In this case the noise in the measurement of feature quality may be too high and any selection algorithm that uses only the data will fail. Such a situation is common in the field of gene expression, as described by Ein-Dor et al. [31]. Therefore for these cases we need to adopt a different approach. One possible way is to use properties of the features, as we suggest in chapter 5. Another option is to use information extracted in other related tasks, as done in inductive transfer.

We believe that the selection algorithms we suggested in this thesis can work well on many problems, but it is important to understand that any selection algorithm is based on some assumptions. If these assumptions are violated the algorithm can fail. On the other hand, if a stronger assumption holds, another algorithm that relies on this stronger assumption might outperform the first one. For example, a method that ranks individual features by assigning a score to each feature independently assumes that complex dependencies on sets of features do not exist or are negligible. This assumption narrows the selection hypothesis space, and therefore allows generalization using fewer instances. Thus, if this assumption is true, we would expect such a ranking to work better than methods that do not assume it, i.e. methods that consider subsets and are able to reveal complex dependencies (as these methods search in a larger hypothesis space). However, we cannot expect such rankers to work well when this independence assumption is not true. To demonstrate this, the performance of SKS and RGS in figure 4.1 of chapter 4 provides a good example. SKS and RGS use the same evaluation function, but SKS considers each feature independently whereas RGS considers all the features at the same time. We can see that SKS outperforms RGS on problem (a), where there are no complex dependencies, but completely fails on problem (d), where complex dependencies exist.

Another example can be seen in chapter 3. Simba and G-flip are based on margins, and therefore implicitly assume that the Euclidean distance is meaningful for the given data. In the data we used for our experiments this assumption was true and they outperformed all the alternatives. However, on other kinds of data where this assumption is violated, Infogain, which does not use this assumption, might outperform Simba and G-flip. Thus, there is no single selection algorithm which is ideal for all problems. When data of a new kind emerge, we need to try many selection algorithms to find which is most suitable. Of course, we can also use some prior knowledge about the data to predict which selection algorithm is most adequate, if such prior knowledge exists. By analyzing the success of the different kinds of algorithms we can also discover some important properties of the data. An example of such an analysis is given in chapter 4. In section 4.4 of that chapter we use the fact that RGS works better than SKS on the motor control task to suggest that complex dependencies on sets of neurons or time bins are an important property of the neural code.
List of Publications

In order to keep this document reasonably sized, only a subset of the work I have done during

my studies is presented in this dissertation. Here is a complete list of my publications.

Journal Papers

• R. Gilad-Bachrach, A. Navot, and N. Tishby, A Study of The Information Bottleneck Method and its Relationship to Classical Problems. Under review for publication in IEEE Transactions on Information Theory.

• E. Krupka, A. Navot and N. Tishby, Learning to Select Features using their Properties.
Submitted to Journal of Machine Learning Research (JMLR).

Chapters in Refereed Books


• A. Navot, R. Gilad-Bachrach, Y. Navot and N. Tishby. Is Feature Selection Still

Necessary? In C. Saunders, M. Grobelnik, S. Gunn and J. Shawe-Taylor, editors,

Subspace, Latent Structure and Feature Selection. Springer-Verlag, 2006.

• R. Gilad-Bachrach, A. Navot and N. Tishby, Large margin principles for feature selection. In Feature extraction, foundations and applications, I. Guyon, S. Gunn, M. Nikravesh and L. Zadeh (eds.), Springer-Verlag, 2006.

• R. Gilad-Bachrach, A. Navot and N. Tishby, Connections with some classic IT problems. In Information Bottlenecks and Distortions: The emergence of relevant structure from data, N. Tishby and T. Gideon (eds.), MIT Press (in preparation).


Refereed Conferences
• R. Gilad-Bachrach, A. Navot and N. Tishby, Query By Committee made real, in Pro-
ceedings of the 19th Conference on Neural Information Processing Systems (NIPS),

2005.

• R. Gilad-Bachrach, A. Navot and N. Tishby, Bayes and Tukey meet at the center point, in Proceedings of the 17th Conference on Learning Theory (COLT), 2004.

• R. Gilad-Bachrach, A. Navot and N. Tishby, Margin based feature selection - theory and algorithms, in Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

• R. Gilad-Bachrach, A. Navot, and N. Tishby, An information theoretic tradeoff between complexity and accuracy, in Proceedings of the 16th Conference on Learning Theory (COLT), pp. 595-609, 2003.

• K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby, Margin analysis of the LVQ algorithm, in Proceedings of the 16th Conference on Neural Information Processing Systems (NIPS), 2002.

• A. Navot, L. Shpigelman, N. Tishby and E. Vaadia, Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity. Proc. 20th Conference on Neural Information Processing Systems, 2006.

Technical Reports

• R. Gilad-Bachrach, A. Navot, and N. Tishby, Kernel query by committee (KQBC),


technical report 2003-88, Leibniz Center, the Hebrew University, 2003.
Bibliography

[1] D. Aha and R. Bankert. Feature selection for case-based classification of cloud types: An empirical comparison, 1994.

[2] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37-66, 1991.

[3] R. F. Ahlswede and J. Korner. Source coding with side information and a converse for degraded broadcast channels. IEEE Transactions on Information Theory, 21(6):629-637, November 1975.

[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2, pages 547-552, Anaheim, California, 1991. AAAI Press.

[5] T. W. Anderson. Classification by multivariate analysis. Psychometrika, 16:31-50, 1951.

[6] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.

[7] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11.

[8] Jerzy Bala, J. Huang, Haleh Vafaie, Kenneth DeJong, and Harry Wechsler. Hybrid learning using genetic algorithms and decision trees for pattern classification. In IJCAI (1), pages 719-724, 1995.

[9] P. Bartlett. The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.

[10] Jonathan Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7-39, 1997.

[11] A. Bell and T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

[12] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.


[13] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 100:157–184, 1989.

[14] B. Boser, I. Guyon, and V. Vapnik. Optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.

[15] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.

[16] Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75, 1997.

[17] Rich Caruana and Dayne Freitag. Greedy attribute selection. In International Conference on Machine Learning, pages 28–36, 1994.

[18] G. C. Cawley. MATLAB support vector machine toolbox (v0.55β) [http://theoval.sys.uea.ac.uk/gcc/svm/toolbox]. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.

[19] Kevin J. Cherkauer and Jude W. Shavlik. Growing simpler decision trees to facilitate knowledge discovery. In Knowledge Discovery and Data Mining, pages 315–318, 1996.

[20] Shay Cohen, Eytan Ruppin, and Gideon Dror. Feature selection based on the Shapley value. In IJCAI, pages 665–670, 2005.

[21] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and selective sampling. Advances in Neural Information Processing Systems 2, 1990.

[22] T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.

[23] K. Crammer. MCSVM_1.0: C code for multiclass SVM, 2003. http://www.cis.upenn.edu/~crammer.

[24] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of the LVQ algorithm. In Proc. 17th Conference on Neural Information Processing Systems (NIPS), 2002.


[25] Ofer Dekel and Yoram Singer. Data-driven online to batch conversions. In

Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information


Processing Systems 18. MIT Press, Cambridge, MA, 2006.
[26] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1–38, 1977.

[27] L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2), 1978.

[28] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

[29] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd edition. Wiley-Interscience, 2000.

[30] L. Kasatkina, E. Kussul, T. Baidyk, and V. Lukovich. Rosenblatt perceptrons for handwritten digit recognition. In Int'l Joint Conference on Neural Networks, pages 1516–20, 2001.

[31] L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci U S A, 103(15):5923–5928, April 2006.

[32] E. Fix and J. Hodges. Discriminatory analysis: Nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, 1951.

[33] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[34] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.

[35] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. of Computers, 23(9):881–890, 1974.

[36] Jerome H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.

[37] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Artificial Intelligence, 40:11–61, 1989.


[38] R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, Hebrew University, 2006.

[39] R. Gilad-Bachrach, A. Navot, and N. Tishby. An information theoretic tradeoff between complexity and accuracy. In Proc. 16th Conference on Computational Learning Theory (COLT), pages 595–609, 2003.

[40] R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection - theory and algorithms. In Proc. 21st International Conference on Machine Learning (ICML), pages 337–344, 2004.

[41] R. Gilad-Bachrach, A. Navot, and N. Tishby. Large margin principles for feature

selection. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature


extraction, foundations and applications. Springer, 2006.

[42] R. Greiner. Using value of information to learn and classify under hard bud-

gets. NIPS 2005 Workshop on Value of Information in Inference, Learning and

Decision-Making, 2005.

[43] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.

[44] I. Guyon and S. Gunn. NIPS feature selection challenge. http://www.nipsfsc.ecs.soton.ac.uk/, 2003.

[45] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature extraction, foundations


and applications. Springer, 2006.

[46] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.


Springer, 2001.

[47] P. M. Hofman, J. G. A. Van Riswick, and A. Van Opstal. Relearning sound localization with new ears. Nature Neuroscience, 1:417–421, 1998.

[48] J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.

[49] J. Hua, Z. Xiong, and E. R. Dougherty. Determination of the optimal number of features for quadratic discriminant analysis via normal approximation to the discriminant distribution. Pattern Recognition, 38(3):403–421, March 2005.

[50] P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.

[51] J.P. Ignizio. Introduction to Expert Systems. McGraw-Hill, Inc., USA, 1991.

[52] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on the Theory of Computing, pages 604–613, 1998.

[53] A. K. Jain and W. G. Waller. On the monotonicity of the performance of Bayesian classifiers. IEEE Transactions on Information Theory, 24(3):392–394, May 1978.

[54] A. K. Jain and W. G. Waller. On the optimal number of features in the classification of multivariate Gaussian data. Pattern Recognition, 10:365–374, 1978.


[55] George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset selection problem. In International Conference on Machine Learning, pages 121–129, 1994. Journal version in AIJ, available at http://citeseer.nj.nec.com/13663.html.

[56] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[57] M.C. Jones and R. Sibson. What is projection pursuit? J. of the Royal Statistical Society, ser. A, 150:1–36, 1987.

[58] M. W. Kadous and C. Sammut. Classification of multivariate time series and structured data using constructive induction. Mach. Learn., 58:179–216, 2005.

[59] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[60] Michael J. Kearns and Umesh V. Vazirani. An introduction to computational learning theory. MIT Press, Cambridge, MA, USA, 1994.

[61] K. Kira and L. Rendell. A practical approach to feature selection. In Proc. 9th International Workshop on Machine Learning, pages 249–256, 1992.

[62] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[63] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.

[64] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In International Conference on Machine Learning, pages 284–292, 1996.

[65] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In Proc. ECML, pages 171–182, 1994.


[66] E. Krupka and N. Tishby. Generalization in clustering with unobserved features.

In NIPS, 2006.
[67] J. B. Kruskal. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2002.

[68] K. J. Lang and E. B. Baum. Query learning can work poorly when a human oracle is used. In Proceedings of the International Joint Conference on Neural Networks, pages 335–340, 1992.

[69] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.


[70] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[71] K. Levi, M. Fink, and Y. Weiss. Learning from a small number of training

examples by exploiting object categories. LCVPR04 workshop on Learning in

Computer Vision, 2004.

[72] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Al-


gorithms. PhD thesis, University of California Santa Cruz, 1989.

[73] H. Liu and R. Setiono. Feature selection and discretization. IEEE Trans. Knowledge and Data Eng., 9.

[74] Huan Liu, Farhad Hussain, Chew Lim Tan, and Manoranjan Dash. Discretization: An enabling technique. Data Min. Knowl. Discov., 6(4):393–423, 2002.

[75] D. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-Bayes classifiers. In UAI, 2003.


[76] George F. Luger. Articial Intelligence: Structures and Strategies for Complex
Problem Solving. Addison-Wesley Longman Publishing Co., Inc., Boston, MA,

USA, 2001.

[77] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.

[78] T. Marill and D. Green. On the effectiveness of receptors in recognition systems. IEEE Transactions on Information Theory, 9:11–17, 1963.

[79] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. In Artificial Intelligence Review, volume 11, pages 193–225, April 1997.

[80] A.M. Martinez and R. Benavente. The AR face database. Technical report, CVC Tech. Rep. #24, 1998.

[81] L. Mason, P. Bartlett, and J. Baxter. Direct optimization of margins improves generalization in combined classifiers. Advances in Neural Information Processing Systems, 11:288–294, 1999.

[82] A.J. Miller. Subset Selection in Regression. Chapman and Hall, 1990.

[83] Kensaku Mori, Hiroshi Nagao, and Yoshihiro Yoshihara. The olfactory bulb: Coding and processing of odor molecule information. Science, 286(5440):711–715, 1999.

[84] A. Navot, R. Gilad-Bachrach, Y. Navot, and N. Tishby. Is feature selection

still necessary? In C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor,

editors, Subspace, Latent Structure and Feature Selection. Springer-Verlag, 2006.


[85] A. Navot, L. Shpigelman, N. Tishby, and E. Vaadia. Nearest neighbor based feature selection for regression and its application to neural activity. In Proc. 20th Conference on Neural Information Processing Systems (NIPS), forthcoming 2006.

[86] A. Ng. Feature selection, L1 vs L2 regularization and rotational invariance. In Proc. 21st International Conference on Machine Learning (ICML), pages 615–622, 2004.

[87] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms


and Complexity. Prentice Hall, 1982.

[88] R. Paz, T. Boraud, C. Natan, H. Bergman, and E. Vaadia. Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8):882–890, August 2003.

[89] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.

[90] A. Perera, T. Yamanaka, A. Gutierrez-Galvez, B. Raman, and R. Gutierrez-Osuna. A dimensionality-reduction technique inspired by receptor convergence in the olfactory system. Sensors and Actuators B, 116(1-2):17–22, 2006.


[91] Bernhard Pfahringer. Compression-based feature subset selection, 1995.

[92] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flan-

nery. Numerical Recipes in C: The Art of Scientic Computing. Cambridge

University Press, Cambridge, UK, 1988.

[93] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.

[94] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[95] R. Raina, A.Y. Ng, and D. Koller. Constructing informative priors using transfer

learning. In Proc. Twenty-Third International Conference on Machine Learning,


2006.

[96] S. Raudys and V. Pikelis. On dimensionality, sample size, classification error and complexity of classification algorithm in pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(3):242–252, May 1980.

[97] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.

[98] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[99] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.

[100] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998.

[101] Thomas Serre, Maximilian Riesenhuber, Jennifer Louie, and Tomaso Poggio. On the role of object-specific features for real world object recognition in biological vision. In BMCV '02: Proceedings of the Second International Workshop on Biologically Motivated Computer Vision, pages 387–397, London, UK, 2002. Springer-Verlag.

[102] C. E. Shannon. A mathematical theory of communication. Bell System Technical


Journal, 27, July and October 1948.
[103] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

[104] P. Simard, Y. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition - tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274, 1996.

[105] P. Y. Simard, Y. A. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 50–58. Morgan Kaufmann, 1993.

[106] David B. Skalak. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In International Conference on Machine Learning, pages 293–301, 1994.

[107] G. W. Snedecor and W. G. Cochran. Statistical Methods. Iowa State University Press, Berlin, Heidelberg, New York, 1989.

[108] B. Taskar, M. F. Wong, and D. Koller. Learning on the test data: Leveraging

unseen features. In ICML, 2003.


[109] D. M. Taylor, S. I. Tillery, and A. B. Schwartz. Direct cortical control of 3D neuroprosthetic devices. Science, 296(7):1829–1832, 2002.

[110] S. Thrun. Is learning the n-th thing any easier than learning the first? In Proc. 15th Conference on Neural Information Processing Systems (NIPS) 8, pages 640–646. MIT Press, 1996.

[111] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proc. 37th Annual Allerton Conf. on Communication, Control and Computing, pages 368–377, 1999.

[112] Godfried Toussaint. Proximity graphs for nearest neighbor decision rules: Recent progress.

[113] G. V. Trunk. A problem of dimensionality: a simple example. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(3):306–307, July 1979.

[114] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7):682–687, 2002.

[115] H. Vafaie and K. DeJong. Robust feature selection algorithms. Proc. 5th Intl. Conf. on Tools with Artificial Intelligence, pages 356–363, 1993.
[116] V. Vapnik. The Nature Of Statistical Learning Theory. Springer-Verlag, 1995.

[117] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[118] V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
[119] J. Weston, A. Elissee, G. BakIr, and F. Sinz. The spider, 2004.

http://www.kyb.tuebingen.mpg.de/bs/people/spider/.

[120] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Proc. 15th NIPS, pages 668–674, 2000.

[121] A. D. Wyner. On source coding with side information at the decoder. IEEE Transactions on Information Theory, 21(3):294–300, May 1975.
[Hebrew title page and abstract. The title page carries the Hebrew rendering of the thesis title ("On Feature Selection in Machine Learning"), names Amir Navot as the author, records submission to the Senate of the Hebrew University, Jerusalem, in 2006, and notes that the work was carried out under the supervision of Prof. Naftali Tishby. The Hebrew abstract parallels the English abstract: it introduces machine learning and supervised learning, discusses the representation problem, lists the four main motivations for feature selection (reducing computational complexity, saving future measurement costs, improving prediction accuracy, and understanding the problem), surveys wrapper and filter approaches together with forward selection and backward elimination, and then summarizes the results of Chapters 2-5: the analysis of whether a separate feature selection stage is still necessary, the margin-based selection algorithms G-flip and Simba, nearest-neighbor-based feature selection for regression and its application to predicting hand movement velocity from neural activity, and learning to select features from their meta-features, including inductive transfer, demonstrated on handwritten digit recognition.]