The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale
The problem of understanding intelligence
is said to be the greatest problem in science today and “the” problem for this
century—as deciphering the genetic code
was for the second half of the last one.
Arguably, the problem of learning represents a
gateway to understanding intelligence in brains
and machines, to discovering how the human brain
works, and to making intelligent machines that
learn from experience and improve their competences as children do. In engineering, learning techniques would make it possible to develop software
that can be quickly customized to deal with the increasing amount of information and the flood of
data around us.
Examples abound. During the last decades, experiments in particle physics have produced a very
large amount of data. Genome sequencing is doing
the same in biology. The Internet is a vast repository of disparate information which changes rapidly
and grows at an exponential rate: it is now significantly more than 100 terabytes, while the Library
of Congress is about 20 terabytes.
We believe that a set of techniques based on a
new area of science and engineering becoming
known as "supervised learning" will become a key
technology to extract information from the ocean
of bits around us and to make sense of it.

Tomaso Poggio is Eugene McDermott Professor at the McGovern Institute and Artificial Intelligence Lab, MIT. His email address is [email protected]. Steve Smale is emeritus professor of mathematics at the University of California, Berkeley, and also holds a position at the Toyota Technological Institute at the University of Chicago. His email address is [email protected].
Supervised learning, or learning from examples,
refers to systems that are trained instead of programmed with a set of examples, that is, a set of
input-output pairs. Systems that could learn from
example to perform a specific task would have
many applications. A bank may use a program to
screen loan applications and approve the “good”
ones. Such a system would be trained with a set of
data from previous loan applications and the experience with their defaults. In this example, a loan
application is a point in a multidimensional space
of variables characterizing its properties; its associated output is a binary “good” or “bad” label.
In another example, a car manufacturer may
want to have in its models a system to detect pedestrians about to cross the road to alert drivers to a
possible danger while driving in downtown traffic.
Such a system could be trained with positive and
negative examples: images of pedestrians and images without pedestrians. In fact, software trained
in this way with thousands of images has been recently tested in an experimental car from Daimler.
It runs on a PC in the trunk and looks at the road
in front of the car through a digital camera [1].
Algorithms have been developed that can produce
a diagnosis of the type of cancer from a set of measurements of the expression level of many thousands
of human genes in a biopsy of the tumor measured
with a cDNA microarray containing probes for a
number of genes [1]. Again, the software learns the
classification rule from a set of examples, that is,
from examples of expression patterns in a number
of patients with known diagnoses. The challenge
in this case is the high dimensionality of the input
space—on the order of 20,000 genes—and the
small number of examples available for training—
around 50. In the future, similar learning techniques may be capable of some learning of a language and, in particular, of translating information
from one language to another.
What we assume in the above examples is a machine that is trained instead of programmed to
perform a task, given data of the form $(x_i, y_i)_{i=1}^m$.
Training means synthesizing a function that best
represents the relation between the inputs xi and
the corresponding outputs yi . The central question
of learning theory is how well this function generalizes, that is, how well it estimates the outputs
for previously unseen inputs.
As we will see more formally later, learning techniques are similar to fitting a multivariate function
to a certain number of measurement data. The key
point, as we just mentioned, is that the fitting
should be predictive in the same way that fitting
experimental data (see Figure 1) from an experiment
in physics can in principle uncover the underlying
physical law, which is then used in a predictive way.
In this sense, learning is also a principled method
for distilling predictive and therefore scientific
“theories” from the data.
We begin by presenting a simple "regularization" algorithm which is important in learning theory and its applications. We then outline briefly some of its applications and its performance. Next we provide a compact derivation of it. We then provide general theoretical foundations of learning theory. In particular, we outline the key ideas of decomposing the generalization error of a solution of the learning problem into a sample and an approximation error component. Thus both probability theory and approximation theory play key roles in learning theory. We apply the two theoretical bounds to the algorithm and describe for it the tradeoff—which is key in learning theory and its applications—between number of examples and complexity of the hypothesis space. We conclude with several remarks, both with an eye to history and to open problems for the future.

A Key Algorithm

Figure 1. How can we learn a function which is capable of generalization—among the many functions which fit the examples equally well (here m = 7)?
The Algorithm
How can we fit the “training” set of data
$S_m = (x_i, y_i)_{i=1}^m$ with a function $f : X \to Y$ (where X is a closed subset of $\mathbb{R}^n$ and $Y \subset \mathbb{R}$) that generalizes, i.e., is predictive? Here is an algorithm which
does just that and which is almost magical for its
simplicity and effectiveness:
1. Start with data $(x_i, y_i)_{i=1}^m$.
2. Choose a symmetric, positive-definite function $K_x(x') = K(x, x')$, continuous on $X \times X$. A kernel $K(t, s)$ is positive definite if $\sum_{i,j=1}^{n} c_i c_j K(t_i, t_j) \ge 0$ for any $n \in \mathbb{N}$ and any choice of $t_1, \dots, t_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$. An example of such a Mercer kernel is the Gaussian

(1)   $K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2}$

restricted to $X \times X$.
3. Define $f : X \to Y$ by

(2)   $f(x) = \sum_{i=1}^{m} c_i K_{x_i}(x),$

where $c = (c_1, \dots, c_m)$ and

(3)   $(m\gamma I + K)\, c = y,$

where I is the identity matrix, K is the square positive-definite matrix with elements $K_{i,j} = K(x_i, x_j)$, and y is the vector with coordinates $y_i$. The parameter γ is a positive, real number.
The linear system of equations (3) in m variables
is well-posed, since K is positive and (mγI + K) is
strictly positive. The condition number is good if
mγ is large. This type of system of equations has
been studied since Gauss, and the algorithms for
solving it efficiently represent one of the most developed areas in numerical and computational
analysis.
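A quick numerical illustration of this point (a sketch with a Gaussian kernel on arbitrary sample points of our own choosing) shows the condition number of mγI + K improving as mγ grows:

```python
import numpy as np

# Illustrative check (our own toy setup): condition number of (m*gamma*I + K)
# for a Gaussian Gram matrix K on m = 50 random one-dimensional points.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))  # sigma = 0.3
m = len(x)
for gamma in (1e-6, 1e-3, 1e-1):
    print(gamma, np.linalg.cond(m * gamma * np.eye(m) + K))
```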
What does equation (2) say? In the case of a
Gaussian kernel, the equation approximates the unknown function by a weighted superposition of
Gaussian “blobs”, each centered at the location xi
of one of the m examples. The weight ci of each
Gaussian is such as to minimize a regularized empirical error, that is, the error on the training set.
The σ of the Gaussian (together with γ, see later)
controls the degree of smoothing, of noise tolerance, and of generalization. Notice that for Gaussians with σ → 0 the representation of equation (2)
effectively becomes a “look-up” table that cannot
generalize (it provides the correct y = yi only when
x = xi and otherwise outputs 0 ).
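As a concrete illustration, here is a minimal Python/NumPy sketch of steps 1-3 (the function names, parameter values, and the toy one-dimensional data are our own choices, not part of the algorithm's description):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """The Mercer kernel of equation (1): K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def fit(X, y, gamma, sigma):
    """Step 3: solve the linear system (m*gamma*I + K) c = y of equation (3)."""
    m = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(m * gamma * np.eye(m) + K, y)

def predict(X_train, c, X_new, sigma):
    """Evaluate f(x) = sum_i c_i K_{x_i}(x), equation (2)."""
    return gaussian_kernel(X_new, X_train, sigma) @ c

# Toy data (our own choice): noisy samples of an unknown one-dimensional function.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(30)

c = fit(X, y, gamma=1e-3, sigma=0.3)
X_test = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(predict(X, c, X_test, sigma=0.3))
```

For larger problems one would factor the symmetric positive-definite matrix mγI + K (for instance by Cholesky) rather than calling a generic dense solver as this sketch does.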
Performance and Examples
The algorithm performs well in a number of applications involving regression as well as binary
classification. In the latter case the $y_i$ of the training set $(x_i, y_i)_{i=1}^m$ take the values $\{-1, +1\}$; the predicted label is then $-1$ or $+1$, depending on the sign
of the function f of equation (2).
Regression applications are the oldest. Typically they involved fitting data in a small number
of dimensions [1]. More recently, they also included
typical learning applications, sometimes with a
very high dimensionality. One example is the use
of an algorithm in computer graphics for synthesizing new images [1]. The inverse problem of
estimating facial expression and object pose from
an image is another successful application [1]. Still
another case is the control of mechanical arms.
There are also applications in finance, as, for
instance, the estimation of the price of derivative
securities, such as stock options. In this case the
algorithm replaces the classical Black-Scholes
equation (derived from first principles) by learning the map from an input space (volatility,
underlying stock price, time to expiration of the
option, etc.) to the output space (the price of the
option) from historical data [1].
Binary classification applications abound. The
algorithm was used to perform binary classification on a number of problems [1]. It was also used
to perform visual object recognition in a
view-independent way and in particular face recognition and sex categorization from face images [1].
Other applications span bioinformatics for classification of human cancer from microarray data,
text summarization, and sound classification.¹

¹The very closely related Support Vector Machine (SVM) classifier was used for the same family of applications and in particular for bioinformatics and for face recognition and car and pedestrian detection [1].
Surprisingly, it has been realized quite recently
that the same linear algorithm not only works well
but is fully comparable in binary classification
problems to the most popular classifiers of today
(that turn out to be of the same family; see later).
Derivation
The algorithm described can be derived from
Tikhonov regularization. To find the minimizer of
the error, we may try to solve the problem—called
Empirical Risk Minimization (ERM)—of finding the
function in H that minimizes
$\frac{1}{m}\sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2,$
which is in general ill-posed, depending on the
choice of the hypothesis space H . Following
Tikhonov (see for instance [8]), we minimize instead
over the hypothesis space $H_K$, for a fixed positive parameter γ, the regularized functional

(4)   $\frac{1}{m}\sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma\, \|f\|_K^2,$

where $\|f\|_K^2$ is the norm in $H_K$, the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel K. The last term in equation (4)—called the regularizer—
forces, as we will see, smoothness and uniqueness
of the solution.
Let us first define the norm $\|f\|_K^2$. Consider the space of the linear span of the functions $K_{x_j}$. We use $x_j$ to emphasize that the elements of X used in this construction do not have anything to do in general with the training set $(x_i)_{i=1}^m$. Define an inner product in this space by setting $\langle K_x, K_{x_j} \rangle = K(x, x_j)$ and extending linearly to $\sum_{j=1}^{r} a_j K_{x_j}$. The completion of the space in the associated norm is the RKHS, that is, a Hilbert space $H_K$ with the norm $\|f\|_K^2$ (see [4]). Note that $\langle f, K_x \rangle = f(x)$ for $f \in H_K$ (just let $f = K_{x_j}$ and extend linearly).
To minimize the functional in equation (4), we take the functional derivative with respect to f, apply it to an arbitrary element g of the RKHS, and set it equal to 0. We obtain

(5)   $\frac{1}{m}\sum_{i=1}^{m} (y_i - f(x_i))\, g(x_i) - \gamma\, \langle f, g \rangle = 0.$

Equation (5) must be valid for any g. In particular, setting $g = K_x$ gives

(6)   $f(x) = \sum_{i=1}^{m} c_i K_{x_i}(x),$

where

(7)   $c_i = \frac{y_i - f(x_i)}{m\gamma},$

since $\langle f, K_x \rangle = f(x)$. Equation (3) then follows by substituting equation (6) into equation (7).
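Spelling out that substitution (a short check using the notation already introduced): equation (7) says $m\gamma\, c_i = y_i - f(x_i)$, and by equation (6) $f(x_i) = \sum_{j=1}^{m} c_j K(x_i, x_j) = (Kc)_i$, so
\[
m\gamma\, c_i + (Kc)_i = y_i \qquad (i = 1, \dots, m),
\]
which in vector form is exactly $(m\gamma I + K)\, c = y$, equation (3).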
Notice also that essentially the same derivation
for a generic loss function $V(y, f(x))$, instead of $(f(x) - y)^2$, yields the same equation (6), but equation (3) is now different and, in general, nonlinear,
depending on the form of V . In particular, the
popular Support Vector Machine (SVM) regression
and SVM classification algorithms correspond
to special choices of nonquadratic V : one to provide a “robust” measure of error and the other to
approximate the ideal loss function corresponding
to binary (mis)classification. In both cases the
solution is still of the same form as equation (6) for
any choice of the kernel K. The coefficients ci are
no longer given by equation (7) but must be found
by solving a quadratic programming problem.
Theory
We give some further justification of the algorithm
by sketching very briefly its foundations in some
basic ideas of learning theory.
Here the data $(x_i, y_i)_{i=1}^m$ is supposed random so
that there is an unknown probability measure ρ on
the product space X × Y from which the data is
drawn.
This measure ρ defines a function

(8)   $f_\rho : X \to Y$

satisfying $f_\rho(x) = \int y \, d\rho_x$, where $\rho_x$ is the conditional measure on $x \times Y$.
From this construction fρ can be said to be
the true input-output function reflecting the
environment which produces the data. Thus a
measurement of the error of f is

(9)   $\int_X (f - f_\rho)^2 \, d\rho_X,$

where $\rho_X$ is the measure on X induced by ρ (sometimes called the marginal measure).
The goal of learning theory might be said to
“find” f minimizing this error. Now to search for
such an f , it is important to have a space H —the
hypothesis space—in which to work (“learning
does not take place in a vacuum”). Thus consider
a convex space of continuous functions f : X → Y
(remember Y ⊂ R ) that as a subset of C(X) is
compact, where C(X) is the Banach space of
continuous functions with $\|f\| = \max_X |f(x)|$.

A basic example is

(10)   $H = I_K(B_R),$

where $I_K : H_K \to C(X)$ is the inclusion and $B_R$ is the ball of radius R in $H_K$.
Starting from the data $(x_i, y_i)_{i=1}^m = z$, one may minimize $\frac{1}{m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$ over $f \in H$ to obtain a unique hypothesis $f_z : X \to Y$. This $f_z$ is called the empirical optimum, and we may focus on the problem of estimating

(11)   $\int_X (f_z - f_\rho)^2 \, d\rho_X.$
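To make these objects concrete, here is a small numerical sketch (entirely our own construction): the measure ρ, and hence $f_\rho$, are chosen explicitly, $f_z$ is obtained by minimizing the empirical error over a toy hypothesis space (polynomials of degree at most 5), and the integral in (11) is estimated by Monte Carlo over fresh samples from $\rho_X$:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_rho(x):
    # The regression function of the toy measure rho chosen below: f_rho(x) = E[y | x].
    return np.sin(3 * x)

def sample(m):
    # Draw (x_i, y_i) from rho: x uniform on [-1, 1], y = f_rho(x) + Gaussian noise.
    x = rng.uniform(-1.0, 1.0, m)
    return x, f_rho(x) + 0.2 * rng.standard_normal(m)

# Empirical optimum f_z: least-squares fit over polynomials of degree <= 5 (our toy H).
x_train, y_train = sample(50)
coeffs = np.polyfit(x_train, y_train, deg=5)
f_z = lambda x: np.polyval(coeffs, x)

# Monte Carlo estimate of (11): average of (f_z - f_rho)^2 over fresh x drawn from rho_X.
x_fresh = rng.uniform(-1.0, 1.0, 100_000)
print("estimated generalization error:", np.mean((f_z(x_fresh) - f_rho(x_fresh)) ** 2))
```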
It is useful towards this end to break the problem into steps by defining a "true optimum" $f_H$ relative to H by taking the minimum over H of $\int_X (f - f_\rho)^2$. Thus we may exhibit

(12)   $\int_X (f_z - f_\rho)^2 = S(z, H) + \int_X (f_H - f_\rho)^2 = S(z, H) + A(H),$

where
(13)   $S(z, H) = \int_X (f_z - f_\rho)^2 - \int_X (f_H - f_\rho)^2.$
The first term, S, on the right-hand side of equation (12) must be estimated in probability over z, and the estimate is called the sample error (sometimes also the estimation error). It is naturally studied in the theory of probability and of empirical processes [7]. The second term, A, is dealt with
via approximation theory (for a review see [6] and
also [10], [1]) and is called the approximation error.
The decomposition of equation (12) is indirectly related to the well-known bias and variance decomposition in statistics.
Sample Error
First consider an estimate for the sample error,
which will have the form
(14)   $S(z, H) \le \epsilon$

with high confidence, this confidence depending on ε and on the sample size m.
Let us be more precise. Recall that the covering
number Cov#(H , η) is the number of balls in H
of radius η needed to cover H .
Theorem 1. Suppose $|f(x) - y| \le M$ for all $f \in H$, for almost all $(x, y) \in X \times Y$. Then

$\mathrm{Prob}_{z \in (X \times Y)^m}\,\{ S(z, H) \le \epsilon \} \ge 1 - \delta,$

where $\delta = \mathrm{Cov}\#\!\left(H, \tfrac{\epsilon}{24M}\right) e^{-\frac{m\epsilon}{288 M^2}}$.
The result is Theorem C ∗ of [4], but earlier versions (usually without a topology on H ) have been
proved by others, especially Vapnik, who formulated the notion of VC dimension to measure the
complexity of the hypothesis space for the case of
{0, 1} functions.
In a typical case of Theorem 1 the hypothesis
space H is taken to be as in equation (10), where
BR is the ball of radius R in an RKHS with a smooth
K (or in a Sobolev space). In this context R plays
an analogous role to VC dimension [16]. Estimates
for the covering numbers in these cases were provided by Cucker, Smale, and Zhou [1].
The proof of Theorem 1 starts from the
Hoeffding inequality (which can be regarded as
an exponential version of Chebyshev’s inequality
of probability theory). One applies this estimate
to the function X × Y → R which takes (x, y) to
(f (x) − y)2 . Then extending the estimate to the set
of f ∈ H introduces the covering number into
the picture. With a little more work, Theorem 1 is
obtained.
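To get a feel for Theorem 1, the following sketch plugs illustrative numbers into the bound (the covering number, M, and ε below are placeholders of our own choosing, not estimates for any particular hypothesis space):

```python
import math

def theorem1_confidence(cov_number, M, eps, m):
    """1 - delta, with delta = Cov#(H, eps/24M) * exp(-m*eps / (288*M^2)) as in Theorem 1."""
    delta = cov_number * math.exp(-m * eps / (288.0 * M ** 2))
    return 1.0 - delta

# Placeholder values (ours): Cov#(H, eps/24M) = 1e6, M = 1, eps = 0.1.
for m in (10_000, 100_000, 1_000_000):
    conf = theorem1_confidence(cov_number=1e6, M=1.0, eps=0.1, m=m)
    print(m, conf)  # for small m the bound is vacuous (below 0); it improves rapidly with m
```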
Approximation Error
The approximation error $\int_X (f_H - f_\rho)^2$ may be studied as follows.

Suppose $B : L^2 \to L^2$ is a compact, strictly positive (selfadjoint) operator. Then let E be the Hilbert space

$\{ g \in L^2,\ \|B^{-s} g\| < \infty \}$

with inner product $\langle g, h \rangle_E = \langle B^{-s} g, B^{-s} h \rangle_{L^2}$. Suppose moreover that $E \to L^2$ factors as $E \to C(X) \to L^2$ with the inclusion $J_E : E \hookrightarrow C(X)$ well defined and compact.

Let H be $J_E(B_R)$ when $B_R$ is the ball of radius R in E. A theorem on the approximation error is

Theorem 2. Let $0 < r < s$ and H be as above. Then

$\| f_\rho - f_H \|^2 \le \left(\tfrac{1}{R}\right)^{\frac{2r}{s-r}} \| B^{-r} f_\rho \|^{\frac{2s}{s-r}}.$
We now use $\|\cdot\|$ for the norm in the space of square-integrable functions on X, with measure $\rho_X$. For our main example of RKHS, take $B = L_K^{1/2}$, where K is a Mercer kernel,

(15)   $L_K f(x) = \int_X f(x') K(x, x'),$

and we have taken the square root of the operator $L_K$. In this case E is $H_K$ as above.

Details and proofs may be found in [4] and in [15].
Sample and Approximation Error for the
Regularization Algorithm
The previous discussion depends upon a compact
hypothesis space H from which the experimental
optimum fz and the true optimum fH are taken. In
the key algorithm of the preceding section, the
optimization is done over all f ∈ HK with a regularized error function. The error analysis of the
preceding two subsections must therefore be
extended.
Thus let $f_{\gamma,z}$ be the empirical optimum for the regularized problem as exhibited in equation (4):

$\frac{1}{m}\sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma\, \|f\|_K^2.$
Then

(16)   $\int (f_{\gamma,z} - f_\rho)^2 \le S(\gamma) + A(\gamma),$

where A(γ) (the approximation error in this context) is

(17)   $A(\gamma) = \gamma^{1/2}\, \| L_K^{-1/4} f_\rho \|^2$

and the sample error is

(18)   $S(\gamma) = \frac{32 M^2 (\gamma + C)^2}{\gamma^2}\, v^*(m, \delta),$
where $v^*(m, \delta)$ is the unique solution of

(19)   $\frac{m}{4} v^3 - \ln\!\left(\frac{4m}{\delta}\right) v - c_1 = 0.$

Here $C, c_1 > 0$ depend only on X and K. For the proof one reduces to the case of compact H and applies Theorems 1 and 2. Thus finding the optimal solution is equivalent to finding the best tradeoff
between A and S for a given m. In our case, this
bias-variance problem is to minimize S(γ) + A(γ)
over γ > 0 . There is a unique solution—a best γ—
for the choice in equation (4). For this result and
its consequences see [5].
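A numerical sketch of this tradeoff is easy to set up; in the code below the constants M, C, c1 and the norm appearing in (17) are entirely illustrative placeholders of our own choosing (they depend on the unknown ρ and on K):

```python
import numpy as np

def v_star(m, delta, c1):
    """Unique positive root of (m/4) v^3 - ln(4m/delta) v - c1 = 0, equation (19)."""
    roots = np.roots([m / 4.0, 0.0, -np.log(4.0 * m / delta), -c1])
    return max(r.real for r in roots if abs(r.imag) < 1e-12 and r.real > 0)

def sample_error(gamma, m, delta, M=1.0, C=1.0, c1=1.0):
    """S(gamma) of equation (18)."""
    return 32.0 * M**2 * (gamma + C)**2 / gamma**2 * v_star(m, delta, c1)

def approx_error(gamma, norm_term=1.0):
    """A(gamma) of equation (17); norm_term stands in for ||L_K^{-1/4} f_rho||^2."""
    return np.sqrt(gamma) * norm_term

m, delta = 10_000, 0.05
gammas = np.logspace(-4, 2, 200)
total = [sample_error(g, m, delta) + approx_error(g) for g in gammas]
print("best gamma for this (illustrative) setting:", gammas[int(np.argmin(total))])
```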
Remarks
The Tradeoff between Sample Complexity and
Hypothesis Space Complexity
For a given fixed hypothesis space H , only the
sample error component of the error of fz can be
controlled (in equation (12) only S(z, H ) depends
on the data). In this view, convergence of S to zero
as the number of data points increases (Theorem 1)
is then the central problem in learning. Vapnik
called consistency of ERM (i.e., convergence of the
empirical error to the true error) the key problem
in learning theory, and in fact much modern
work has focused on refining the necessary and
sufficient conditions for consistency of ERM (the
uniform Glivenko-Cantelli property of H , finite Vγ
dimension for γ > 0 , etc.; see [8]). More generally,
however, there is a tradeoff between minimizing
the sample error and minimizing the approximation error—what we referred to as the bias-variance
problem. Increasing the number m of data points
decreases the sample error. The effect of increasing the complexity of the hypothesis space is
trickier. Usually the approximation error decreases
but the sample error increases. This means that
there is an optimal complexity of the hypothesis
space for a given number of training data. In the
case of the regularization algorithm described in
this paper, this tradeoff corresponds to an optimum
value for γ as studied in [3], [5], [11]. In empirical
work, the optimum value is often found through
cross-validation techniques [18].
This tradeoff between approximation error and
sample error is probably the most critical issue in
determining good performance on a given problem.
The class of regularization algorithms, such as
equation (4), shows clearly that it is also a tradeoff—
quoting Girosi—between the curse of dimensionality
(not enough examples) and the blessing of smoothness
(which decreases the effective “dimensionality”, i.e.,
the complexity of the hypothesis space) through the
parameter γ.
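In practice the cross-validation just mentioned can be carried out with standard tools. A minimal sketch using scikit-learn (our choice of library; its KernelRidge estimator solves (K + αI)c = y, so its parameter α plays the role of mγ in equation (3)):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Toy one-dimensional regression data (our own construction, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

grid = GridSearchCV(
    KernelRidge(kernel="rbf", gamma=10.0),        # here "gamma" is the RBF kernel width, not the regularization parameter
    param_grid={"alpha": np.logspace(-6, 2, 9)},  # candidate values of m*gamma
    cv=5,                                         # 5-fold cross-validation
)
grid.fit(X, y)
print("selected regularization (m*gamma):", grid.best_params_["alpha"])
```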
The Regularization Algorithm and Support
Vector Machines
There is nothing to stop us from using the algorithm we described in this paper—that is, square
loss regularization—for binary classification.
Whereas SVM classification arose from using—with
binary y—the loss function

$V(y, f(x)) = (1 - y f(x))_+,$

we can perform least-squares regularized classification via the loss function

$V(y, f(x)) = (f(x) - y)^2.$

This classification scheme was used at least as early as 1989 (for reviews see [13]) and then rediscovered by many others (see [1]), including Mangasarian (who refers to square loss regularization as "proximal vector machines") and Suykens (who uses the name "least square SVMs"). Rifkin [14] has confirmed the interesting empirical results by Mangasarian and Suykens: "classical" square loss regularization works well also for binary classification (examples are in Tables 1 and 2).

Table 1. A comparison of SVM and RLSC (Regularized Least Squares Classification) accuracy on a multiclass classification task (the 20newsgroups dataset with 20 classes and high dimensionality, around 50,000), performed using the standard "one versus all" scheme based on the use of binary classifiers. The top row indicates the number of documents/class used for training. Entries in the table are the fraction of misclassified documents. From [14].

  Documents/class:   800     250     100
  SVM                0.131   0.167   0.214
  RLSC               0.129   0.165   0.211

Table 2. A comparison of SVM and RLSC accuracy on another multiclass classification task (the sector105 dataset, consisting of 105 classes with dimensionality about 50,000). The top row indicates the number of documents/class used for training. Entries in the table are the fraction of misclassified documents. From [14].

  Documents/class:   52      20      10
  SVM                0.072   0.176   0.341
  RLSC               0.066   0.169   0.335
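As a small illustration of the same point (not a reproduction of the experiments reported in Tables 1 and 2), the following sketch compares a square-loss regularized classifier with an SVM on synthetic two-dimensional data using scikit-learn; every dataset and parameter choice here is our own:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
y_pm = 2 * y - 1  # labels in {-1, +1}, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, test_size=0.5, random_state=0)

# RLSC: square-loss regularization; predicted label = sign of the real-valued output.
rlsc = KernelRidge(kernel="rbf", gamma=1.0, alpha=0.1).fit(X_tr, y_tr)
rlsc_acc = np.mean(np.sign(rlsc.predict(X_te)) == y_te)

# SVM: hinge loss (1 - y f(x))_+, with the same Gaussian kernel.
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr)
svm_acc = np.mean(svm.predict(X_te) == y_te)

print(f"RLSC accuracy: {rlsc_acc:.3f}   SVM accuracy: {svm_acc:.3f}")
```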
In references to supervised learning, the Support
Vector Machine method is often described (see
for instance R. M. Karp’s article in the May 2002
issue of the Notices) according to the “traditional”
approach, introduced by Vapnik and followed by
almost everybody else. In this approach, one starts
with the concepts of separating hyperplanes and
margin. Given the data, one searches for the linear
hyperplane that separates the positive and the negative examples, assumed to be linearly separable,
with the largest margin (the margin is defined as
the distance from the hyperplane to the nearest
example). Most articles and books follow this
approach, go from the separable to the nonseparable case, and use a so-called “kernel trick” (!) to
extend it to the nonlinear case. SVM for classification was introduced by Vapnik in the linear, separable case in terms of maximizing the margin. In
the nonseparable case, the margin motivation
loses most of its meaning. A more general and
simpler framework for deriving SVM algorithms
for classification and regression is to regard them
as special cases of regularization and follow the
treatment of the section above on the key algorithm. In the case of linear functions f (x) = w · x
and separable data, maximizing the margin is
exactly equivalent to maximizing $1/\|w\|$, which is in turn equivalent to minimizing $\|w\|^2$, which corresponds to minimizing the RKHS norm.
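To make the equivalence explicit (a standard argument, not spelled out in the text): under the usual canonical normalization $\min_i y_i (w \cdot x_i) = 1$ for separable data, the distance from the hyperplane $w \cdot x = 0$ to the nearest example is
\[
\min_i \frac{y_i\,(w \cdot x_i)}{\|w\|} = \frac{1}{\|w\|},
\]
so maximizing the margin is the same as minimizing $\|w\|$, hence $\|w\|^2$, which for the linear kernel $K(x, x') = x \cdot x'$ is the RKHS norm of $f(x) = w \cdot x$.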
The Regularization Algorithm and Learning
Theory
The Mercer theorem was introduced in learning
theory by Vapnik; RKHS were explicitly introduced
in learning theory by Girosi and later by Vapnik [1],
[16]. Poggio and Girosi [13], [1] had introduced
Tikhonov regularization in learning theory. Earlier,
Gaussian Radial Basis Functions were proposed
as an alternative to neural networks by Broomhead
and Lowe. Of course, RKHS had been pioneered
by Parzen and Wahba ([12], [18]; for a review see [18])
for applications closely related to learning, including data smoothing (for image processing and
computer vision, see [1]).
A Bayesian Interpretation
The learning algorithm equation (4) has an interesting Bayesian interpretation [18]: the data term—that
is, the first term with the quadratic loss function—is
a model of (Gaussian, additive) noise, and the RKHS
norm (the stabilizer) corresponds to a prior probability on the hypothesis space H .
Let us define $P[f|S_m]$ as the conditional probability of the function f given the training examples $S_m = (x_i, y_i)_{i=1}^m$, $P[S_m|f]$ as the conditional probability of $S_m$ given f, i.e., a model of the noise, and $P[f]$ as the a priori probability of the random field f.
Then Bayes’s theorem provides the posterior
distribution as
P [Sm |f ] P [f ]
P [f |Sm ] =
.
P (Sm )
If the noise is normally distributed with variance
σ , then the probability P [Sm |f ] is
P [Sm |f ] =
1 − 1 2 m
(y −f (xi ))2
e 2σ i=1 i
ZL
where ZL is a normalization constant.
2
1
If P [f ] = Zr e− f K , where Zr is another normalization constant, then
P [f |Sm ] =
1
−
e
ZD ZL Zr
1
2σ 2
m
2
i=1 (yi −f (xi )) +
f 2
K
.
One of the several possible estimates of f from
P [f |Sm ] is the so-called Maximum A Posteriori (MAP)
estimate, that is,
m
max P [f |Sm ] = min (yi − f (xi ))2 + 2σ 2 f 2K ,
i=1
which is the same as the regularization functional
if λ = 2σ 2 /m (for details and extensions to models
of non-Gaussian noise and different loss functions,
see [8]).
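Filling in the step between the posterior and the functional (a one-line check): taking the negative logarithm of $P[f|S_m]$ and discarding constants gives
\[
-\log P[f|S_m] = \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y_i - f(x_i))^2 + \|f\|_K^2 + \text{const},
\]
so maximizing the posterior amounts to minimizing $\sum_{i=1}^{m} (y_i - f(x_i))^2 + 2\sigma^2 \|f\|_K^2$; dividing by m puts this in the form of equation (4) with $\gamma = 2\sigma^2/m$.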
Necessary and Sufficient Conditions for
Learnability
Compactness of the hypothesis space H is sufficient for consistency of ERM, that is, for bounds of
the type of Theorem 1 on the sample error. The
necessary and sufficient condition is that H is a
uniform Glivenko-Cantelli class of functions, in
which case no specific topology is assumed for
H.² There are several equivalent conditions on H
such as finiteness of the Vγ dimension for all
positive γ (which reduces to finiteness of the VC
dimension for {0, 1} functions).³
We saw earlier that the regularization algorithm
equation (4) ensures (through the resulting compactness of the “effective” hypothesis space)
well-posedness of the problem. It also yields
convergence of the empirical error to the true error
(i.e., bounds such as Theorem 1). An open question
is whether there is a connection between well-posedness and consistency. For well-posedness
the critical condition is usually stability of the
solution. In the learning problem, this condition
refers to stability of the solution of ERM with
respect to small changes of the training set Sm . In
a similar way, the condition number characterizes
the stability of the solution of equation (3). Is it
possible that some specific form of stability may
be necessary and sufficient for consistency of ERM?
Such a result would be surprising, because, a
priori, there is no reason why there should be a
connection between well-posedness and consistency; they are both important requirements for
ERM, but they seem quite different and independent of each other.
²Definition: A class F of functions f is a uniform Glivenko-Cantelli class if for every ε > 0

$\lim_{m\to\infty}\, \sup_{\rho}\, P\Big\{ \sup_{f \in F} |E_{\rho_m} f - E_\rho f| > \varepsilon \Big\} = 0,$

where $\rho_m$ is the empirical measure supported on a set $x_1, \dots, x_m$.

³In [2], following [17], a necessary and sufficient condition is proved for uniform convergence of $|I_{\mathrm{emp}}[f] - I_{\mathrm{exp}}[f]|$, in terms of the finiteness for all γ > 0 of a combinatorial quantity called the Vγ dimension of F (which is the set $V(x), f(x),\ f \in H$), under some assumptions on V. The result is based on a necessary and sufficient (distribution-independent) condition which uses the metric entropy of F, defined as $H_m(\epsilon, F) = \sup_{x^m \in X^m} \log N(\epsilon, F, x^m)$, where $N(\epsilon, F, x^m)$ is the ε-covering of F with respect to $l^\infty_{x^m}$ ($l^\infty_{x^m}$ is the $l^\infty$ distance on the points $x^m$):

Theorem [Dudley, Giné, and Zinn]. F is a uniform Glivenko-Cantelli class if and only if $\lim_{m\to\infty} H_m(\epsilon, F)/m = 0$ for all ε > 0.

Learning Theory, Sample Complexity, and Brains
The theory of supervised learning outlined in this paper and in the references has achieved a remarkable degree of completeness and of practical success in many applications. Within it, many
interesting problems remain open and are a fertile
ground for interesting and useful mathematics.
One may also take a broader view and ask, What
next?
One could argue that the most important aspect
of intelligence and of the amazing performance
of real brains is the ability to learn. How then do
the learning machines we have described in the
theory compare with brains? There are of course
many aspects of biological learning that are not
captured by the theory and several difficulties in
making any comparison. One of the most obvious
differences, however, is the ability of people and
animals to learn from very few examples. The
algorithms we have described can learn an object
recognition task from a few thousand labeled
images. This is a small number compared with the
apparent dimensionality of the problem (millions
of pixels), but a child, or even a monkey, can learn
the same task from just a few examples. Of course,
evolution has probably done a part of the learning,
but so have we, when we choose for any given task
an appropriate input representation for our learning machine. From this point of view, as Donald
Geman has argued, the interesting limit is not “m
goes to infinity”, but rather “m goes to zero”. Thus
an important area for future theoretical and experimental work is learning from partially labeled
examples (and the related area of active learning).
In the first case there are only a small number ℓ of labeled pairs $(x_i, y_i)_{i=1}^{\ell}$—for instance, with $y_i$ binary—and many unlabeled data $(x_i)_{i=\ell+1}^{m}$, m ≫ ℓ.
Though interesting work has begun in this direction, a satisfactory theory that provides conditions
under which unlabeled data can be used is still lacking.
A comparison with real brains offers another,
and probably related, challenge to learning theory.
The “learning algorithms” we have described in
this paper correspond to one-layer architectures.
Are hierarchical architectures with more layers
justifiable in terms of learning theory? It seems that
the learning theory of the type we have outlined
does not offer any general argument in favor of
hierarchical learning machines for regression or
classification. This is somewhat of a puzzle, since the
organization of cortex—for instance, visual cortex—
is strongly hierarchical. At the same time, hierarchical learning systems show superior performance in
several engineering applications. For instance, a
face categorization system in which a single SVM
classifier combines the real-valued output of a few
classifiers, each trained to a different component of
faces (such as eye and nose), outperforms a single
classifier trained on full images of faces [1]. The
theoretical issues surrounding hierarchical systems
of this type are wide open and likely to be of
paramount importance for the next major development of efficient classifiers in several application domains.
Why hierarchies? There may be reasons of
efficiency—computational speed and use of computational resources. For instance, the lowest
levels of the hierarchy may represent a dictionary
of features that can be shared across multiple
classification tasks (see [9]). Hierarchical systems
usually decompose a task in a series of simple
computations at each level—often an advantage
for fast implementations. There may also be the
more fundamental issue of sample complexity. We
mentioned that an obvious difference between our
best classifiers and human learning is the number
of examples required in tasks such as object
detection. The theory described in this paper shows
that the difficulty of a learning task depends on
the size of the required hypothesis space. This
complexity determines in turn how many training
examples are needed to achieve a given level of
generalization error. Thus the complexity of the
hypothesis space sets the speed limit and the
sample complexity for learning. If a task—like a
visual recognition task—can be decomposed into
low-complexity learning tasks, for each layer of a
hierarchical learning machine, then each layer may
require only a small number of training examples.
Of course, not all classification tasks have a hierarchical representation. Roughly speaking, the
issue is under which conditions a function of
many variables can be approximated by a function
of a small number of functions of subsets of the
original variables. Neuroscience suggests that
what humans can learn can be represented by
hierarchies that are locally simple. Thus our
ability to learn from just a few examples, and
its limitations, may be related to the hierarchical
architecture of cortex. This is just one of several
possible connections, still to be characterized,
between learning theory and the ultimate problem
in natural science—the organization and the
principles of higher brain functions.
Acknowledgments
Thanks to Felipe Cucker, Federico Girosi, Don Glaser, Sayan Mukherjee, Martino Poggio, and Ryan Rifkin.

References
[1] For a full list of references, including specific applications mentioned in the article, see http://www.ai.mit.edu/projects/cbcl/projects/NoticesAMS/PoggioSmale.htm.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, J. ACM 44 (1997), no. 4, 615–631.
[3] A. R. Barron, Approximation and estimation bounds for artificial neural networks, Machine Learning 14 (1994), 115–133.
[4] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. (N.S.) 39 (2001), 1–49.
[5] ———, Best choices for regularization parameters in learning theory: On the bias-variance problem, Foundations Comput. Math. 2 (2002), no. 4, 413–428.
[6] R. A. DeVore, Nonlinear approximation, Acta Numer. 7 (1998), 51–150.
[7] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Appl. Math., vol. 31, Springer, New York, 1996.
[8] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1–50.
[9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, Basel, 2001.
[10] C. A. Micchelli, Interpolation of scattered data: Distance matrices and conditionally positive functions, Constr. Approx. 2 (1986), 11–22.
[11] P. Niyogi and F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comput. 8 (1996), 819–842.
[12] E. Parzen, An approach to time series analysis, Ann. Math. Statist. 32 (1961), 951–989.
[13] T. Poggio and F. Girosi, Networks for approximation and learning, Proc. IEEE 78 (1990), no. 9.
[14] R. M. Rifkin, Everything old is new again: A fresh look at historical approaches to machine learning, Ph.D. thesis, Massachusetts Institute of Technology, 2002.
[15] S. Smale and D. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003), 1–25.
[16] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[17] V. N. Vapnik and A. Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl. 16 (1971), no. 2, 264–280.
[18] G. Wahba, Spline Models for Observational Data, Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, PA, 1990.