The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale
The problem of understanding intelligence
is said to be the greatest problem in science today and “the” problem for this
century—as deciphering the genetic code
was for the second half of the last one.
Arguably, the problem of learning represents a
gateway to understanding intelligence in brains
and machines, to discovering how the human brain
works, and to making intelligent machines that
learn from experience and improve their competences as children do. In engineering, learning techniques would make it possible to develop software
that can be quickly customized to deal with the increasing amount of information and the flood of
data around us.
Examples abound. During the last decades, experiments in particle physics have produced a very
large amount of data. Genome sequencing is doing
the same in biology. The Internet is a vast repository of disparate information which changes rapidly
and grows at an exponential rate: it is now significantly more than 100 terabytes, while the Library
of Congress is about 20 terabytes.
We believe that a set of techniques based on a
new area of science and engineering becoming
known as "supervised learning" will become a key
technology to extract information from the ocean
of bits around us and to make sense of it.

Tomaso Poggio is Eugene McDermott Professor at the McGovern Institute and Artificial Intelligence Lab, MIT. His email address is [email protected]. Steve Smale is emeritus professor of mathematics at the University of California, Berkeley, and also holds a position at the Toyota Technological Institute at the University of Chicago. His email address is [email protected].
Supervised learning, or learning from examples,
refers to systems that are trained instead of programmed with a set of examples, that is, a set of
input-output pairs. Systems that could learn from
example to perform a specific task would have
many applications. A bank may use a program to
screen loan applications and approve the “good”
ones. Such a system would be trained with a set of
data from previous loan applications and the experience with their defaults. In this example, a loan
application is a point in a multidimensional space
of variables characterizing its properties; its associated output is a binary “good” or “bad” label.
In another example, a car manufacturer may
want to have in its models a system to detect pedestrians about to cross the road to alert drivers to a
possible danger while driving in downtown traffic.
Such a system could be trained with positive and
negative examples: images of pedestrians and images without pedestrians. In fact, software trained
in this way with thousands of images has been recently tested in an experimental car from Daimler.
It runs on a PC in the trunk and looks at the road
in front of the car through a digital camera [1].
Algorithms have been developed that can produce
a diagnosis of the type of cancer from a set of measurements of the expression level of many thousands
of human genes in a biopsy of the tumor measured
with a cDNA microarray containing probes for a
number of genes [1]. Again, the software learns the
classification rule from a set of examples, that is,
from examples of expression patterns in a number
of patients with known diagnoses. The challenge
in this case is the high dimensionality of the input
space—on the order of 20,000 genes—and the
small number of examples available for training—
around 50. In the future, similar learning techniques may be capable of some learning of a language and, in particular, of translating information
from one language to another.
What we assume in the above examples is a machine that is trained instead of programmed to
perform a task, given data of the form $(x_i, y_i)_{i=1}^m$.
Training means synthesizing a function that best
represents the relation between the inputs xi and
the corresponding outputs yi . The central question
of learning theory is how well this function generalizes, that is, how well it estimates the outputs
for previously unseen inputs.
As we will see more formally later, learning techniques are similar to fitting a multivariate function
to a certain number of measurement data. The key
point, as we just mentioned, is that the fitting
should be predictive in the same way that fitting
experimental data (see Figure 1) from an experiment
in physics can in principle uncover the underlying
physical law, which is then used in a predictive way.
In this sense, learning is also a principled method
for distilling predictive and therefore scientific
“theories” from the data.
We begin by presenting a simple "regularization" algorithm which is important in learning theory and its applications. We then outline briefly some of its applications and its performance. Next we provide a compact derivation of it. We then provide general theoretical foundations of learning theory. In particular, we outline the key ideas of decomposing the generalization error of a solution of the learning problem into a sample and an approximation error component. Thus both probability theory and approximation theory play key roles in learning theory. We apply the two theoretical bounds to the algorithm and describe for it the tradeoff—which is key in learning theory and its applications—between number of examples and complexity of the hypothesis space. We conclude with several remarks, both with an eye to history and to open problems for the future.

A Key Algorithm

Figure 1. How can we learn a function which is capable of generalization—among the many functions which fit the examples equally well (here m = 7)?
The Algorithm
How can we fit the “training” set of data
$S_m = (x_i, y_i)_{i=1}^m$ with a function $f : X \to Y$ (where X is a closed subset of $\mathbb{R}^n$ and $Y \subset \mathbb{R}$) that generalizes, i.e., is predictive? Here is an algorithm which
does just that and which is almost magical for its
simplicity and effectiveness:
1. Start with data $(x_i, y_i)_{i=1}^m$.
2. Choose a symmetric, positive-definite function $K_x(x') = K(x, x')$, continuous on $X \times X$. A kernel $K(t, s)$ is positive definite if $\sum_{i,j=1}^{n} c_i c_j K(t_i, t_j) \ge 0$ for any $n \in \mathbb{N}$ and any choice of $t_1, \dots, t_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$. An example of such a Mercer kernel is the Gaussian

(1)   $K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2}$

restricted to $X \times X$.
3. Define $f : X \to Y$ by

(2)   $f(x) = \sum_{i=1}^{m} c_i K_{x_i}(x),$

where $c = (c_1, \dots, c_m)$ and

(3)   $(m\gamma I + K)\, c = y,$

where I is the identity matrix, K is the square positive-definite matrix with elements $K_{i,j} = K(x_i, x_j)$, and y is the vector with coordinates $y_i$. The parameter γ is a positive, real number.
The linear system of equations (3) in m variables
is well-posed, since K is positive and (mγI + K) is
strictly positive. The condition number is good if
mγ is large. This type of system of equations has
been studied since Gauss, and the algorithms for
solving it efficiently represent one of the most developed areas in numerical and computational
analysis.
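A quick numerical illustration of this point (a sketch with a Gaussian kernel on arbitrary sample points of our own choosing) shows the condition number of mγI + K improving as mγ grows:

```python
import numpy as np

# Illustrative check (our own toy setup): condition number of (m*gamma*I + K)
# for a Gaussian Gram matrix K on m = 50 random one-dimensional points.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))  # sigma = 0.3
m = len(x)
for gamma in (1e-6, 1e-3, 1e-1):
    print(gamma, np.linalg.cond(m * gamma * np.eye(m) + K))
```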
What does equation (2) say? In the case of a
Gaussian kernel, the equation approximates the unknown function by a weighted superposition of
Gaussian “blobs”, each centered at the location xi
of one of the m examples. The weight ci of each
Gaussian is such as to minimize a regularized empirical error, that is, the error on the training set.
The σ of the Gaussian (together with γ, see later)
controls the degree of smoothing, of noise tolerance, and of generalization. Notice that for Gaussians with σ → 0 the representation of equation (2)
effectively becomes a “look-up” table that cannot
generalize (it provides the correct y = yi only when
x = xi and otherwise outputs 0 ).
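As a concrete illustration, here is a minimal Python/NumPy sketch of steps 1-3 (the function names, parameter values, and the toy one-dimensional data are our own choices, not part of the algorithm's description):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """The Mercer kernel of equation (1): K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def fit(X, y, gamma, sigma):
    """Step 3: solve the linear system (m*gamma*I + K) c = y of equation (3)."""
    m = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(m * gamma * np.eye(m) + K, y)

def predict(X_train, c, X_new, sigma):
    """Evaluate f(x) = sum_i c_i K_{x_i}(x), equation (2)."""
    return gaussian_kernel(X_new, X_train, sigma) @ c

# Toy data (our own choice): noisy samples of an unknown one-dimensional function.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(30)

c = fit(X, y, gamma=1e-3, sigma=0.3)
X_test = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(predict(X, c, X_test, sigma=0.3))
```

For larger problems one would factor the symmetric positive-definite matrix mγI + K (for instance by Cholesky) rather than calling a generic dense solver as this sketch does.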
Performance and Examples
The algorithm performs well in a number of applications involving regression as well as binary
classification. In the latter case the $y_i$ of the training set $(x_i, y_i)_{i=1}^m$ take the values $\{-1, +1\}$; the predicted label is then $-1$ or $+1$, depending on the sign
of the function f of equation (2).
Regression applications are the oldest. Typically they involved fitting data in a small number
of dimensions [1]. More recently, they also included
typical learning applications, sometimes with a
very high dimensionality. One example is the use
of an algorithm in computer graphics for synthesizing new images [1]. The inverse problem of
estimating facial expression and object pose from
an image is another successful application [1]. Still
another case is the control of mechanical arms.
There are also applications in finance, as, for
instance, the estimation of the price of derivative
securities, such as stock options. In this case the
algorithm replaces the classical Black-Scholes
equation (derived from first principles) by learning the map from an input space (volatility,
underlying stock price, time to expiration of the
option, etc.) to the output space (the price of the
option) from historical data [1].
Binary classification applications abound. The
algorithm was used to perform binary classification on a number of problems [1]. It was also used
to perform visual object recognition in a
view-independent way and in particular face recognition and sex categorization from face images [1].
Other applications span bioinformatics for classification of human cancer from microarray data,
text summarization, and sound classification.¹

¹The very closely related Support Vector Machine (SVM) classifier was used for the same family of applications and in particular for bioinformatics and for face recognition and car and pedestrian detection [1].
Surprisingly, it has been realized quite recently
that the same linear algorithm not only works well
but is fully comparable in binary classification
problems to the most popular classifiers of today
(that turn out to be of the same family; see later).
Derivation
The algorithm described can be derived from
Tikhonov regularization. To find the minimizer of
the error, we may try to solve the problem—called
Empirical Risk Minimization (ERM)—of finding the
function in H that minimizes
$\frac{1}{m}\sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2,$
which is in general ill-posed, depending on the
choice of the hypothesis space H . Following
Tikhonov (see for instance [8]), we minimize instead
over the hypothesis space $H_K$, for a fixed positive parameter γ, the regularized functional

(4)   $\frac{1}{m}\sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma\, \|f\|_K^2,$

where $\|f\|_K^2$ is the norm in $H_K$, the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel K. The last term in equation (4)—called the regularizer—
forces, as we will see, smoothness and uniqueness
of the solution.
Let us first define the norm $\|f\|_K^2$. Consider the space of the linear span of the functions $K_{x_j}$. We use $x_j$ to emphasize that the elements of X used in this construction do not have anything to do in general with the training set $(x_i)_{i=1}^m$. Define an inner product in this space by setting $\langle K_x, K_{x_j} \rangle = K(x, x_j)$ and extending linearly to $\sum_{j=1}^{r} a_j K_{x_j}$. The completion of the space in the associated norm is the RKHS, that is, a Hilbert space $H_K$ with the norm $\|f\|_K^2$ (see [4]). Note that $\langle f, K_x \rangle = f(x)$ for $f \in H_K$ (just let $f = K_{x_j}$ and extend linearly).
To minimize the functional in equation (4), we take the functional derivative with respect to f, apply it to an arbitrary element g of the RKHS, and set it equal to 0. We obtain

(5)   $\frac{1}{m}\sum_{i=1}^{m} (y_i - f(x_i))\, g(x_i) - \gamma\, \langle f, g \rangle = 0.$

Equation (5) must be valid for any g. In particular, setting $g = K_x$ gives

(6)   $f(x) = \sum_{i=1}^{m} c_i K_{x_i}(x),$

where

(7)   $c_i = \frac{y_i - f(x_i)}{m\gamma},$

since $\langle f, K_x \rangle = f(x)$. Equation (3) then follows by substituting equation (6) into equation (7).
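Spelling out that substitution (a short check using the notation already introduced): equation (7) says $m\gamma\, c_i = y_i - f(x_i)$, and by equation (6) $f(x_i) = \sum_{j=1}^{m} c_j K(x_i, x_j) = (Kc)_i$, so
\[
m\gamma\, c_i + (Kc)_i = y_i \qquad (i = 1, \dots, m),
\]
which in vector form is exactly $(m\gamma I + K)\, c = y$, equation (3).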
Notice also that essentially the same derivation
for a generic loss function $V(y, f(x))$, instead of $(f(x) - y)^2$, yields the same equation (6), but equation (3) is now different and, in general, nonlinear,
depending on the form of V . In particular, the
popular Support Vector Machine (SVM) regression
and SVM classification algorithms correspond
to special choices of nonquadratic V : one to provide a “robust” measure of error and the other to
approximate the ideal loss function corresponding
to binary (mis)classification. In both cases the
solution is still of the same form as equation (6) for
any choice of the kernel K. The coefficients ci are
no longer given by equation (7) but must be found
by solving a quadratic programming problem.
Theory
We give some further justification of the algorithm
by sketching very briefly its foundations in some
basic ideas of learning theory.
Here the data $(x_i, y_i)_{i=1}^m$ is supposed random so
that there is an unknown probability measure ρ on
the product space X × Y from which the data is
drawn.
This measure ρ defines a function

(8)   $f_\rho : X \to Y$

satisfying $f_\rho(x) = \int y \, d\rho_x$, where $\rho_x$ is the conditional measure on $x \times Y$.
From this construction fρ can be said to be
the true input-output function reflecting the
environment which produces the data. Thus a
measurement of the error of f is

(9)   $\int_X (f - f_\rho)^2 \, d\rho_X,$

where $\rho_X$ is the measure on X induced by ρ (sometimes called the marginal measure).
The goal of learning theory might be said to
“find” f minimizing this error. Now to search for
such an f , it is important to have a space H —the
hypothesis space—in which to work (“learning
does not take place in a vacuum”). Thus consider
a convex space of continuous functions f : X → Y
(remember Y ⊂ R ) that as a subset of C(X) is
compact, where C(X) is the Banach space of
continuous functions with $\|f\| = \max_X |f(x)|$.

A basic example is

(10)   $H = I_K(B_R),$

where $I_K : H_K \to C(X)$ is the inclusion and $B_R$ is the ball of radius R in $H_K$.
Starting from the data $(x_i, y_i)_{i=1}^m = z$, one may minimize $\frac{1}{m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$ over $f \in H$ to obtain a unique hypothesis $f_z : X \to Y$. This $f_z$ is called the empirical optimum, and we may focus on the problem of estimating

(11)   $\int_X (f_z - f_\rho)^2 \, d\rho_X.$
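To make these objects concrete, here is a small numerical sketch (entirely our own construction): the measure ρ, and hence $f_\rho$, are chosen explicitly, $f_z$ is obtained by minimizing the empirical error over a toy hypothesis space (polynomials of degree at most 5), and the integral in (11) is estimated by Monte Carlo over fresh samples from $\rho_X$:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_rho(x):
    # The regression function of the toy measure rho chosen below: f_rho(x) = E[y | x].
    return np.sin(3 * x)

def sample(m):
    # Draw (x_i, y_i) from rho: x uniform on [-1, 1], y = f_rho(x) + Gaussian noise.
    x = rng.uniform(-1.0, 1.0, m)
    return x, f_rho(x) + 0.2 * rng.standard_normal(m)

# Empirical optimum f_z: least-squares fit over polynomials of degree <= 5 (our toy H).
x_train, y_train = sample(50)
coeffs = np.polyfit(x_train, y_train, deg=5)
f_z = lambda x: np.polyval(coeffs, x)

# Monte Carlo estimate of (11): average of (f_z - f_rho)^2 over fresh x drawn from rho_X.
x_fresh = rng.uniform(-1.0, 1.0, 100_000)
print("estimated generalization error:", np.mean((f_z(x_fresh) - f_rho(x_fresh)) ** 2))
```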
It is useful towards this end to break the problem into steps by defining a "true optimum" $f_H$ relative to H by taking the minimum over H of $\int_X (f - f_\rho)^2$. Thus we may exhibit

(12)   $\int_X (f_z - f_\rho)^2 = S(z, H) + \int_X (f_H - f_\rho)^2 = S(z, H) + A(H),$

where
(13)   $S(z, H) = \int_X (f_z - f_\rho)^2 - \int_X (f_H - f_\rho)^2.$
The first term, S, on the right-hand side of equation (12) must be estimated in probability over z, and the estimate is called the sample error (sometimes also the estimation error). It is naturally studied in the theory of probability and of empirical processes [7]. The second term, A, is dealt with
via approximation theory (for a review see [6] and
also [10], [1]) and is called the approximation error.
The decomposition of equation (12) is indirectly related to the well-known bias and variance decomposition in statistics.
Sample Error
First consider an estimate for the sample error,
which will have the form
(14)   $S(z, H) \le \epsilon$

with high confidence, this confidence depending on ε and on the sample size m.
Let us be more precise. Recall that the covering
number Cov#(H , η) is the number of balls in H
of radius η needed to cover H .
Theorem 1. Suppose $|f(x) - y| \le M$ for all $f \in H$, for almost all $(x, y) \in X \times Y$. Then

$\mathrm{Prob}_{z \in (X \times Y)^m}\,\{ S(z, H) \le \epsilon \} \ge 1 - \delta,$

where $\delta = \mathrm{Cov}\#\!\left(H, \tfrac{\epsilon}{24M}\right) e^{-\frac{m\epsilon}{288 M^2}}$.
The result is Theorem C ∗ of [4], but earlier versions (usually without a topology on H ) have been
proved by others, especially Vapnik, who formulated the notion of VC dimension to measure the
complexity of the hypothesis space for the case of
{0, 1} functions.
In a typical case of Theorem 1 the hypothesis
space H is taken to be as in equation (10), where
BR is the ball of radius R in an RKHS with a smooth
K (or in a Sobolev space). In this context R plays
an analogous role to VC dimension [16]. Estimates
for the covering numbers in these cases were provided by Cucker, Smale, and Zhou [1].
The proof of Theorem 1 starts from the
Hoeffding inequality (which can be regarded as
an exponential version of Chebyshev’s inequality
of probability theory). One applies this estimate
to the function X × Y → R which takes (x, y) to
(f (x) − y)2 . Then extending the estimate to the set
of f ∈ H introduces the covering number into
the picture. With a little more work, Theorem 1 is
obtained.
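To get a feel for Theorem 1, the following sketch plugs illustrative numbers into the bound (the covering number, M, and ε below are placeholders of our own choosing, not estimates for any particular hypothesis space):

```python
import math

def theorem1_confidence(cov_number, M, eps, m):
    """1 - delta, with delta = Cov#(H, eps/24M) * exp(-m*eps / (288*M^2)) as in Theorem 1."""
    delta = cov_number * math.exp(-m * eps / (288.0 * M ** 2))
    return 1.0 - delta

# Placeholder values (ours): Cov#(H, eps/24M) = 1e6, M = 1, eps = 0.1.
for m in (10_000, 100_000, 1_000_000):
    conf = theorem1_confidence(cov_number=1e6, M=1.0, eps=0.1, m=m)
    print(m, conf)  # for small m the bound is vacuous (below 0); it improves rapidly with m
```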
Approximation Error
The approximation error $\int_X (f_H - f_\rho)^2$ may be studied as follows.

Suppose $B : L^2 \to L^2$ is a compact, strictly positive (selfadjoint) operator. Then let E be the Hilbert space

$\{ g \in L^2,\ \|B^{-s} g\| < \infty \}$

with inner product $\langle g, h \rangle_E = \langle B^{-s} g, B^{-s} h \rangle_{L^2}$. Suppose moreover that $E \to L^2$ factors as $E \to C(X) \to L^2$ with the inclusion $J_E : E \hookrightarrow C(X)$ well defined and compact.

Let H be $J_E(B_R)$ when $B_R$ is the ball of radius R in E. A theorem on the approximation error is

Theorem 2. Let $0 < r < s$ and H be as above. Then

$\| f_\rho - f_H \|^2 \le \left(\tfrac{1}{R}\right)^{\frac{2r}{s-r}} \| B^{-r} f_\rho \|^{\frac{2s}{s-r}}.$
We now use $\|\cdot\|$ for the norm in the space of square-integrable functions on X, with measure $\rho_X$. For our main example of RKHS, take $B = L_K^{1/2}$, where K is a Mercer kernel,

(15)   $L_K f(x) = \int_X f(x') K(x, x'),$

and we have taken the square root of the operator $L_K$. In this case E is $H_K$ as above.

Details and proofs may be found in [4] and in [15].
Sample and Approximation Error for the
Regularization Algorithm
The previous discussion depends upon a compact
hypothesis space H from which the experimental
optimum fz and the true optimum fH are taken. In
the key algorithm of the preceding section, the
optimization is done over all f ∈ HK with a regularized error function. The error analysis of the
preceding two subsections must therefore be
extended.
Thus let $f_{\gamma,z}$ be the empirical optimum for the regularized problem as exhibited in equation (4):

$\frac{1}{m}\sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma\, \|f\|_K^2.$
Then

(16)   $\int (f_{\gamma,z} - f_\rho)^2 \le S(\gamma) + A(\gamma),$

where A(γ) (the approximation error in this context) is

(17)   $A(\gamma) = \gamma^{1/2}\, \| L_K^{-1/4} f_\rho \|^2$

and the sample error is

(18)   $S(\gamma) = \frac{32 M^2 (\gamma + C)^2}{\gamma^2}\, v^*(m, \delta),$
where $v^*(m, \delta)$ is the unique solution of

(19)   $\frac{m}{4} v^3 - \ln\!\left(\frac{4m}{\delta}\right) v - c_1 = 0.$

Here $C, c_1 > 0$ depend only on X and K. For the proof one reduces to the case of compact H and applies Theorems 1 and 2. Thus finding the optimal solution is equivalent to finding the best tradeoff
between A and S for a given m. In our case, this
bias-variance problem is to minimize S(γ) + A(γ)
over γ > 0 . There is a unique solution—a best γ—
for the choice in equation (4). For this result and
its consequences see [5].
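A numerical sketch of this tradeoff is easy to set up; in the code below the constants M, C, c1 and the norm appearing in (17) are entirely illustrative placeholders of our own choosing (they depend on the unknown ρ and on K):

```python
import numpy as np

def v_star(m, delta, c1):
    """Unique positive root of (m/4) v^3 - ln(4m/delta) v - c1 = 0, equation (19)."""
    roots = np.roots([m / 4.0, 0.0, -np.log(4.0 * m / delta), -c1])
    return max(r.real for r in roots if abs(r.imag) < 1e-12 and r.real > 0)

def sample_error(gamma, m, delta, M=1.0, C=1.0, c1=1.0):
    """S(gamma) of equation (18)."""
    return 32.0 * M**2 * (gamma + C)**2 / gamma**2 * v_star(m, delta, c1)

def approx_error(gamma, norm_term=1.0):
    """A(gamma) of equation (17); norm_term stands in for ||L_K^{-1/4} f_rho||^2."""
    return np.sqrt(gamma) * norm_term

m, delta = 10_000, 0.05
gammas = np.logspace(-4, 2, 200)
total = [sample_error(g, m, delta) + approx_error(g) for g in gammas]
print("best gamma for this (illustrative) setting:", gammas[int(np.argmin(total))])
```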
Remarks
The Tradeoff between Sample Complexity and
Hypothesis Space Complexity
For a given fixed hypothesis space H , only the
sample error component of the error of fz can be
controlled (in equation (12) only S(z, H ) depends
on the data). In this view, convergence of S to zero
as the number of data points increases (Theorem 1)
is then the central problem in learning. Vapnik
called consistency of ERM (i.e., convergence of the
empirical error to the true error) the key problem
in learning theory, and in fact much modern
work has focused on refining the necessary and
sufficient conditions for consistency of ERM (the
uniform Glivenko-Cantelli property of H , finite Vγ
dimension for γ > 0 , etc.; see [8]). More generally,
however, there is a tradeoff between minimizing
the sample error and minimizing the approximation error—what we referred to as the bias-variance
problem. Increasing the number m of data points
decreases the sample error. The effect of increasing the complexity of the hypothesis space is
trickier. Usually the approximation error decreases
but the sample error increases. This means that
there is an optimal complexity of the hypothesis
space for a given number of training data. In the
case of the regularization algorithm described in
this paper, this tradeoff corresponds to an optimum
value for γ as studied in [3], [5], [11]. In empirical
work, the optimum value is often found through
cross-validation techniques [18].
This tradeoff between approximation error and
sample error is probably the most critical issue in
determining good performance on a given problem.
The class of regularization algorithms, such as
equation (4), shows clearly that it is also a tradeoff—
quoting Girosi—between the curse of dimensionality
(not enough examples) and the blessing of smoothness
(which decreases the effective “dimensionality”, i.e.,
the complexity of the hypothesis space) through the
parameter γ.
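In practice the cross-validation just mentioned can be carried out with standard tools. A minimal sketch using scikit-learn (our choice of library; its KernelRidge estimator solves (K + αI)c = y, so its parameter α plays the role of mγ in equation (3)):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Toy one-dimensional regression data (our own construction, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

grid = GridSearchCV(
    KernelRidge(kernel="rbf", gamma=10.0),        # here "gamma" is the RBF kernel width, not the regularization parameter
    param_grid={"alpha": np.logspace(-6, 2, 9)},  # candidate values of m*gamma
    cv=5,                                         # 5-fold cross-validation
)
grid.fit(X, y)
print("selected regularization (m*gamma):", grid.best_params_["alpha"])
```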
The Regularization Algorithm and Support
Vector Machines
There is nothing to stop us from using the algorithm we described in this paper—that is, square
loss regularization—for binary classification.
Whereas SVM classification arose from using—with
binary y—the loss function

$V(y, f(x)) = (1 - y f(x))_+,$

we can perform least-squares regularized classification via the loss function

$V(y, f(x)) = (f(x) - y)^2.$

This classification scheme was used at least as early as 1989 (for reviews see [13]) and then rediscovered by many others (see [1]), including Mangasarian (who refers to square loss regularization as "proximal vector machines") and Suykens (who uses the name "least square SVMs"). Rifkin [14] has confirmed the interesting empirical results by Mangasarian and Suykens: "classical" square loss regularization works well also for binary classification (examples are in Tables 1 and 2).

Table 1. A comparison of SVM and RLSC (Regularized Least Squares Classification) accuracy on a multiclass classification task (the 20newsgroups dataset with 20 classes and high dimensionality, around 50,000), performed using the standard "one versus all" scheme based on the use of binary classifiers. The top row indicates the number of documents/class used for training. Entries in the table are the fraction of misclassified documents. From [14].

  Documents/class:   800     250     100
  SVM                0.131   0.167   0.214
  RLSC               0.129   0.165   0.211

Table 2. A comparison of SVM and RLSC accuracy on another multiclass classification task (the sector105 dataset, consisting of 105 classes with dimensionality about 50,000). The top row indicates the number of documents/class used for training. Entries in the table are the fraction of misclassified documents. From [14].

  Documents/class:   52      20      10
  SVM                0.072   0.176   0.341
  RLSC               0.066   0.169   0.335
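As a small illustration of the same point (not a reproduction of the experiments reported in Tables 1 and 2), the following sketch compares a square-loss regularized classifier with an SVM on synthetic two-dimensional data using scikit-learn; every dataset and parameter choice here is our own:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
y_pm = 2 * y - 1  # labels in {-1, +1}, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, test_size=0.5, random_state=0)

# RLSC: square-loss regularization; predicted label = sign of the real-valued output.
rlsc = KernelRidge(kernel="rbf", gamma=1.0, alpha=0.1).fit(X_tr, y_tr)
rlsc_acc = np.mean(np.sign(rlsc.predict(X_te)) == y_te)

# SVM: hinge loss (1 - y f(x))_+, with the same Gaussian kernel.
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr)
svm_acc = np.mean(svm.predict(X_te) == y_te)

print(f"RLSC accuracy: {rlsc_acc:.3f}   SVM accuracy: {svm_acc:.3f}")
```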
In references to supervised learning, the Support
Vector Machine method is often described (see
for instance R. M. Karp’s article in the May 2002
issue of the Notices) according to the “traditional”
approach, introduced by Vapnik and followed by
almost everybody else. In this approach, one starts
with the concepts of separating hyperplanes and
margin. Given the data, one searches for the linear
hyperplane that separates the positive and the negative examples, assumed to be linearly separable,
with the largest margin (the margin is defined as
the distance from the hyperplane to the nearest
example). Most articles and books follow this
approach, go from the separable to the nonseparable case, and use a so-called “kernel trick” (!) to
extend it to the nonlinear case. SVM for classification was introduced by Vapnik in the linear, separable case in terms of maximizing the margin. In
the nonseparable case, the margin motivation
loses most of its meaning. A more general and
simpler framework for deriving SVM algorithms
for classification and regression is to regard them
as special cases of regularization and follow the
treatment of the section above on the key algorithm. In the case of linear functions f (x) = w · x
and separable data, maximizing the margin is
exactly equivalent to maximizing $1/\|w\|$, which is in turn equivalent to minimizing $\|w\|^2$, which corresponds to minimizing the RKHS norm.
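To make the equivalence explicit (a standard argument, not spelled out in the text): under the usual canonical normalization $\min_i y_i (w \cdot x_i) = 1$ for separable data, the distance from the hyperplane $w \cdot x = 0$ to the nearest example is
\[
\min_i \frac{y_i\,(w \cdot x_i)}{\|w\|} = \frac{1}{\|w\|},
\]
so maximizing the margin is the same as minimizing $\|w\|$, hence $\|w\|^2$, which for the linear kernel $K(x, x') = x \cdot x'$ is the RKHS norm of $f(x) = w \cdot x$.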
The Regularization Algorithm and Learning
Theory
The Mercer theorem was introduced in learning
theory by Vapnik; RKHS were explicitly introduced
in learning theory by Girosi and later by Vapnik [1],
[16]. Poggio and Girosi [13], [1] had introduced
Tikhonov regularization in learning theory. Earlier,
Gaussian Radial Basis Functions were proposed
as an alternative to neural networks by Broomhead
and Lowe. Of course, RKHS had been pioneered
by Parzen and Wahba ([12], [18]; for a review see [18])
for applications closely related to learning, including data smoothing (for image processing and
computer vision, see [1]).
A Bayesian Interpretation
The learning algorithm equation (4) has an interesting Bayesian interpretation [18]: the data term—that
is, the first term with the quadratic loss function—is
a model of (Gaussian, additive) noise, and the RKHS
norm (the stabilizer) corresponds to a prior probability on the hypothesis space H .
Let us define $P[f|S_m]$ as the conditional probability of the function f given the training examples $S_m = (x_i, y_i)_{i=1}^m$, $P[S_m|f]$ as the conditional probability of $S_m$ given f, i.e., a model of the noise, and $P[f]$ as the a priori probability of the random field f.
Then Bayes’s theorem provides the posterior
distribution as
P [Sm |f ] P [f ]
P [f |Sm ] =
.
P (Sm )
If the noise is normally distributed with variance
σ , then the probability P [Sm |f ] is
P [Sm |f ] =
1 − 1 2 m
(y −f (xi ))2
e 2σ i=1 i
ZL
where ZL is a normalization constant.
2
1
If P [f ] = Zr e− f K , where Zr is another normalization constant, then
P [f |Sm ] =
1
−
e
ZD ZL Zr
1
2σ 2
m
2
i=1 (yi −f (xi )) +
f 2
K
.
One of the several possible estimates of f from
P [f |Sm ] is the so-called Maximum A Posteriori (MAP)
estimate, that is,
m
max P [f |Sm ] = min (yi − f (xi ))2 + 2σ 2 f 2K ,
i=1
which is the same as the regularization functional
if λ = 2σ 2 /m (for details and extensions to models
of non-Gaussian noise and different loss functions,
see [8]).
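Filling in the step between the posterior and the functional (a one-line check): taking the negative logarithm of $P[f|S_m]$ and discarding constants gives
\[
-\log P[f|S_m] = \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y_i - f(x_i))^2 + \|f\|_K^2 + \text{const},
\]
so maximizing the posterior amounts to minimizing $\sum_{i=1}^{m} (y_i - f(x_i))^2 + 2\sigma^2 \|f\|_K^2$; dividing by m puts this in the form of equation (4) with $\gamma = 2\sigma^2/m$.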
Necessary and Sufficient Conditions for
Learnability
Compactness of the hypothesis space H is sufficient for consistency of ERM, that is, for bounds of
the type of Theorem 1 on the sample error. The
necessary and sufficient condition is that H is a
uniform Glivenko-Cantelli class of functions, in
which case no specific topology is assumed for
H.² There are several equivalent conditions on H
such as finiteness of the Vγ dimension for all
positive γ (which reduces to finiteness of the VC
dimension for {0, 1} functions).³
We saw earlier that the regularization algorithm
equation (4) ensures (through the resulting compactness of the “effective” hypothesis space)
well-posedness of the problem. It also yields
convergence of the empirical error to the true error
(i.e., bounds such as Theorem 1). An open question
is whether there is a connection between well-posedness and consistency. For well-posedness
the critical condition is usually stability of the
solution. In the learning problem, this condition
refers to stability of the solution of ERM with
respect to small changes of the training set Sm . In
a similar way, the condition number characterizes
the stability of the solution of equation (3). Is it
possible that some specific form of stability may
be necessary and sufficient for consistency of ERM?
Such a result would be surprising, because, a
priori, there is no reason why there should be a
connection between well-posedness and consistency; they are both important requirements for
ERM, but they seem quite different and independent of each other.
²Definition: A class F of functions f is a uniform Glivenko-Cantelli class if for every ε > 0

$\lim_{m\to\infty}\, \sup_{\rho}\, P\Big\{ \sup_{f \in F} |E_{\rho_m} f - E_\rho f| > \varepsilon \Big\} = 0,$

where $\rho_m$ is the empirical measure supported on a set $x_1, \dots, x_m$.

³In [2], following [17], a necessary and sufficient condition is proved for uniform convergence of $|I_{\mathrm{emp}}[f] - I_{\mathrm{exp}}[f]|$, in terms of the finiteness for all γ > 0 of a combinatorial quantity called the Vγ dimension of F (which is the set $V(x), f(x),\ f \in H$), under some assumptions on V. The result is based on a necessary and sufficient (distribution-independent) condition which uses the metric entropy of F, defined as $H_m(\epsilon, F) = \sup_{x^m \in X^m} \log N(\epsilon, F, x^m)$, where $N(\epsilon, F, x^m)$ is the ε-covering of F with respect to $l^\infty_{x^m}$ ($l^\infty_{x^m}$ is the $l^\infty$ distance on the points $x^m$):

Theorem [Dudley, Giné, and Zinn]. F is a uniform Glivenko-Cantelli class if and only if $\lim_{m\to\infty} H_m(\epsilon, F)/m = 0$ for all ε > 0.

Learning Theory, Sample Complexity, and Brains
The theory of supervised learning outlined in this paper and in the references has achieved a remarkable degree of completeness and of practical success in many applications. Within it, many
interesting problems remain open and are a fertile
ground for interesting and useful mathematics.
One may also take a broader view and ask, What
next?
One could argue that the most important aspect
of intelligence and of the amazing performance
of real brains is the ability to learn. How then do
the learning machines we have described in the
theory compare with brains? There are of course
many aspects of biological learning that are not
captured by the theory and several difficulties in
making any comparison. One of the most obvious
differences, however, is the ability of people and
animals to learn from very few examples. The
algorithms we have described can learn an object
recognition task from a few thousand labeled
images. This is a small number compared with the
apparent dimensionality of the problem (millions
of pixels), but a child, or even a monkey, can learn
the same task from just a few examples. Of course,
evolution has probably done a part of the learning,
but so have we, when we choose for any given task
an appropriate input representation for our learning machine. From this point of view, as Donald
Geman has argued, the interesting limit is not “m
goes to infinity”, but rather “m goes to zero”. Thus
an important area for future theoretical and experimental work is learning from partially labeled
examples (and the related area of active learning).
In the first case there are only a small number ℓ of labeled pairs $(x_i, y_i)_{i=1}^{\ell}$—for instance, with $y_i$ binary—and many unlabeled data $(x_i)_{i=\ell+1}^{m}$, m ≫ ℓ.
Though interesting work has begun in this direction, a satisfactory theory that provides conditions
under which unlabeled data can be used is still lacking.
A comparison with real brains offers another,
and probably related, challenge to learning theory.
The “learning algorithms” we have described in
this paper correspond to one-layer architectures.
Are hierarchical architectures with more layers
justifiable in terms of learning theory? It seems that
the learning theory of the type we have outlined
does not offer any general argument in favor of
hierarchical learning machines for regression or
classification. This is somewhat of a puzzle, since the
organization of cortex—for instance, visual cortex—
is strongly hierarchical. At the same time, hierarchical learning systems show superior performance in
several engineering applications. For instance, a
face categorization system in which a single SVM
classifier combines the real-valued output of a few
classifiers, each trained to a different component of
faces (such as eye and nose), outperforms a single
classifier trained on full images of faces [1]. The
theoretical issues surrounding hierarchical systems
of this type are wide open and likely to be of
paramount importance for the next major development of efficient classifiers in several application domains.
Why hierarchies? There may be reasons of
efficiency—computational speed and use of computational resources. For instance, the lowest
levels of the hierarchy may represent a dictionary
of features that can be shared across multiple
classification tasks (see [9]). Hierarchical systems
usually decompose a task in a series of simple
computations at each level—often an advantage
for fast implementations. There may also be the
more fundamental issue of sample complexity. We
mentioned that an obvious difference between our
best classifiers and human learning is the number
of examples required in tasks such as object
detection. The theory described in this paper shows
that the difficulty of a learning task depends on
the size of the required hypothesis space. This
complexity determines in turn how many training
examples are needed to achieve a given level of
generalization error. Thus the complexity of the
hypothesis space sets the speed limit and the
sample complexity for learning. If a task—like a
visual recognition task—can be decomposed into
low-complexity learning tasks, for each layer of a
hierarchical learning machine, then each layer may
require only a small number of training examples.
Of course, not all classification tasks have a hierarchical representation. Roughly speaking, the
issue is under which conditions a function of
many variables can be approximated by a function
of a small number of functions of subsets of the
original variables. Neuroscience suggests that
what humans can learn can be represented by
hierarchies that are locally simple. Thus our
ability to learn from just a few examples, and
its limitations, may be related to the hierarchical
architecture of cortex. This is just one of several
possible connections, still to be characterized,
between learning theory and the ultimate problem
in natural science—the organization and the
principles of higher brain functions.
Acknowledgments
Thanks to Felipe Cucker, Federico Girosi, Don Glaser, Sayan Mukherjee, Martino Poggio, and Ryan Rifkin.

References
[1] For a full list of references, including specific applications mentioned in the article, see http://www.ai.mit.edu/projects/cbcl/projects/NoticesAMS/PoggioSmale.htm.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, J. ACM 44 (1997), no. 4, 615–631.
[3] A. R. Barron, Approximation and estimation bounds for artificial neural networks, Machine Learning 14 (1994), 115–133.
[4] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. (N.S.) 39 (2001), 1–49.
[5] ———, Best choices for regularization parameters in learning theory: On the bias-variance problem, Foundations Comput. Math. 2 (2002), no. 4, 413–428.
[6] R. A. DeVore, Nonlinear approximation, Acta Numer. 7 (1998), 51–150.
[7] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Appl. Math., vol. 31, Springer, New York, 1996.
[8] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1–50.
[9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, Basel, 2001.
[10] C. A. Micchelli, Interpolation of scattered data: Distance matrices and conditionally positive functions, Constr. Approx. 2 (1986), 11–22.
[11] P. Niyogi and F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comput. 8 (1996), 819–842.
[12] E. Parzen, An approach to time series analysis, Ann. Math. Statist. 32 (1961), 951–989.
[13] T. Poggio and F. Girosi, Networks for approximation and learning, Proc. IEEE 78 (1990), no. 9.
[14] R. M. Rifkin, Everything old is new again: A fresh look at historical approaches to machine learning, Ph.D. thesis, Massachusetts Institute of Technology, 2002.
[15] S. Smale and D. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003), 1–25.
[16] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[17] V. N. Vapnik and A. Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl. 16 (1971), no. 2, 264–280.
[18] G. Wahba, Spline Models for Observational Data, Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, PA, 1990.