arXiv:1303.5976v1 [stat.ML] 24 Mar 2013

On Learnability, Complexity and Stability

Silvia Villa†, Lorenzo Rosasco†,§, Tomaso Poggio⊤,†
⊤ CBCL, Massachusetts Institute of Technology
† LCSL, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology
§ DIBRIS, Università degli Studi di Genova
[email protected], {lrosasco,tp}@mit.edu

March 26, 2013

Abstract

We consider the fundamental question of learnability of a hypothesis class in the supervised learning setting and in the general learning setting introduced by Vladimir Vapnik. We survey classic results characterizing learnability in terms of suitable notions of complexity, as well as more recent results that establish the connection between learnability and the stability of a learning algorithm.

1 Introduction

A key question in statistical learning is which hypothesis (function) spaces are learnable. Roughly speaking, a hypothesis space is learnable if there is a consistent learning algorithm for it, i.e. one returning an optimal solution as the number of samples goes to infinity. Classic results for supervised learning characterize learnability of a function class in terms of its complexity (combinatorial dimension) [17, 16, 1, 2, 9, 3]. Indeed, minimization of the empirical risk on a function class of finite complexity can be shown to be consistent. A key aspect of this approach is the connection with results from empirical process theory showing that finite combinatorial dimensions characterize the function classes for which a uniform law of large numbers holds, namely uniform Glivenko-Cantelli classes [7]. More recently, the concept of stability has emerged as an alternative and effective way to design consistent learning algorithms [4]. Stability refers broadly to the continuity of a learning algorithm with respect to its input, and it is known to play a crucial role in regularization theory [8].
Surprisingly, for certain classes of loss functions, a suitable notion of stability of ERM can be shown to characterize learnability of a function class [10, 12, 11]. In this paper, after recalling some basic concepts (Section 2), we review results characterizing learnability in terms of complexity and stability in supervised learning (Section 3) and in so-called general learning (Section 4). We conclude with some remarks and open questions.

2 Supervised Learning, Consistency and Learnability

In this section, we introduce basic concepts in Statistical Learning Theory (SLT). First we describe the supervised learning setting, and then define the notions of consistency of a learning algorithm and of learnability of a hypothesis class.

Consider a probability space (Z, ρ), where Z = X × Y, with X a measurable space and Y a closed subset of R. A loss function is a measurable map ℓ : R × Y → [0, +∞). We are interested in the problem of minimizing the expected risk,

    inf_{f ∈ F} E_ρ(f),    E_ρ(f) = ∫_{X×Y} ℓ(f(x), y) dρ(x, y),    (1)

where F ⊂ Y^X is the set of measurable functions from X to Y (endowed with the product topology and the corresponding Borel σ-algebra). The probability distribution ρ is assumed to be fixed but known only through a training set, i.e. a set of pairs z_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ Z^n sampled identically and independently according to ρ. Roughly speaking, the problem of supervised learning is that of approximately solving Problem (1) given a training set z_n.

Example 1 (Regression and Classification) In (bounded) regression, Y is a bounded interval in R, while in binary classification Y = {0, 1}. Examples of loss functions are the square loss ℓ(t, y) = (t − y)^2 in regression and the misclassification loss ℓ(t, y) = 1_{t ≠ y} in classification. See [16] for a more exhaustive list of loss functions.

In the next section, the notion of approximation considered in SLT is defined rigorously.
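As a concrete, if informal, illustration of the setting above, the following Python sketch computes the empirical counterpart of the risk in (1) for the two losses of Example 1. The sampling distribution and the candidate predictor are hypothetical choices made only for this example.

```python
import random

# Square loss for regression and misclassification (0-1) loss for
# classification, as in Example 1.
def square_loss(t, y):
    return (t - y) ** 2

def misclassification_loss(t, y):
    return 0.0 if t == y else 1.0

# Sample average of the loss over a training set zn = [(x1, y1), ...];
# this is the empirical proxy for the expected risk in (1).
def empirical_risk(f, zn, loss):
    return sum(loss(f(x), y) for x, y in zn) / len(zn)

# Hypothetical regression problem: y = 2x + Gaussian noise with
# standard deviation 0.1, so the expected square risk of the predictor
# f(x) = 2x equals the noise variance 0.01.
random.seed(0)
zn = [(x, 2 * x + random.gauss(0, 0.1))
      for x in (random.random() for _ in range(500))]
risk = empirical_risk(lambda x: 2 * x, zn, square_loss)
```

By the law of large numbers, `risk` concentrates around the expected risk 0.01 as the sample size grows, which is the sense in which the empirical risk approximates Problem (1).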
We first introduce the concepts of hypothesis space and learning algorithm.

Definition 1 A hypothesis space is a set of functions H ⊆ F. We say that H is universal if inf_F E_ρ = inf_H E_ρ for all distributions ρ on Z.

Definition 2 A learning algorithm A on H is a map

    A : ∪_{n∈N} Z^n → H,    z_n ↦ A_{z_n} = A(z_n),

such that, for all n ≥ 1, the restriction A|_{Z^n} is measurable with respect to the completion of the product σ-algebra on Z^n.

Empirical Risk Minimization (ERM) is arguably the most popular example of a learning algorithm in SLT.

Example 2 Given a training set z_n, the empirical risk E_{z_n} : F → R is defined as

    E_{z_n}(f) = (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i).

Given a hypothesis space H, ERM on H is defined by minimization of the empirical risk over H.

We add one remark.

Remark 1 (ERM and Asymptotic ERM) In general, some care is needed when defining ERM, since a (measurable) minimizer is not guaranteed to exist. When Y = {0, 1} and ℓ is the misclassification loss, it is easy to see that a minimizer exists (possibly non-unique). In this case measurability is studied, for example, in Lemma 6.17 of [15]. When considering more general loss functions or regression problems, one might need to consider learning algorithms defined by suitable (measurable) almost minimizers of the empirical risk (see e.g. Definition 10).

2.1 Consistency and Learnability

Aside from computational considerations, the following definition formalizes in which sense a learning algorithm approximately solves Problem (1).

Definition 3 We say that a learning algorithm A on H is uniformly consistent¹ if

    ∀ε > 0,    lim_{n→+∞} sup_ρ ρ^n { z_n : E_ρ(A_{z_n}) − inf_H E_ρ > ε } = 0,

and universally uniformly consistent if, in addition, H is universal.

The next definition shifts the focus from a learning algorithm on H to H itself.

Definition 4 We say that a space H is uniformly learnable if there exists a uniformly consistent learning algorithm on H. If H is also universal, we say that it is universally uniformly learnable.
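For a finite hypothesis space, the ERM rule of Example 2 can be implemented by direct search, which makes Definition 2 concrete. The class of threshold classifiers on X = [0, 1] below is a hypothetical choice for illustration only.

```python
# ERM on a finite hypothesis space H: return a minimizer of the
# empirical risk (for finite H a minimizer always exists).
def erm(H, zn, loss):
    return min(H, key=lambda f: sum(loss(f(x), y) for x, y in zn))

# Hypothetical class of threshold classifiers x -> 1{x >= t}.
def threshold(t):
    return lambda x: 1 if x >= t else 0

H = [threshold(t / 10) for t in range(11)]
zero_one = lambda t, y: 0 if t == y else 1

# Labels are generated by a threshold at 0.42, which is not in H.
zn = [(i / 100, 1 if i / 100 >= 0.42 else 0) for i in range(100)]
f_hat = erm(H, zn, zero_one)
# The empirical minimizer is the threshold at 0.4, which misclassifies
# only the two points 0.40 and 0.41.
```

Note that the minimizer over H has nonzero empirical risk here: the bias introduced by the choice of H is exactly what the notion of a universal hypothesis space (Definition 1) is meant to exclude.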
Note that, in the above definitions, the term "uniform" refers to the distributions for which consistency holds, whereas "universal" refers to the possibility of solving Problem (1) without a bias due to the choice of H. The requirement of uniform learnability implies the existence of a learning rate for A [15], or equivalently a bound on the sample complexity [2]. The following classical result, sometimes called the "no free lunch" theorem, shows that universal uniform learnability of a hypothesis space is too much to hope for.

Theorem 1 Let Y = {0, 1}, and let X be such that there exists an atom-free measure µ on X. Let ℓ be the misclassification loss. If H is universal, then H is not uniformly learnable.

The proof of the above result is based on Theorem 7.1 in [6], which shows that for each learning algorithm A on H and any fixed n, there exists a measure ρ on X × Y such that the expected value of E_ρ(A_{z_n}) − inf_H E_ρ is greater than 1/4. A general form of the no free lunch theorem, beyond classification, is given in [15] (see Corollary 6.8). In particular, this result shows that the no free lunch theorem holds for convex loss functions as soon as there are two probability distributions ρ_1, ρ_2 such that inf_H E_{ρ_1} ≠ inf_H E_{ρ_2} (assuming that minimizers exist). Roughly speaking, if there exist two learning problems with distinct solutions, then H cannot be universally uniformly learnable (this latter condition becomes more involved when the loss is not convex).

¹ Consistency can be defined with respect to other convergence notions for random variables. If the loss function is bounded, convergence in probability is equivalent to convergence in expectation.

The no free lunch theorem shows that universal uniform consistency is too strong a requirement. Restrictions on either the class of considered distributions ρ, or on the hypothesis spaces and algorithms, are needed to define a meaningful problem.
In the following, we take the latter approach, making assumptions on H (or A) but not on the class of distributions ρ.

3 Learnability of a Hypothesis Space

In this section we study uniform learnability by putting appropriate restrictions on the hypothesis space H. We are interested in conditions which are not only sufficient but also necessary. We discuss two series of results. The first is classical and characterizes learnability of a hypothesis space in terms of suitable complexity measures. The second, more recent, is based on the stability (in a suitable sense) of ERM on H.

3.1 Complexity and Learnability

Classically, assumptions on H are imposed in the form of restrictions on its "size", defined in terms of suitable notions of combinatorial dimension (complexity). The following definition of complexity for a class of binary-valued functions was introduced in [17].

Definition 5 Assume Y = {0, 1}. We say that H shatters S ⊆ X if for each E ⊆ S there exists f_E ∈ H such that f_E(x) = 0 if x ∈ E, and f_E(x) = 1 if x ∈ S \ E. The VC-dimension of H is defined as

    VC(H) = max{ d ∈ N : ∃ S = {x_1, ..., x_d} ⊆ X shattered by H }.

The VC-dimension turns out to be related to a special class of functions, called uniform Glivenko-Cantelli classes, for which a uniform form of the law of large numbers holds [7].

Definition 6 We say that H is a uniform Glivenko-Cantelli (uGC) class if it has the following property:

    ∀ε > 0,    lim_{n→+∞} sup_ρ ρ^n { z_n : sup_{f∈H} |E_ρ(f) − E_{z_n}(f)| > ε } = 0.

The following theorem completely characterizes learnability in classification.

Theorem 2 Let Y = {0, 1} and let ℓ be the misclassification loss. Then the following conditions are equivalent:

1. H is uniformly learnable,
2. ERM on H is uniformly consistent,
3. H is a uGC class,
4. the VC-dimension of H is finite.

The proof of the above result can be found, for example, in [2] (see Theorems 4.9, 4.10 and 5.2).
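The combinatorial content of Definition 5 can be checked by brute force for small finite classes: a set S is shattered exactly when every binary labeling of S is realized by some function in the class. The sketch below uses threshold classifiers on a finite grid as a hypothetical example; for thresholds no pair of points can be labeled (1, 0), so the VC-dimension is 1.

```python
from itertools import combinations

# S is shattered by H iff every one of the 2^|S| binary labelings of S
# is realized by some f in H (equivalent to Definition 5, which fixes
# the roles of E and S \ E).
def shatters(H, S):
    labelings = {tuple(f(x) for x in S) for f in H}
    return len(labelings) == 2 ** len(S)

# Largest cardinality of a shattered subset of a finite ground set X.
def vc_dimension(H, X):
    d = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, S) for S in combinations(X, k)):
            d = k
    return d

def threshold(t):
    return lambda x: 1 if x >= t else 0

# Thresholds realize (0,0), (0,1) and (1,1) on a pair x1 < x2, but
# never (1,0), so no pair is shattered.
H = [threshold(t) for t in (-1.0, 0.2, 0.4, 0.6, 0.8, 2.0)]
X = (0.1, 0.3, 0.5, 0.7, 0.9)
```

This exhaustive check is exponential in |S| and is meant only to make the definition concrete, not as a practical procedure.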
The characterization of uGC classes in terms of combinatorial dimensions is a central theme in empirical process theory [7]. The results on binary-valued functions are essentially due to Vapnik and Chervonenkis [17]. The proof that the uGC property of H implies its learnability is straightforward. The key step in the above proof is showing that learnability is sufficient for finite VC-dimension, i.e. VC(H) < ∞, and this step depends crucially on the considered loss function.

A similar result holds for bounded regression with the square [1, 2] and absolute [9, 3] loss functions. In this case a new notion of complexity needs to be defined, since the VC-dimension of a class of real-valued functions is not defined. Here we recall the definition of the γ-fat shattering dimension of a class of functions H, originally introduced in [9].

Definition 7 Let H be a set of functions from X to R and γ > 0. Consider S = {x_1, ..., x_d} ⊆ X. Then S is γ-shattered by H if there are real numbers r_1, ..., r_d such that for each E ⊆ S there is a function f_E ∈ H satisfying

    f_E(x_i) ≤ r_i − γ    if x_i ∈ S \ E,
    f_E(x_i) ≥ r_i + γ    if x_i ∈ E.

We say that (r_1, ..., r_d) witnesses the shattering. The γ-fat shattering dimension of H is

    fat_H(γ) = max{ d : ∃ S = {x_1, ..., x_d} ⊆ X such that S is γ-shattered by H }.

As mentioned above, an analogue of Theorem 2 can be proved for bounded regression with the square and absolute losses, if condition 4 is replaced by fat_H(γ) < +∞ for all γ > 0. We end by noting that it is an open question whether the above results hold for loss functions other than the square and absolute losses.

3.2 Stability and Learnability

In this section we show that learnability of a hypothesis space H is equivalent to the stability (in a suitable sense) of ERM on H. It is useful to introduce the following notation: for a given loss function ℓ, let L : F × Z → [0, ∞) be defined as L(f, z) = ℓ(f(x), y), for f ∈ F and z = (x, y) ∈ Z.
Moreover, let z_n^i be the training set z_n with the i-th point removed. With the above notation, the relevant notion of stability is given by the following definition.

Definition 8 A learning algorithm A on H is uniformly CVloo stable if there exist sequences (β_n, δ_n)_{n∈N} such that β_n → 0, δ_n → 0 and

    ρ^n { z_n : |L(A_{z_n^i}, z_i) − L(A_{z_n}, z_i)| ≤ β_n } ≥ 1 − δ_n,    (2)

for all i ∈ {1, ..., n} and all distributions ρ.

Before illustrating the implications of the above definition for learnability, we add a few comments and historical remarks. We note that, in a broad sense, stability refers to a quantification of the continuity of a map with respect to its input. The key role of stability in learning has long been advocated on the basis of the interpretation of supervised learning as an ill-posed inverse problem [11]. Indeed, the concept of stability is central in the theory of regularization of ill-posed problems [8]. A first quantitative connection between the performance of a symmetric learning algorithm² and a notion of stability was derived in the seminal paper [4]. There, a notion of stability called uniform stability is shown to be sufficient for consistency. If we let z_n^{i,u} be the training set z_n with the i-th point replaced by u, uniform stability is defined as

    |L(A_{z_n^{i,u}}, z) − L(A_{z_n}, z)| ≤ β_n,    (3)

for all z_n ∈ Z^n, all u, z ∈ Z and all i ∈ {1, ..., n}. A thorough investigation of weaker notions of stability is given in [10], where many different notions of stability are shown to be sufficient for consistency (and learnability), and the question is raised of whether stability (of ERM on H) can be shown to be necessary for learnability of H. In particular, a definition of CV stability for ERM is shown to be necessary and sufficient for learnability in the Probably Approximately Correct (PAC) setting, that is, when Y = {0, 1} and y = h*(x) for some h* ∈ H and all x ∈ X. Finally, Definition 8 of CVloo stability is given and studied in [11].
When compared to uniform stability, we see that: 1) the "leave one out" training set z_n^i is considered instead of the "replace one" training set z_n^{i,u}; 2) the error is evaluated at the point z_i which is left out, rather than at any possible z ∈ Z; and finally 3) the condition is required to hold only for a fraction 1 − δ_n of the training sets (which becomes increasingly large as n increases), rather than uniformly over all training sets z_n ∈ Z^n. The importance of CVloo stability is made clear by the following result.

Theorem 3 Let Y = {0, 1} and let ℓ be the misclassification loss. Then the following conditions are equivalent:

1. H is uniformly learnable,
2. ERM on H is CVloo stable.

The proof of the above result is given in [11] and consists essentially of two steps. The first is proving that CVloo stability of ERM on H implies that ERM is uniformly consistent. The second is showing that if H is a uGC class, then ERM on H is CVloo stable. Theorem 3 then follows from Theorem 2 (since uniform consistency of ERM on H and H being a uGC class are equivalent). Both steps in the above proof can be generalized to regression as long as the loss function is bounded. The latter assumption holds, for example, if the loss function satisfies a suitable Lipschitz condition and Y is compact (so that H is a set of uniformly bounded functions). However, generalizing Theorem 3 beyond classification requires generalizing Theorem 2. For the square and absolute loss functions and Y compact, the characterization of learnability in terms of the γ-fat shattering dimension can be used. It is an open question whether there is a more direct way to show that learnability is sufficient for stability, independently of Theorem 2, and to extend the above results to more general classes of loss functions. We will see a partial answer to this question in Section 4.

² We say that a learning algorithm A is symmetric if it does not depend on the order of the points in z_n.
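The leave-one-out quantity |L(A_{z_n^i}, z_i) − L(A_{z_n}, z_i)| of Definition 8 can be measured directly for a simple ERM. In the hypothetical sketch below (not taken from the references), the hypothesis space is the class of constant functions with the square loss, so ERM returns the mean of the outputs; for outputs in [0, 1] the leave-one-out gap is bounded by 2/(n − 1), consistently with β_n → 0.

```python
import random

# ERM over constant hypotheses with the square loss: the empirical
# risk minimizer is the mean of the outputs.
def erm_constant(ys):
    return sum(ys) / len(ys)

# Largest leave-one-out gap max_i |L(A_{z_n^i}, z_i) - L(A_{z_n}, z_i)|
# for this ERM, where L(f, (x, y)) = (f - y)^2.
def cvloo_gap(ys):
    m = erm_constant(ys)
    gaps = []
    for i, y in enumerate(ys):
        loo = erm_constant(ys[:i] + ys[i + 1:])  # i-th point removed
        gaps.append(abs((loo - y) ** 2 - (m - y) ** 2))
    return max(gaps)

random.seed(1)
small = [random.random() for _ in range(20)]
large = [random.random() for _ in range(2000)]
# For outputs in [0, 1] the gap is at most 2/(n - 1), since removing
# one point moves the mean by at most 1/(n - 1).
```

This is only a numerical illustration of the definition; Theorem 3 concerns the much stronger statement that such stability of ERM characterizes learnability.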
4 Learnability in the General Learning Setting

In the previous sections we focused on supervised learning. Here we ask whether the results we discussed extend to so-called general learning [16]. Let (Z, ρ) be a probability space and F a measurable space. A loss function is a map L : F × Z → [0, ∞) such that L(f, ·) is measurable for all f ∈ F. We are interested in the problem of minimizing the expected risk,

    inf_{f ∈ H} E_ρ(f),    E_ρ(f) = ∫_Z L(f, z) dρ(z),    (4)

when ρ is fixed but known only through a training set z_n = (z_1, ..., z_n) ∈ Z^n sampled identically and independently according to ρ. Definition 2 of a learning algorithm on H applies as is to this setting, and ERM on H is defined by minimization of the empirical risk

    E_{z_n}(f) = (1/n) ∑_{i=1}^n L(f, z_i).

While general learning is close to supervised learning, there are important differences: the data space Z has no natural decomposition, and F need not be a space of functions. Indeed, F and Z are related only via the loss function L. For our discussion it is important to note that the distinction between F and the hypothesis space H becomes blurred. In supervised learning, F is the largest set of functions for which Problem (1) is well defined (measurable functions in Y^X), and the choice of a hypothesis space corresponds intuitively to a more "manageable" function space. In general learning the choice of F is more arbitrary, and as a consequence the definition of a universal hypothesis space is less clear; the setting is too general for an analogue of the no free lunch theorem to hold. Given these premises, in what follows we simply identify F = H and consider the question of learnability, noting that the definition of uniform learnability extends naturally to general learning. We present two sets of ideas. The first, due to Vapnik, focuses on a more restrictive notion of consistency of ERM. The second investigates the characterization of uniform learnability in terms of stability.
4.1 Vapnik's Approach and Non-Trivial Consistency

The extension of the classical results characterizing learnability in terms of complexity measures is tricky. Since H is not a function space, the definitions of the VC or Vγ dimensions do not make sense. A possibility is to consider the class

    L ∘ H := { z ∈ Z ↦ L(f, z) for some f ∈ H }

and the corresponding VC dimension (if L is binary valued) or Vγ dimension (if L is real valued). Classic results on the equivalence between the uGC property and finite complexity apply to the class L ∘ H. Moreover, uniform learnability can easily be proved if L ∘ H is a uGC class. On the contrary, the reverse implication does not hold in the general learning setting. A counterexample is given in [16] (Sec. 3.1), showing that it is possible to design hypothesis classes with infinite VC (or Vγ) dimension which are uniformly learnable with ERM. The construction is as follows. Consider an arbitrary set H and a loss L for which the class L ∘ H has infinite VC (or Vγ) dimension. Define a new space H̃ := H ∪ {h̃} by adding to H an element h̃ such that L(h̃, z) ≤ L(h, z) for all z ∈ Z and h ∈ H.³ The space L ∘ H̃ has infinite VC, or Vγ, dimension and is trivially learnable by ERM, which is constant and coincides with h̃ for each probability measure ρ. This counterexample proves that learnability, and in particular learnability via ERM, does not imply finite VC or Vγ dimension. To avoid these cases of "trivial consistency" and to restore the equivalence between learnability and finite dimension, the following stronger notion of consistency for ERM was introduced by Vapnik [16].

Definition 9 ERM on H is strictly uniformly consistent if, for every c ∈ R,

    ∀ε > 0,    lim_{n→∞} sup_ρ ρ^n { z_n : inf_{H_c} E_{z_n} − inf_{H_c} E_ρ > ε } = 0,

where H_c = { f ∈ H : E_ρ(f) ≥ c }.

³ Note that this construction is not possible in classification or in regression with the square loss.
The following result characterizes strict uniform consistency in terms of the uGC property of the class L ∘ H (see Theorem 3.1 and its Corollary in [16]).

Theorem 4 Let B > 0 and assume L(f, z) ≤ B for all f ∈ H and z ∈ Z. Then the following conditions are equivalent:

1. ERM on H is strictly consistent,
2. L ∘ H is a uniform one-sided Glivenko-Cantelli class.

The definition of a one-sided Glivenko-Cantelli class simply corresponds to omitting the absolute value in Definition 6.

4.2 Stability and Learnability for General Learning

In this section we discuss ideas from [14] extending the stability approach to general learning. The following definitions are relevant.

Definition 10 A uniform Asymptotic ERM (AERM) algorithm A on H is a learning algorithm such that

    ∀ε > 0,    lim_{n→∞} sup_ρ ρ^n { z_n : E_{z_n}(A_{z_n}) − inf_H E_{z_n} > ε } = 0.

Definition 11 A learning algorithm A on H is uniformly replace-one (RO) stable if there exists a sequence β_n → 0 such that

    (1/n) ∑_{i=1}^n |L(A_{z_n^{i,u}}, z) − L(A_{z_n}, z)| ≤ β_n,

for all z_n ∈ Z^n and all u, z ∈ Z.

Note that the above definition is close to that of uniform stability (3), although the latter turns out to be a stronger condition. The importance of the above definitions is made clear by the following result.

Theorem 5 Let B > 0 and assume L(f, z) ≤ B for all f ∈ H and z ∈ Z. Then the following conditions are equivalent:

1. H is uniformly learnable,
2. there exists an AERM algorithm on H which is RO stable.

As mentioned in Remark 1, Theorem 3 holds not only for exact minimizers of the empirical risk, but also for AERMs. In this view, there is a subtle difference between Theorem 3 and Theorem 5.
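Definitions 10 and 11 can be illustrated on a minimal general learning problem (a hypothetical example, not taken from [14]): Z = [0, 1], H = R, and L(f, z) = (f − z)^2, so exact ERM (in particular an AERM) returns the sample mean. Replacing one point moves the mean by at most 1/n, which gives RO stability with β_n = 2/n.

```python
# Exact ERM for L(f, z) = (f - z)^2 over H = R: the sample mean.
def erm_mean(zn):
    return sum(zn) / len(zn)

# The replace-one average of Definition 11,
# (1/n) sum_i |L(A_{z_n^{i,u}}, z) - L(A_{z_n}, z)|,
# for a fixed replacement point u and evaluation point z.
def ro_stability_term(zn, u, z):
    n = len(zn)
    m = erm_mean(zn)
    total = 0.0
    for zi in zn:
        m_rep = m + (u - zi) / n  # mean after replacing z_i by u
        total += abs((m_rep - z) ** 2 - (m - z) ** 2)
    return total / n

# For data in [0, 1] the term is bounded by 2/n, so beta_n -> 0.
zn_small = [i / 9 for i in range(10)]       # n = 10
zn_large = [i / 999 for i in range(1000)]   # n = 1000
```

Note that H here is a set of real numbers rather than functions, which is exactly the general learning situation where F and Z are related only through L.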
In supervised learning, Theorem 3 shows that uniform learnability implies that every ERM (AERM) is stable, while in general learning, Theorem 5 shows that uniform learnability implies the existence of a stable AERM (whose construction is not explicit). The proof of the above result is given in Theorem 7 of [14]. The hard part of the proof is showing that learnability implies the existence of an RO stable AERM. This part of the proof is split into two steps, showing that: 1) if there is a uniformly consistent algorithm A, then there exists a uniformly consistent AERM A′ (Lemma 20 and Theorem 10); 2) every uniformly consistent AERM is also RO stable (Theorem 9). Note that the results in [14] are given in expectation, together with a quantification of how the different convergence rates are related. Here we have stated results in probability, to be uniform with the rest of the paper, and only in asymptotic form, to simplify the presentation.

5 Discussion

In this paper we reviewed several results concerning the learnability of a hypothesis space. Extensions of these ideas can be found in [5] (and references therein) for multi-category classification, and in [13] for sequential prediction. It would be interesting to devise constructive proofs in general learning suggesting how stable learning algorithms can be designed. Moreover, it would be interesting to study universal consistency and learnability in the case of samples from non-stationary processes.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997.

[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.

[3] P. Bartlett, P. Long, and R. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996.

[4] O. Bousquet and A. Elisseeff. Stability and generalization.
Journal of Machine Learning Research, 2:499–526, 2002.

[5] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the ERM principle. Journal of Machine Learning Research - Proceedings Track, 19:207–232, 2011.

[6] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.

[7] R. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko-Cantelli classes. J. Theoret. Prob., 4:485–510, 1991.

[8] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.

[9] M. J. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic concepts. In Computational Learning Theory and Natural Learning Systems, Vol. I, Bradford Book, pages 289–329. MIT Press, Cambridge, MA, 1994.

[10] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical Report TR-2002-03, Department of Computer Science, The University of Chicago, 2002.

[11] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161–193, 2006.

[12] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419–422, 2004.

[13] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: beyond regret. Journal of Machine Learning Research - Proceedings Track, 19:559–594, 2011.

[14] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, 2010.

[15] I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008.

[16] V. Vapnik.
The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

[17] V. N. Vapnik and A. Y. Chervonenkis. Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Avtomat. i Telemeh., (2):42–53, 1971.