arXiv:1303.5976v1 [stat.ML] 24 Mar 2013
On Learnability, Complexity and Stability
Silvia Villa† , Lorenzo Rosasco†,§ , Tomaso Poggio⊤,†
⊤ CBCL, Massachusetts Institute of Technology
† LCSL , Istituto Italiano di Tecnologia and Massachusetts Institute of Technology
§ DIBRIS, Universita’ degli Studi di Genova
[email protected], {lrosasco,tp}@mit.edu
March 26, 2013
Abstract
We consider the fundamental question of learnability of a hypotheses class in
the supervised learning setting and in the general learning setting introduced by
Vladimir Vapnik. We survey classic results characterizing learnability in terms of
suitable notions of complexity, as well as more recent results that establish the
connection between learnability and stability of a learning algorithm.
1 Introduction
A key question in statistical learning is which hypotheses (function) spaces are learnable. Roughly speaking, a hypotheses space is learnable if there is a consistent learning
algorithm, i.e. one returning an optimal solution as the number of samples goes to infinity. Classic results for supervised learning characterize learnability of a function
class in terms of its complexity (combinatorial dimension) [17, 16, 1, 2, 9, 3]. Indeed, minimization of the empirical risk on a function class having finite complexity
can be shown to be consistent. A key aspect in this approach is the connection with
empirical process theory results showing that finite combinatorial dimensions characterize function classes for which a uniform law of large numbers holds, namely uniform
Glivenko-Cantelli classes [7].
More recently, the concept of stability has emerged as an alternative and effective
method to design consistent learning algorithms [4]. Stability refers broadly to continuity properties of a learning algorithm with respect to its input, and it is known to play a crucial role
in regularization theory [8]. Surprisingly, for certain classes of loss functions, a suitable notion of stability of ERM can be shown to characterize learnability of a function
class [10, 12, 11].
In this paper, after recalling some basic concepts (Section 2), we review results
characterizing learnability in terms of complexity and stability in supervised learning
(Section 3) and in the so-called general learning setting (Section 4). We conclude with some
remarks and open questions.
2 Supervised Learning, Consistency and Learnability
In this section, we introduce basic concepts in Statistical Learning Theory (SLT). First,
we describe the supervised learning setting, and then, define the notions of consistency
of a learning algorithm and of learnability of a hypotheses class.
Consider a probability space (Z , ρ ), where Z = X × Y , with X a measurable
space and Y a closed subset of R. A loss function is a measurable map ℓ : R × Y →
[0, +∞). We are interested in the problem of minimizing the expected risk,
\[
\inf_{f \in \mathcal{F}} \mathcal{E}_\rho(f), \qquad \mathcal{E}_\rho(f) = \int_{X \times Y} \ell(f(x), y)\, d\rho(x, y), \tag{1}
\]
where F ⊂ Y^X is the set of measurable functions from X to Y (endowed with the
product topology and the corresponding Borel σ -algebra). The probability distribution
ρ is assumed to be fixed but known only through a training set, i.e. a set of pairs
zn = ((x1 , y1 ), . . . , (xn , yn )) ∈ Z n sampled identically and independently according to
ρ. Roughly speaking, the problem of supervised learning is that of approximately
solving Problem (1) given a training set zn .
Example 1 (Regression and Classification) In (bounded) regression Y is a bounded
interval in R, while in binary classification Y = {0, 1}. Examples of loss functions are
the square loss ℓ(t, y) = (t − y)^2 in regression and the misclassification loss ℓ(t, y) =
1{t ≠ y} in classification. See [16] for a more exhaustive list of loss functions.
In the next section, the notion of approximation considered in SLT is defined rigorously.
We first introduce the concepts of hypotheses space and learning algorithm.
Definition 1 A hypotheses space is a set of functions H ⊆ F . We say that H is
universal if inf_F Eρ = inf_H Eρ, for all distributions ρ on Z.
Definition 2 A learning algorithm A on H is a map
\[
A : \bigcup_{n \in \mathbb{N}} Z^n \to \mathcal{H}, \qquad z_n \mapsto A_{z_n} = A(z_n),
\]
such that, for all n ≥ 1, A|Z^n is measurable with respect to the completion of the product
σ-algebra on Z^n.
Empirical Risk Minimization (ERM) is arguably the most popular example of learning
algorithm in SLT.
Example 2 Given a training set z_n, the empirical risk E_{z_n} : F → R is defined as
\[
\mathcal{E}_{z_n}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).
\]
Given a hypotheses space H , ERM on H is defined by minimization of the empirical
risk on H .
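As a minimal sketch of Example 2, the following Python snippet computes the empirical risk and performs ERM over a finite hypotheses space. The threshold classifiers, the grid of thresholds and the four-point training set are illustrative assumptions, not part of the original text.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]           # one labelled pair (x, y)
Hypothesis = Callable[[float], int]  # a map f : X -> Y


def misclassification_loss(t: int, y: int) -> float:
    """Misclassification loss: 1 if the prediction differs from the label."""
    return float(t != y)


def empirical_risk(f: Hypothesis, sample: Sequence[Sample],
                   loss=misclassification_loss) -> float:
    """Average loss of f over the training set z_n."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)


def erm(hypotheses: Sequence[Hypothesis], sample: Sequence[Sample],
        loss=misclassification_loss) -> Hypothesis:
    """Return a minimizer of the empirical risk over a finite hypotheses space."""
    return min(hypotheses, key=lambda f: empirical_risk(f, sample, loss))


def threshold(t: float) -> Hypothesis:
    """Hypothetical threshold classifier f_t(x) = 1{x > t}."""
    return lambda x: int(x > t)


# Toy data (assumed for illustration only).
hypotheses: List[Hypothesis] = [threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
sample: List[Sample] = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]

risks = [empirical_risk(f, sample) for f in hypotheses]
best = erm(hypotheses, sample)
```

For a finite hypotheses space a minimizer always exists, so the measurability and existence issues that arise in general do not show up in this toy setting.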
We add one remark.
Remark 1 (ERM and Asymptotic ERM) In general, some care is needed when defining ERM, since a (measurable) minimizer is not guaranteed to exist. When Y =
{0, 1} and ℓ is the misclassification loss function, it is easy to see that a minimizer exists (possibly non-unique). In this case measurability is studied, for example, in Lemma
6.17 in [15]. When considering more general loss functions or regression problems,
one might need to consider learning algorithms defined by suitable (measurable) almost minimizers of the empirical risk (see e.g. Definition 10).
2.1 Consistency and Learnability
Aside from computational considerations, the following definition formalizes in which
sense a learning algorithm approximately solves Problem (1).
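Definition 3 We say that a learning algorithm A on H is uniformly consistent¹ if
\[
\forall \varepsilon > 0, \qquad \lim_{n \to +\infty} \sup_{\rho} \rho^n \Big( \big\{ z_n : \mathcal{E}_\rho(A_{z_n}) - \inf_{\mathcal{H}} \mathcal{E}_\rho > \varepsilon \big\} \Big) = 0,
\]
and universally uniformly consistent if H is universal.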
The next definition shifts the focus from a learning algorithm on H , to H itself.
Definition 4 We say that a space H is uniformly learnable if there exists a uniformly
consistent learning algorithm on H . If H is also universal we say that it is universally
uniformly learnable.
Note that, in the above definitions, the term “uniform” refers to the distribution for
which consistency holds, whereas “universal” refers to the possibility of solving Problem (1) without a bias due to the choice of H . The requirement of uniform learnability
implies the existence of a learning rate for A [15] or equivalently a bound on the sample
complexity [2]. The following classical result, sometimes called the “no free lunch”
theorem, shows that uniform universal learnability of a hypotheses space is too much
to hope for.
Theorem 1 Let Y = {0, 1}, and X such that there exists a measure µ on X having
an atom-free distribution. Let ℓ be the misclassification loss. If H is universal, then
H is not uniformly learnable.
The proof of the above result is based on Theorem 7.1 in [6], which shows that for each
learning algorithm A on H and any fixed n, there exists a measure ρ on X × Y such
that the expected value of Eρ (Azn ) − infH Eρ is greater than 1/4. A general form of the
no free lunch theorem, beyond classification, is given in [15] (see Corollary 6.8). In particular, this result shows that the no free lunch theorem holds for convex loss functions,
as soon as there are two probability distributions ρ1, ρ2 such that inf_H E_{ρ1} ≠ inf_H E_{ρ2}
(assuming that minimizers exist). Roughly speaking, if there exist two learning problems with distinct solutions, then H cannot be universally uniformly learnable (this
latter condition becomes more involved when the loss is not convex).
¹ Consistency can be defined with respect to other convergence notions for random variables. If the loss
function is bounded, convergence in probability is equivalent to convergence in expectation.
The no free lunch theorem shows that universal uniform consistency is too strong
a requirement. Restrictions on either the class of considered distributions ρ or the
hypotheses spaces/algorithms are needed to define a meaningful problem. In the following, we will follow the latter approach, where assumptions on H (or A), but not on
the class of distributions ρ, are made.
3 Learnability of a Hypotheses Space
In this section we study uniform learnability by putting appropriate restrictions on the
hypotheses space H . We are interested in conditions which are not only sufficient but
also necessary. We discuss two series of results. The first is classical and characterizes learnability of a hypotheses space in terms of suitable complexity measures. The
second, more recent, is based on the stability (in a suitable sense) of ERM on H .
3.1 Complexity and Learnability
Classically, assumptions on H are imposed in the form of restrictions on its “size”,
defined in terms of suitable notions of combinatorial dimensions (complexity). The
following definition of complexity for a class of binary valued functions has been introduced in [17].
Definition 5 Assume Y = {0, 1}. We say that H shatters S ⊆ X if for each E ⊆ S
there exists f_E ∈ H such that f_E(x) = 0 if x ∈ E, and f_E(x) = 1 if x ∈ S \ E. The
VC-dimension of H is defined as
\[
\mathrm{VC}(\mathcal{H}) = \max\{ d \in \mathbb{N} : \exists\, S = \{x_1, \dots, x_d\} \text{ shattered by } \mathcal{H} \}.
\]
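For a finite class on a finite domain, Definition 5 can be checked by brute force. The sketch below is only an illustration under that finiteness assumption; the threshold and interval classes, and the four-point domain, are hypothetical examples chosen here.

```python
from itertools import combinations


def shatters(hypotheses, points):
    """True if every 0/1 labelling of `points` is realized by some hypothesis."""
    realized = {tuple(f(x) for x in points) for f in hypotheses}
    return len(realized) == 2 ** len(points)


def vc_dimension(hypotheses, domain):
    """Largest d such that some d-point subset of `domain` is shattered."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, d)):
            best = d
        else:
            # Subsets of shattered sets are shattered, so we can stop here.
            break
    return best


domain = [1, 2, 3, 4]
# Thresholds f_t(x) = 1{x > t} and intervals f_{a,b}(x) = 1{a <= x <= b}
# (pairs with b < a give the empty interval, i.e. the all-zero labelling).
thresholds = [lambda x, t=t: int(x > t) for t in range(0, 5)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(1, 5) for b in range(0, 5)]

vc_thresholds = vc_dimension(thresholds, domain)
vc_intervals = vc_dimension(intervals, domain)
```

The search is exponential in the size of the domain, so it is meant only to make the definition concrete, not as a practical procedure.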
The VC-dimension turns out to be related to a special class of functions, called uniform
Glivenko-Cantelli, for which a uniform form of the law of large numbers holds [7].
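Definition 6 We say that H is a uniform Glivenko-Cantelli (uGC) class if it has the
following property:
\[
\forall \varepsilon > 0, \qquad \lim_{n \to +\infty} \sup_{\rho} \rho^n \Big\{ z_n : \sup_{f \in \mathcal{H}} \big| \mathcal{E}_\rho(f) - \mathcal{E}_{z_n}(f) \big| > \varepsilon \Big\} = 0.
\]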
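The quantity inside Definition 6 can be simulated for a single, fixed distribution. The sketch below assumes a toy model that is not in the original text: x uniform on [0, 1], labels y = 1{x > 0.5}, threshold classifiers f_t(x) = 1{x > t} and the misclassification loss, for which the expected risk is |t − 0.5| in closed form. The supremum over the class is approximated on a finite grid of thresholds, and the supremum over ρ in the definition is not explored.

```python
import random


def sup_deviation(n, thresholds, rng):
    """Approximate sup_t |E_rho(f_t) - E_zn(f_t)| for one sample z_n.

    Assumed toy model: x ~ Uniform[0, 1], y = 1{x > 0.5}, f_t(x) = 1{x > t},
    misclassification loss, so that E_rho(f_t) = |t - 0.5|.
    """
    xs = [rng.random() for _ in range(n)]
    worst = 0.0
    for t in thresholds:
        # f_t errs exactly on the points where 1{x > t} differs from the label.
        empirical = sum((x > t) != (x > 0.5) for x in xs) / n
        expected = abs(t - 0.5)
        worst = max(worst, abs(expected - empirical))
    return worst


rng = random.Random(0)                  # fixed seed for reproducibility
grid = [k / 100 for k in range(101)]    # finite grid standing in for the class
deviations = {n: sup_deviation(n, grid, rng) for n in (20, 200, 2000)}
```

Since threshold classifiers have finite VC-dimension, the deviation is expected to become small as n grows; a single run only illustrates this and proves nothing about uniformity in ρ.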
The following theorem completely characterizes learnability in classification.
Theorem 2 Let Y = {0, 1} and ℓ be the misclassification loss. Then the following
conditions are equivalent:
1. H is uniformly learnable,
2. ERM on H is uniformly consistent,
3. H is a uGC-class,
4. the VC-dimension of H is finite.
The proof of the above result can be found for example in [2] (see Theorems 4.9, 4.10
and 5.2). The characterization of uGC classes in terms of combinatorial dimensions is
a central theme in empirical process theory [7]. The results on binary valued functions
are essentially due to Vapnik and Chervonenkis [17]. The proof that uGC of H implies
its learnability is straightforward. The key step in the above proof is showing that
learnability is sufficient for finite VC-dimension, i.e. VC(H ) < ∞. The proof of this
last step crucially depends on the considered loss function.
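The step from the uGC property to learnability can be spelled out with a standard decomposition. This is a sketch in notation introduced here rather than in the original: \hat f denotes an empirical risk minimizer on the sample z_n, and f_\eta is any element of the class with \mathcal{E}_\rho(f_\eta) \le \inf_{\mathcal{H}} \mathcal{E}_\rho + \eta.

```latex
\begin{align*}
\mathcal{E}_\rho(\hat f) - \inf_{\mathcal{H}} \mathcal{E}_\rho
 &\le \big[\mathcal{E}_\rho(\hat f) - \mathcal{E}_{z_n}(\hat f)\big]
    + \big[\mathcal{E}_{z_n}(\hat f) - \mathcal{E}_{z_n}(f_\eta)\big]
    + \big[\mathcal{E}_{z_n}(f_\eta) - \mathcal{E}_\rho(f_\eta)\big] + \eta \\
 &\le 2 \sup_{f \in \mathcal{H}} \big|\mathcal{E}_\rho(f) - \mathcal{E}_{z_n}(f)\big| + \eta .
\end{align*}
```

The middle bracket is non-positive because \hat f minimizes the empirical risk. Since \eta > 0 is arbitrary, the event that the excess risk exceeds \varepsilon is contained in the event that the uniform deviation exceeds \varepsilon / 2, whose probability vanishes uniformly in ρ for a uGC class.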
A similar result holds for bounded regression with the square [1, 2] and absolute loss
functions [9, 3]. In this case, a new notion of complexity needs to be defined since
the VC-dimension of real valued function classes is not defined. Here, we recall the
definition of γ -fat shattering dimension of a class of functions H originally introduced
in [9].
Definition 7 Let H be a set of functions from X to R and γ > 0. Consider S =
{x_1, . . . , x_d} ⊂ X. Then S is γ-shattered by H if there are real numbers r_1, . . . , r_d such
that for each E ⊆ S there is a function f_E ∈ H satisfying
\[
\begin{cases}
f_E(x_i) \le r_i - \gamma & \text{if } x_i \in S \setminus E, \\
f_E(x_i) \ge r_i + \gamma & \text{if } x_i \in E.
\end{cases}
\]
We say that (r1 , . . . , rd ) witnesses the shattering. The γ -fat shattering dimension of H
is
fatH (γ ) = max{d : ∃ S = {x1 , . . . , xd } ⊆ X s.t. S is γ -shattered by H }.
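Definition 7 can likewise be checked directly for a finite class once candidate witnesses are given. The sketch below does not search for witnesses; it only verifies a proposed tuple (r_1, ..., r_d). The four-function class on two points is a hypothetical example.

```python
from itertools import product


def gamma_shatters(functions, points, witnesses, gamma):
    """Check Definition 7 for a finite class and given candidate witnesses r_i."""
    for pattern in product([False, True], repeat=len(points)):
        # pattern[i] is True when x_i belongs to the subset E.
        def realizes(f):
            return all(
                (f(x) >= r + gamma) if in_e else (f(x) <= r - gamma)
                for x, r, in_e in zip(points, witnesses, pattern)
            )
        if not any(realizes(f) for f in functions):
            return False
    return True


# Hypothetical class: all functions on {0, 1} taking values in {-1, +1}.
points = [0, 1]
functions = [lambda x, a=a, b=b: a if x == 0 else b
             for a in (-1.0, 1.0) for b in (-1.0, 1.0)]

shattered_at_1 = gamma_shatters(functions, points, [0.0, 0.0], 1.0)
shattered_at_1_5 = gamma_shatters(functions, points, [0.0, 0.0], 1.5)
```

With witnesses (0, 0) the two points are γ-shattered for γ = 1 but not for γ = 1.5, which shows how the dimension depends on the scale γ.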
As mentioned above, an analogue of Theorem 2 can be proved for bounded regression with the square and absolute losses, if condition 4) is replaced by fat_H(γ) < +∞
for all γ > 0. We end by noting that it is an open question whether the above results
hold for loss functions other than the square and absolute losses.
3.2 Stability and Learnability
In this section we show that learnability of a hypotheses space H is equivalent to the
stability (in a suitable sense) of ERM on H . It is useful to introduce the following
notation. For a given loss function ℓ, let L : F × Z → [0, ∞) be defined as L( f , z) =
ℓ( f (x), y), for f ∈ F and z = (x, y) ∈ Z. Moreover, let z_n^i be the training set z_n with the
i-th point removed. With the above notation, the relevant notion of stability is given by
the following definition.
Definition 8 A learning algorithm A on H is uniformly CVloo stable if there exist
sequences (β_n, δ_n)_{n∈N} such that β_n → 0, δ_n → 0 and
\[
\sup_{\rho} \rho^n \big\{ |L(A_{z_n^i}, z_i) - L(A_{z_n}, z_i)| \le \beta_n \big\} \ge 1 - \delta_n, \tag{2}
\]
for all i ∈ {1, . . . , n}.
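The quantity controlled in Definition 8 can be computed on a concrete sample. The sketch below assumes ERM over a finite grid of threshold classifiers with the misclassification loss, with ties broken in favour of the smallest threshold; the data and the tie-breaking rule are illustrative choices, not taken from the original.

```python
def erm_threshold(sample, thresholds):
    """ERM over f_t(x) = 1{x > t}; ties are broken by the first (smallest) threshold."""
    def risk(t):
        return sum(int(x > t) != y for x, y in sample) / len(sample)
    return min(thresholds, key=risk)


def loo_differences(sample, thresholds):
    """|L(A_{z_n^i}, z_i) - L(A_{z_n}, z_i)| for each i, misclassification loss."""
    t_full = erm_threshold(sample, thresholds)
    diffs = []
    for i, (x, y) in enumerate(sample):
        reduced = sample[:i] + sample[i + 1:]      # z_n with the i-th point removed
        t_loo = erm_threshold(reduced, thresholds)
        loss_loo = int(int(x > t_loo) != y)
        loss_full = int(int(x > t_full) != y)
        diffs.append(abs(loss_loo - loss_full))
    return diffs


thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
sample = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
diffs = loo_differences(sample, thresholds)
```

Here only the second point changes its loss when left out, because removing it lets a smaller threshold attain zero empirical risk. Definition 8 asks that such changes be small with high probability as n grows, uniformly over distributions.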
Before illustrating the implications of the above definition to learnability we first add
a few comments and historical remarks. We note that, in a broad sense, stability refers
to a quantification of the continuity of a map with respect to its input. The key role of
stability in learning has long been advocated on the basis of the interpretation of supervised learning as an ill-posed inverse problem [11]. Indeed, the concept of stability
is central in the theory of regularization of ill-posed problems [8]. A first quantitative
connection between the performance of a symmetric learning algorithm² and a notion
of stability is derived in the seminal paper [4]. Here a notion of stability, called uniform
stability, is shown to be sufficient for consistency. If we let z_n^{i,u} be the training set z_n with
the i-th point replaced by u, uniform stability is defined as
\[
|L(A_{z_n^{i,u}}, z) - L(A_{z_n}, z)| \le \beta_n, \tag{3}
\]
for all z_n ∈ Z^n, u, z ∈ Z and i ∈ {1, . . . , n}. A thorough investigation of weaker
notions of stability is given in [10]. Here, many different notions of stability are shown
to be sufficient for consistency (and learnability) and the question is raised of whether
stability (of ERM on H ) can be shown to be necessary for learnability of H . In
particular a definition of CV stability for ERM is shown to be necessary and sufficient
for learnability in a Probably Approximately Correct (PAC) setting, that is when Y =
{0, 1} and for some h∗ ∈ H, y = h∗(x), for all x ∈ X. Finally, Definition 8 of CVloo
stability is given and studied in [11]. When compared to uniform stability, we see
that: 1) the “leave one out” training set z_n^i is considered instead of the “replace one”
training set z_n^{i,u}; 2) the error is evaluated on the point z_i which is left out, rather than
any possible z ∈ Z; finally 3) the condition is assumed to hold for a fraction 1 − δ_n of
training sets (which becomes increasingly larger as n increases) rather than uniformly
for any training set z_n ∈ Z^n.
The importance of CVloo stability is made clear by the following result.
Theorem 3 Let Y = {0, 1} and ℓ be the misclassification loss function. Then the
following conditions are equivalent,
1. H is uniformly learnable,
2. ERM on H is CVloo stable.
The proof of the above result is given in [11] and is based on essentially two steps.
The first is proving that CVloo stability of ERM on H implies that ERM is uniformly
consistent. The second is showing that if H is a uGC class then ERM on H is CVloo
stable. Theorem 3 then follows from Theorem 2 (since uniform consistency of ERM
on H and H being uGC are equivalent).
Both steps in the above proof can be generalized to regression as long as the loss
function is assumed to be bounded. The latter assumption holds for example if the
loss function satisfies a suitable Lipschitz condition and Y is compact (so that H
is a set of uniformly bounded functions). However, generalizing Theorem 3 beyond
classification requires the generalization of Theorem 2. For the square and absolute
loss functions and Y compact, the characterization of learnability in terms of the γ-fat
shattering dimension can be used. It is an open question whether there is a more direct
way to show that learnability is sufficient for stability, independently of Theorem 2, and
whether the above results extend to more general classes of loss functions. We will see a
partial answer to this question in Section 4.
² We say that a learning algorithm A is symmetric if it does not depend on the order of the points in z_n.
4 Learnability in the General Learning Setting
In the previous sections we focused our attention on supervised learning. Here we ask
whether the results we discussed extend to the so-called general learning setting [16].
Let (Z , ρ ) be a probability space and F a measurable space. A loss function is
a map L : F × Z → [0, ∞), such that L( f , ·) is measurable for all f ∈ F . We are
interested in the problem of minimizing the expected risk,
\[
\inf_{\mathcal{H}} \mathcal{E}_\rho, \qquad \mathcal{E}_\rho(f) = \int_{Z} L(f, z)\, d\rho(z), \tag{4}
\]
when ρ is fixed but known only through a training set, zn = (z1 , . . . , zn ) ∈ Z n sampled
identically and independently according to ρ . Definition 2 of a learning algorithm on
H applies as is to this setting and ERM on H is defined by the minimization of the
empirical risk
\[
\mathcal{E}_{z_n}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f, z_i).
\]
While general learning is close to supervised learning, there are important differences.
The data space Z has no natural decomposition, and F need not be a space of functions. Indeed, F and Z are related only via the loss function L. For our discussion
it is important to note that the distinction between F and the hypotheses space H
becomes blurred. In supervised learning F is the largest set of functions for which
Problem (1) is well defined (measurable functions in Y^X). The choice of a hypotheses space corresponds intuitively to a more “manageable” function space. In general learning
the choice of F is more arbitrary; as a consequence, the definition of a universal hypotheses space is less clear. The setting is too general for an analogue of the no free
lunch theorem to hold. Given these premises, in what follows we will simply identify
F = H and consider the question of learnability, noting that the definition of uniform
learnability extends naturally to general learning. We present two sets of ideas. The
first, due to Vapnik, focuses on a more restrictive notion of consistency of ERM. The
second investigates the characterization of uniform learnability in terms of stability.
4.1 Vapnik’s Approach and Non Trivial Consistency
The extension of the classical results characterizing learnability in terms of complexity
measures is tricky. Since H is not a function space, the definitions of VC or Vγ dimensions do not make sense. A possibility is to consider the class L ◦ H := {z ∈ Z ↦
L( f , z) for some f ∈ H } and the corresponding VC dimension (if L is binary valued)
or Vγ dimension (if L is real valued). Classic results about the equivalence between
the uGC property and finite complexity apply to the class L ◦ H . Moreover, uniform
learnability can be easily proved if L ◦ H is a uGC class. On the contrary, the reverse
implication does not hold in the general learning setting. A counterexample is given
in [16] (Sec. 3.1) showing that it is possible to design hypotheses classes with infinite
VC (or Vγ ) dimension, which are uniformly learnable with ERM. The construction is
as follows. Consider an arbitrary set H and a loss L for which the class L ◦ H has
infinite VC (or Vγ) dimension. Define a new space H̃ := H ∪ {h̃} by adding to H an
element h̃ such that L(h̃, z) ≤ L(h, z) for all z ∈ Z and h ∈ H³. The space L ◦ H̃ has
infinite VC, or Vγ, dimension and is trivially learnable by ERM, which is constant and
coincides with h̃ for each probability measure ρ. The previous counterexample proves
that learnability, and in particular learnability via ERM, does not imply finite VC or Vγ
dimension. To avoid these cases of “trivial consistency” and to restore the equivalence
between learnability and finite dimension, the following stronger notion of consistency
for ERM has been introduced by Vapnik [16].
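Definition 9 ERM on H is strictly uniformly consistent if and only if
\[
\forall \varepsilon > 0, \qquad \lim_{n \to \infty} \sup_{\rho} \rho^n \Big( \inf_{\mathcal{H}_c} \mathcal{E}_{z_n}(f) - \inf_{\mathcal{H}_c} \mathcal{E}_\rho(f) > \varepsilon \Big) = 0,
\]
where H_c = { f ∈ H : Eρ( f ) ≥ c}.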
The following result characterizes strictly uniform consistency in terms of the uGC property of the class L ◦ H (see Theorem 3.1 and its Corollary in [16]).
Theorem 4 Let B > 0 and assume L( f , z) ≤ B for all f ∈ H and z ∈ Z . Then the
following conditions are equivalent,
1. ERM on H is strictly uniformly consistent,
2. L ◦ H is a uniform one-sided Glivenko-Cantelli class.
The definition of one-sided Glivenko-Cantelli class simply corresponds to omitting the
absolute value in Definition 6.
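The “trivial consistency” construction that motivates Definition 9 can be illustrated on a finite toy problem. In the sketch below, everything concrete is an assumption made for illustration: Z has four points, the class consists of all non-empty subsets of Z with loss 0.5 + 0.5·1{z ∈ h}, and the added element h̃ has loss 0. The domination is taken to be strict so that the empirical risk minimizer is unique; the original construction only requires L(h̃, z) ≤ L(h, z).

```python
import random
from itertools import combinations

Z = (0, 1, 2, 3)
# Hypothetical class H: all non-empty subsets of Z.
H = [frozenset(c) for r in range(1, len(Z) + 1) for c in combinations(Z, r)]
H_TILDE = "h_tilde"   # the added dominating element


def loss(h, z):
    """L(h_tilde, z) = 0; otherwise 0.5 + 0.5 * 1{z in h}, which is always >= 0.5."""
    if h == H_TILDE:
        return 0.0
    return 0.5 + 0.5 * float(z in h)


def erm_general(hypotheses, sample):
    """ERM in general learning: minimize the average of L(h, z_i) over h."""
    return min(hypotheses, key=lambda h: sum(loss(h, z) for z in sample) / len(sample))


rng = random.Random(1)
extended = H + [H_TILDE]
outputs = [erm_general(extended, [rng.choice(Z) for _ in range(5)])
           for _ in range(20)]
```

Whatever the sample, ERM on the extended class returns h̃, so it is consistent for every distribution even though the loss class built on H remains as rich as before.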
4.2 Stability and Learnability for General Learning
In this section we discuss ideas from [14] extending the stability approach to general
learning. The following definitions are relevant.
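Definition 10 A uniform Asymptotic ERM (AERM) algorithm A on H is a learning
algorithm such that
\[
\forall \varepsilon > 0, \qquad \lim_{n \to \infty} \sup_{\rho} \rho^n \Big( \big\{ z_n : \mathcal{E}_{z_n}(A_{z_n}) - \inf_{\mathcal{H}} \mathcal{E}_{z_n} > \varepsilon \big\} \Big) = 0.
\]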
Definition 11 A learning algorithm A on H is uniformly replace one (RO) stable if
there exists a sequence β_n → 0 such that
\[
\frac{1}{n} \sum_{i=1}^{n} |L(A_{z_n^{i,u}}, z) - L(A_{z_n}, z)| \le \beta_n
\]
for all z_n ∈ Z^n and u, z ∈ Z.
³ Note that this construction is not possible in classification or in regression with the square loss.
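As a worked illustration of Definition 11, consider a toy general learning problem that is assumed here and does not appear in the original: Z = H = [0, 1] and L(h, z) = (h − z)^2, for which ERM is the sample mean. Replacing one point moves the mean by at most 1/n, and |L(h′, z) − L(h, z)| = |h′ − h| · |h′ + h − 2z| ≤ 2|h′ − h|, so the replace-one average is at most β_n = 2/n. The sketch below computes the average numerically, with u taken to be a single replacement point.

```python
import random


def mean_erm(sample):
    """ERM for L(h, z) = (h - z)^2 over h in [0, 1]: the sample mean."""
    return sum(sample) / len(sample)


def sq_loss(h, z):
    return (h - z) ** 2


def replace_one_average(sample, u, z, algorithm=mean_erm, loss=sq_loss):
    """(1/n) * sum_i |L(A(z_n^{i,u}), z) - L(A(z_n), z)| for given u and z."""
    n = len(sample)
    base = loss(algorithm(sample), z)
    total = 0.0
    for i in range(n):
        replaced = sample[:i] + [u] + sample[i + 1:]   # i-th point replaced by u
        total += abs(loss(algorithm(replaced), z) - base)
    return total / n


rng = random.Random(0)
n = 50
sample = [rng.random() for _ in range(n)]
# Evaluate at the extreme choices of the replacement point u and test point z.
values = [replace_one_average(sample, u, z) for u in (0.0, 1.0) for z in (0.0, 1.0)]
```

In this example the bound 2/n holds for every training set, so the sample mean is uniformly RO stable in the sense of Definition 11.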
Note that the above definition is close to that of uniform stability (3), although the latter
turns out to be a stronger condition. The importance of the above definitions is made
clear by the following result.
Theorem 5 Let B > 0 and assume L( f , z) ≤ B for all f ∈ H and z ∈ Z . Then the
following conditions are equivalent,
1. H is uniformly learnable,
2. there exists an AERM algorithm on H which is RO stable.
As mentioned in Remark 1, Theorem 3 holds not only for exact minimizers of the
empirical risk, but also for AERM. In this view, there is a subtle difference between
Theorem 3 and Theorem 5. In supervised learning, Theorem 3 shows that uniform
learnability implies that every ERM (AERM) is stable, while in general learning, Theorem 5 shows that uniform learnability implies the existence of a stable AERM (whose
construction is not explicit).
The proof of the above result is given in Theorem 7 in [14]. The hard part of the
proof is showing that learnability implies existence of an RO stable AERM. This part of
the proof is split into two steps showing that: 1) if there is a uniformly consistent algorithm A, then there exists a uniformly consistent AERM A′ (Lemma 20 and Theorem
10); 2) every uniformly consistent AERM is also RO stable (Theorem 9). Note that the
results in [14] are given in expectation and with some quantification of how different
convergence rates are related. Here we give results in probability to be consistent with
the rest of the paper and state only asymptotic results to simplify the presentation.
5 Discussion
In this paper we reviewed several results concerning learnability of a hypotheses space.
Extensions of these ideas can be found in [5] (and references therein) for multi-category
classification, and in [13] for sequential prediction. It would be interesting to devise
constructive proofs in general learning suggesting how stable learning algorithms can
be designed. Moreover, it would be interesting to study universal consistency and
learnability in the case of samples from non-stationary processes.
References
[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997.
[2] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations.
Cambridge University Press, Cambridge, 1999.
[3] P. Bartlett, P. Long, and R. Williamson. Fat-shattering and the learnability of
real-valued functions. Journal of Computer and System Sciences, 52:434–452,
1996.
[4] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine
Learning Research, 2:499–526, 2002.
[5] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. Journal of Machine Learning Research - Proceedings Track, 19:207–232, 2011.
[6] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of mathematics. Springer, New York, 1996.
[7] R. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko-Cantelli classes.
J. Theoret. Prob., 4:485–510, 1991.
[8] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems,
volume 375 of Mathematics and its Applications. Kluwer Academic Publishers
Group, Dordrecht, 1996.
[9] Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of
probabilistic concepts. In Computational learning theory and natural learning
systems, Vol. I, Bradford Book, pages 289–329. MIT Press, Cambridge, MA,
1994.
[10] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical Report TR-2002-03, Department of Computer Science, The
University of Chicago, 2002.
[11] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is
sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161–193, 2006.
[12] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419–422, 2004.
[13] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Beyond regret. Journal of Machine Learning Research - Proceedings Track, 19:559–
594, 2011.
[14] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability
and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, 2010.
[15] I. Steinwart and A. Christmann. Support vector machines. Information Science
and Statistics. Springer, New York, 2008.
[16] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New
York, 1995.
[17] V. N. Vapnik and A. Y. Chervonenkis. Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal
solution from empirical data. Avtomat. i Telemeh., (2):42–53, 1971.