Industrial Mathematics Institute
Research Report 2004:10
Department of Mathematics
University of South Carolina
ON MATHEMATICAL METHODS OF LEARNING
1. Introduction
We discuss in this paper some mathematical aspects of supervised learning theory. Supervised
learning, or learning-from-examples, refers to the process of building, from available
data of inputs x_i and outputs y_i, i = 1, . . . , m, a function that best represents the relation
between the inputs x ∈ X and the corresponding outputs y ∈ Y. The central
question is how well this function estimates the outputs for general inputs. The standard
mathematical framework for the setting of the above learning problem is the following ([CS],
[PS]).
Let X ⊂ Rd , Y ⊂ R be Borel sets and let ρ be a Borel probability measure on Z = X ×Y .
For f : X → Y define the error
E(f) := E_ρ(f) := ∫_Z (f(x) − y)^2 dρ.
Here f_ρ denotes the regression function of ρ, f_ρ(x) := ∫_Y y dρ(y|x); it minimizes the error E(f).
For a compact set F in a Banach space B the Kolmogorov n-width is defined by
d_n(F, B) := inf_L sup_{f∈F} inf_{g∈L} ‖f − g‖_B,
where inf_L is taken over all n-dimensional subspaces of B. In other words the Kolmogorov
n-width gives the best possible error in approximating a compact set F by n-dimensional
linear subspaces. So, first of all we need to make an assumption on the unknown function
fρ . Following the approximation theory approach we make this assumption in the form
fρ ∈ W , where W is a given class of functions. For instance, we may assume that fρ has
some smoothness. The next step is to find an algorithm for constructing an estimator fz
that is optimal for the class W. By optimal we mean one that provides the minimal error
‖f − f_z‖ for all f ∈ W with high probability. The optimization problem naturally breaks
into two parts: upper estimates and lower estimates. We discuss only upper estimates in
this paper. In order to prove upper estimates we need to decide what should be the form of
an estimator fz . In other words we need to specify the hypothesis space H (see [CS], [PS])
where an estimator fz comes from. We may also call H an approximation space.
The next question is how to build fz ∈ H. In Section 2 we will discuss the method that
takes
f_{z,H} = arg min_{f∈H} E_z(f),
where
E_z(f) := (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2
is the empirical error of f . This fz,H is called the empirical optimum. Section 2 contains a
discussion of known results from [CS] and some new results. Proofs of new results in Section
2 are based on the technique developed in [CS].
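As a concrete illustration (a sketch of ours in Python, not part of the paper: the two-parameter family H, the data model, and all names below are our own toy choices), for a finite hypothesis class the empirical optimum can be computed by direct search:

import numpy as np

# Sketch (ours): the empirical optimum f_{z,H} = argmin_{f in H} E_z(f)
# over a finite hypothesis class H, found by direct search.

def empirical_error(f, x, y):
    # E_z(f) = (1/m) sum_i (f(x_i) - y_i)^2
    return np.mean((f(x) - y) ** 2)

def empirical_optimum(hypotheses, x, y):
    return min(hypotheses, key=lambda f: empirical_error(f, x, y))

def make_f(c0, c1):
    # a toy two-parameter family standing in for H
    return lambda t: c0 + c1 * np.sin(2 * np.pi * t)

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(0.0, 1.0, m)                               # inputs x_i
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)  # outputs y_i
grid = np.linspace(-2.0, 2.0, 9)
H = [make_f(c0, c1) for c0 in grid for c1 in grid]
f_zH = empirical_optimum(H, x, y)
print("E_z(f_{z,H}) =", empirical_error(f_zH, x, y))

For the infinite hypothesis spaces studied below the minimization is over a compact set rather than a finite list, but the principle is the same.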
In Section 3 we assume that ρ_X is an absolutely continuous measure with density µ(x):
dρ_X = µdx. We study the estimation of a new function f_µ := f_ρµ instead of the regression
function f_ρ. As part of the motivation for this new setting we discuss a practical example from finance.
Let x = (x_1, . . . , x_d) ∈ R^d be the information used by a bank to decide whether or not
to give a mortgage to a client. For instance, x_1 is income, x_2 home value, x_3 mortgage
value, x_4 interest rate. Let y be the total profit (loss if negative) that the bank gains from
this mortgage. Then fρ (x) stands for the expected (average) profit of the bank from clients
with the same information x. A clear goal of the bank is to find S ⊂ Rd that maximizes
∫_S f_ρ(x) dρ_X.
So, we look for an estimator for f_µ instead of f_ρ. In the above example it is sufficient to
estimate ‖f_µ − f_z‖_{L_1}. Suppose we have found f_z such that ‖f_µ − f_z‖_{L_1} ≤ ǫ with probability ≥ 1 − δ.
Define
Sz := {x : fz (x) ≥ 0}.
Then, with the above estimate on the probability, and with S_o denoting a set that maximizes ∫_S f_ρ(x) dρ_X, we have
∫_{S_z} f_ρ(x) dρ_X = ∫_{S_z} f_µ(x) dx ≥ ∫_{S_z} f_z(x) dx − ǫ
≥ ∫_{S_o} f_z(x) dx − ǫ ≥ ∫_{S_o} f_ρ(x) dρ_X − 2ǫ.
Therefore, the empirical set S_z provides an optimal profit within an error 2ǫ with probability
≥ 1 − δ.
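Computationally the decision rule is a simple plug-in step. Here is a toy sketch (ours; the grid and the stand-in values of f_z are our own illustration, while in practice f_z would come from an estimator of f_µ such as those constructed in Section 3):

import numpy as np

# Toy sketch (ours): given values of an estimate f_z of f_mu on a grid of X,
# form the plug-in set S_z = {x : f_z(x) >= 0} and its estimated total profit.

x_grid = np.linspace(0.0, 1.0, 1001)          # discretized input space X
f_z_vals = 0.5 - 3.0 * (x_grid - 0.3) ** 2    # stand-in for an estimate of f_mu
S_z = f_z_vals >= 0.0                         # indicator of S_z on the grid

dx = x_grid[1] - x_grid[0]
profit_estimate = np.sum(f_z_vals[S_z]) * dx  # Riemann sum of f_z over S_z
print("estimated profit over S_z:", profit_estimate)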
In the above example it was convenient to measure the error in the L_1 norm. However,
it is usually simpler to estimate the L_2 error of approximation. We note that for X of finite
Lebesgue measure one has ‖g‖_{L_1(X)} ≤ |X|^{1/2}‖g‖_{L_2(X)}, so an L_2 estimate implies an L_1 estimate.
By C and c we denote absolute positive constants and by C(·), c(·), and A_0(·) we denote
positive constants that are determined by their arguments. We often have error estimates
of the form (ln m/m)^α that hold for m ≥ 2. We could write these estimates in the form,
say, (ln(m + 1)/m)^α to make them valid for all m ∈ N. However, we use the first variant
throughout the paper for two reasons: it keeps the notation simpler, and we are interested
in the asymptotic behavior of the error.
2. Estimating fρ
Let ρ be a Borel probability measure on Z = X × Y. If ξ is a random variable (a real-valued
function on the probability space Z) then denote
E(ξ) := ∫_Z ξ dρ;  σ^2(ξ) := ∫_Z (ξ − E(ξ))^2 dρ.
The following proposition gives a relation between E(f) − E(f_ρ) and ‖f − f_ρ‖_{L_2(ρ_X)}.
Proposition 2.1 [CS]. For every f : X → Y
E(f) − E(f_ρ) = ∫_X (f(x) − f_ρ(x))^2 dρ_X.
Theorem A [CS]. Let M > 0 and f : X → Y be such that |f(x) − y| ≤ M a.e. Then, for
all ǫ > 0, the defect function L_z(f) := E(f) − E_z(f) satisfies
(2.1) Prob_{z∈Z^m}{|L_z(f)| ≤ ǫ} ≥ 1 − 2 exp(−mǫ^2/(2(σ^2 + M^2ǫ/3))),
where σ^2 := σ^2((f(x) − y)^2). Taking ξ(z) := (f(x) − y)^2, noting that E(ξ) = E(f), and
applying Bernstein's inequality we get (2.1).
Let X be a compact subset of R^d. Denote by C(X) the space of functions continuous on X
with the norm
‖f‖_∞ := sup_{x∈X} |f(x)|.
The paper [CS] indicates the importance of a characteristic of a class W closely related to
the concept of entropy numbers. For a compact subset W of a Banach space B we define
the entropy numbers as follows:
ǫ_n(W, B) := inf{ǫ : ∃ f_1, . . . , f_{2^n} ∈ W : W ⊂ ∪_{j=1}^{2^n} (f_j + ǫU(B))},
where U(B) is the unit ball of the Banach space B. We denote by N(W, ǫ, B) the covering
number, that is, the minimal number of balls of radius ǫ needed to cover W. In this paper
we mostly take as the Banach space B the space C := C(X) of continuous functions on a
compact X ⊂ R^d. We use the abbreviated notations
N(W, ǫ) := N(W, ǫ, C);  ǫ_n(W) := ǫ_n(W, C).
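For calibration we recall a standard volumetric fact (not stated in the paper): for the unit ball U(X_n) of an n-dimensional normed space X_n one has
2^{−k/n} ≤ ǫ_k(U(X_n), X_n) ≤ 3 · 2^{−k/n},  k = 1, 2, . . . ,
the lower bound by comparing volumes of 2^k covering balls with the volume of U(X_n), the upper bound from the packing estimate N(U(X_n), ǫ) ≤ (3/ǫ)^n. Thus an assumption ǫ_n(W) ≤ Cn^{−r} below says that W is far "thinner" than any fixed finite-dimensional ball: its metric size decays only polynomially in the entropy index.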
Theorem B [CS]. Let W be a compact subset of C(X). Assume that for all f ∈ W ,
f : X → Y is such that |f (x) − y| ≤ M a.e. Then, for all ǫ > 0
(2.2) Prob_{z∈Z^m}{sup_{f∈W} |L_z(f)| ≤ ǫ} ≥ 1 − N(W, ǫ/(8M)) · 2 exp(−mǫ^2/(2(σ^2 + M^2ǫ/3))).
Here σ^2 := σ^2(W) := sup_{f∈W} σ^2((f(x) − y)^2).
We will give a proof of this theorem for completeness. We use the following simple
relation.
Proposition 2.2 [CS]. If |fj (x) − y| ≤ M a.e. for j = 1, 2, then
|L_z(f_1) − L_z(f_2)| ≤ 4M‖f_1 − f_2‖_∞.
Theorem C [CS]. Let H be a compact subset of C(X). Assume that for all f ∈ H,
f : X → Y is such that |f (x) − y| ≤ M a.e. Then, for all ǫ > 0
Prob_{z∈Z^m}{E(f_{z,H}) − E(f_H) ≤ ǫ} ≥ 1 − N(H, ǫ/(8M)) · 2 exp(−mǫ^2/(8(4σ^2 + M^2ǫ/3))).
Here f_H denotes a function minimizing E(f) over H.
Assume that W satisfies
(2.3) ǫ_n(W) ≤ C_1n^{−r}, n = 1, 2, . . . , W ⊂ C_1U(C).
Then
(2.4) N(W, ǫ) ≤ 2^{(C_2/ǫ)^{1/r}}.
Substituting this in Theorem C and optimizing over ǫ we get the following result (Theorem 2.1): for ǫ = Am^{−r/(1+2r)}, A ≥ A_0(M, r),
(2.5) Prob_{z∈Z^m}{E(f_{z,W}) − E(f_W) ≤ Am^{−r/(1+2r)}} ≥ 1 − exp(−c(M)A^2m^{1/(1+2r)}).
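For orientation, here is the balancing computation behind this choice of ǫ (a sketch with constants suppressed). By (2.4), the failure probability in Theorem C is at most
exp((C/ǫ)^{1/r} ln 2 − cmǫ^2),
and the two exponents balance when
ǫ^{−1/r} ≍ mǫ^2,  i.e.  ǫ^{−(2+1/r)} ≍ m,  i.e.  ǫ ≍ m^{−r/(1+2r)}.
Taking ǫ = Am^{−r/(1+2r)} with A large makes the second exponent dominate, and then mǫ^2 = A^2m^{1/(1+2r)}, which is exactly the exponent appearing in (2.5).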
We will now impose some extra restrictions on W and will get in return better estimates
for E(f_z) − E(f_W). We begin with the one from [CS] (see Theorem C∗ and Remark 13 there).
Theorem C∗ [CS]. Suppose that either H is a compact and convex subset of C(X) or H
is a compact subset of C(X) and fρ ∈ H. Assume that for all f ∈ H, f : X → Y is such
that |f (x) − y| ≤ M a.e. Then, for all ǫ > 0
Prob_{z∈Z^m}{E(f_{z,H}) − E(f_H) ≤ ǫ} ≥ 1 − N(H, ǫ/(24M)) · 2 exp(−mǫ/(288M^2)).
Under the assumption E(f_H) − E(f_ρ) ≤ Kǫ in place of convexity the same conclusion holds in the form
Prob_{z∈Z^m}{E(f_{z,H}) − E(f_H) ≤ ǫ} ≥ 1 − N(H, ǫ/(24M)) · 2 exp(−mǫ/C(M, K)).
We use the assumption E(fH ) − E(fρ ) ≤ Kǫ instead of convexity and we use the following
lemma instead of Lemma 2.1.
Lemma 2.2. For any f we have, by Proposition 2.1,
‖f − f_ρ‖^2_{L_2(ρ_X)} = E(f) − E(f_ρ) = E(f) − E(f_H) + E(f_H) − E(f_ρ).
Theorem 2.2. Assume that W satisfies (2.3). Suppose that either W is convex or fρ ∈ W .
Then for A ≥ A_0(M, r)
Prob_{z∈Z^m}{E(f_{z,W}) − E(f_W) ≤ Am^{−r/(1+r)}} ≥ 1 − exp(−c(M)Am^{1/(1+r)}).
The proof of this theorem repeats the proof of Theorem 2.1 with the following changes:
we set ǫ = Am^{−r/(1+r)} and use Theorem C∗ to estimate E(f_{z,W}) − E(f_W).
We continue to consider classes satisfying a stronger assumption than (2.3). We first use
the concept of the Kolmogorov width to impose an extra condition on W. Let us assume
that W satisfies the following estimate for the Kolmogorov widths:
(2.6) d_n(W, C) ≤ C_2n^{−r}, n = 1, 2, . . . .
By [C] this implies ǫ_n(W) ≤ Cn^{−r}, and therefore for this class we have the estimate (2.5)
as above in Theorem 2.1. We will prove a better estimate than (2.5).
Theorem 2.3. Let fρ ∈ W and let W satisfy (2.6). Then there exists an estimator fz such
that for A ≥ A0 (M, r)
(2.7) Prob_{z∈Z^m}{E(f_z) − E(f_ρ) ≤ CA(ln m/m)^{2r/(1+2r)}} ≥ 1 − exp(−c(M)A(m(ln m)^{2r})^{1/(1+2r)}).
Proof. Let {L_n} be a sequence of optimal (or near optimal) subspaces for W,
dim L_n = n. Then for any f ∈ W there is a ϕ_n ∈ L_n such that ‖f − ϕ_n‖_∞ ≤ C_1n^{−r}.
It is clear that ‖ϕ_n‖_∞ ≤ 2‖f‖_∞ ≤ 2C_1. We now consider in place of W from the above
argument that gave Theorem 2.1 the set V_n := 2C_1U(C) ∩ L_n. In other words we take
V_n as the hypothesis space. Then it is well known that
N(V_n, ǫ) ≤ (C_3/ǫ)^n.
Choosing ǫ = A(ln m/m)^{2r/(1+2r)} and n = [ǫ^{−1/(2r)}] + 1 we now estimate E(f_{z,V_n}) − E(f_ρ). Let
f^∗ ∈ V_n be such that
‖f_ρ − f^∗‖_∞ ≤ C_1n^{−r} ≤ CA^{1/2}(ln m/m)^{r/(1+2r)}.
Then
(2.8) E(f^∗) − E(f_ρ) = ∫_X (f^∗(x) − f_ρ(x))^2 dρ_X ≤ C^2A(ln m/m)^{2r/(1+2r)}.
We have
0 ≤ E(f_{z,V_n}) − E(f_ρ) = E(f_{z,V_n}) − E(f^∗) + E(f^∗) − E(f_ρ).
Next, we assume that for each n there exists a collection of n-dimensional subspaces L_n^j, j = 1, . . . , N_n, with
(2.11) N_n ≍ n^{an},
such that for each f ∈ W there are a j and a ϕ_n ∈ L_n^j with ‖f − ϕ_n‖_∞ ≤ C_1n^{−r}.
Then by [T5]
ǫ_n(W) ≤ C_2(ln n/n)^r, n = 2, 3, . . . .
For this class we have an estimate similar to (2.5) from above with m replaced by m/ln m.
It is clear that a class satisfying (2.11) is wider than a class satisfying d_n(W, C) ≤ C_1n^{−r}.
We will prove an estimate for W satisfying (2.11) similar to (2.7).
Theorem 2.4. Let f_ρ ∈ W and let W satisfy (2.11). Then there exists an estimator f_z
such that for A ≥ A_0(M, r, a)
Prob_{z∈Z^m}{E(f_z) − E(f_ρ) ≤ CA(ln m/m)^{2r/(1+2r)}} ≥ 1 − exp(−c(M)A(m(ln m)^{2r})^{1/(1+2r)}).
Proof. By (2.11), for each f ∈ W there are a j and a ϕ_n ∈ L_n^j
such that ‖f − ϕ_n‖_∞ ≤ C_1n^{−r}. It is clear that ‖ϕ_n‖_∞ ≤ 2‖f‖_∞ ≤ 2C_1. We now consider
the following set
U_n := ∪_{j=1}^{N_n} V_n^j,  V_n^j := 2C_1U(C) ∩ L_n^j.
Then it is clear that
N(U_n, ǫ) ≤ N_n(C_3/ǫ)^n.
We construct an estimator for f_ρ ∈ W by taking f_z := f_{z,U_n}.
Choosing ǫ = A(ln m/m)^{2r/(1+2r)} and n = [ǫ^{−1/(2r)}] + 1 we now estimate E(f_{z,U_n}) − E(f_ρ). Let
f^∗ ∈ U_n be such that
‖f_ρ − f^∗‖_∞ ≤ C_1n^{−r} ≤ C_1A^{1/2}(ln m/m)^{r/(1+2r)}.
Then
(2.12) E(f_{U_n}) − E(f_ρ) ≤ E(f^∗) − E(f_ρ) = ∫_X (f^∗(x) − f_ρ(x))^2 dρ_X ≤ C_1^2A(ln m/m)^{2r/(1+2r)}.
We have
0 ≤ E(f_{z,U_n}) − E(f_ρ) = E(f_{z,U_n}) − E(f_{U_n}) + E(f_{U_n}) − E(f_ρ).
Taking into account that f_{z,U_n} ∈ U_n and E(f_{U_n}) − E(f_ρ) ≤ C_1^2ǫ, we get E(f_{z,U_n}) − E(f_ρ) ≤ Cǫ with probability
≥ 1 − exp(−c(M)A(m(ln m)^{2r})^{1/(1+2r)}), A ≥ A_0(M, r, a).
This completes the proof of Theorem 2.4.
Let us discuss an example of how Theorem 2.4 may be applied. We will consider n-term
approximations with regard to a given system Ψ. Assume that the system Ψ = {ψ_j}_{j=1}^∞
satisfies the following condition:
(VP) There exist three positive constants A_i, i = 1, 2, 3, and a sequence {n_k}_{k=1}^∞, n_{k+1} ≤
A_1n_k, k = 1, 2, . . . , such that there is a sequence of de la Vallée-Poussin type operators
P_k with the properties
P_k(ψ_j) = λ_{k,j}ψ_j,
λ_{k,j} = 1 for j = 1, . . . , n_k;  λ_{k,j} = 0 for j > A_2n_k,
‖P_k‖_{C→C} ≤ A_3, k = 1, 2, . . . .
Denote
σ_n(f, Ψ) := inf_{k_1,...,k_n; c_1,...,c_n} ‖f − Σ_{j=1}^n c_jψ_{k_j}‖_∞
and
σ_n(W, Ψ) := sup_{f∈W} σ_n(f, Ψ).
Theorem 2.5. Let f_ρ ∈ W and let W satisfy the following two conditions:
σ_n(W, Ψ) ≤ C_1n^{−r};
E_n(W, Ψ) := sup_{f∈W} inf_{c_1,...,c_n} ‖f − Σ_{j=1}^n c_jψ_j‖_∞ ≤ C_2n^{−b},
where Ψ is a (VP)-system. Then there exists an estimator f_z such that for A ≥ A_0(M, r, b)
Prob_{z∈Z^m}{E(f_z) − E(f_ρ) ≤ CA(ln m/m)^{2r/(1+2r)}} ≥ 1 − exp(−c(M)A(m(ln m)^{2r})^{1/(1+2r)}).
Proof. Define
Ψ(n) := {f : f = Σ_{j∈Λ} c_jψ_j, |Λ| ≤ n, Λ ⊂ [1, [n^{r/b}] + 1], ‖f‖_∞ ≤ 2C_1}.
As an estimator f_z we take f_{z,Ψ(n)} with n = [A^{−1/(2r)}(m/ln m)^{1/(1+2r)}] + 1. For this n we
consider the following family of n-dimensional subspaces:
X_Λ := {f : f = Σ_{j∈Λ} c_jψ_j, |Λ| = n},  Λ ⊂ [1, [n^{r/b}] + 1].
Theorem 2.6. For a given collection W(L, α, 1/2), α > 0, there exists an estimator f_z
such that if f_ρ ∈ W^r(L), r ∈ [α, 1/2] then for A ≥ A_0(M, α)
Prob_{z∈Z^m}{E(f_z) − E(f_ρ) ≤ CA(ln m/m)^{2r/(1+2r)}} ≥ 1 − m^{C_1(M)(C_2(M)−A)}.
Proof. We use the notations from the proof of Theorem 2.3. We define an estimator fz by
the formula
f_z := f_{z,V_k}
with
k = arg min_{1≤n≤m} (E_z(f_{z,V_n}) + An ln m/m).
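Computationally this is a complexity-penalized model selection. The following sketch (ours, not from the paper) uses least-squares fits over polynomial spaces as a stand-in for the clipped spaces V_n of the proof; the penalty constant A is illustrative:

import numpy as np

# Sketch (ours) of k = argmin_n (E_z(f_{z,V_n}) + A*n*ln(m)/m), with least
# squares over polynomial spaces of dimension n standing in for the V_n.

def fit_and_empirical_error(x, y, n):
    # least-squares fit over span{1, x, ..., x^{n-1}}; returns (coeffs, E_z)
    V = np.vander(x, n, increasing=True)
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)
    residual = V @ coef - y
    return coef, np.mean(residual ** 2)

def select_model(x, y, A):
    m = len(x)
    penalized = []
    for n in range(1, 11):                    # candidate dimensions
        _, err = fit_and_empirical_error(x, y, n)
        penalized.append(err + A * n * np.log(m) / m)
    return int(np.argmin(penalized)) + 1      # selected k

rng = np.random.default_rng(1)
m = 500
x = rng.uniform(-1, 1, m)
y = np.abs(x) + 0.1 * rng.standard_normal(m)
print("selected dimension k =", select_model(x, y, A=2.0))

The point of the penalty An ln m/m is that it mimics the stochastic error of f_{z,V_n}, so the minimizer balances approximation error against estimation error without knowing r.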
We will use the following result from [CS] (it is a direct corollary to Proposition 7 from
[CS]).
Lemma 2.3. Let H be a compact and convex subset of C(X). Then for all ǫ > 0 with
probability at least
1 − N(H, ǫ/(24M)) exp(−mǫ/(288M^2))
one has for all f ∈ H
Applying Lemma 2.3 with H = V_n, ǫ = An ln m/m, f = f_{z,V_n} and using that E(f_{V_n}) ≥ E(f_ρ)
we get for n ∈ [1, m]
To estimate the right side we take n = n(r) := [A^{−1/(2r)}(m/ln m)^{1/(1+2r)}] + 1. We have
E_z(f_{z,V_{n(r)}}) ≤ E_z(f_{V_{n(r)}})
with probability ≥ 1 − m^{C_3(M)(C_4(M)−A)}. Next, in the same way as we got (2.8) we obtain
E(f_{V_{n(r)}}) ≤ E(f_ρ) + C_1^2A(ln m/m)^{2r/(1+2r)}.
Now, if we impose an extra assumption on W in the form of decay of the Kolmogorov widths
(2.6) then Theorem 2.3 gives
(2.20) ǫ(W, z) ≪ (ln m/m)^{2r/(1+2r)}.
This estimate is better than (2.19). In the particular case where W = W_2^r is the Sobolev
class on [0, 1], one can derive from (2.19) and from known estimates ([BS])
We also note that the estimate (2.22) is only an existence result: we do not know explicit
subspaces that provide the approximation in (2.22). Therefore, Theorem 2.3 does not give a
constructive estimator providing (2.23). However, we can use Theorem 2.5 to overcome this
problem. For instance, in the periodic case it is known ([DT]) that
As a hypothesis space H one can take here the following set of n-term trigonometric polynomials
H = {f = Σ_{k∈Λ} c_ke^{ikx}, |Λ| ≤ n, Λ ⊂ [−n^{2r/(2r−1)}, n^{2r/(2r−1)}], ‖f‖_∞ ≤ C}
with n ≍ (m/ln m)^{1/(1+2r)}.
3. Estimating fµ
In this section we assume that ρX is an absolutely continuous measure with density µ(x):
dρX = µdx. We keep the notations from the previous section. Our major goal in this section
is to estimate the function fµ := fρ µ instead of the function fρ . In this section we assume
that |y| ≤ M . In the probability estimates we will use the following notations
Denote
S_n(f_µ) := Σ_{j=1}^n c_jψ_j.
Consider
ĉ_j := ĉ_j(z) := (1/m) Σ_{i=1}^m y_iψ_j(x_i).
Then
E(ĉ_j(z)) = ∫_X f_ρ(x)ψ_j(x) dρ_X = ∫_X f_µ(x)ψ_j(x) dx = c_j.
Assume now that W is a class satisfying the following approximation property: for f ∈ W
one has
(3.2) ‖f − Σ_{j=1}^n c_j(f)ψ_j‖_2 ≤ C_1n^{−r},  c_j(f) := ∫_X fψ_j dx.
It is clear that, for f_{(z,n)} := Σ_{j=1}^n ĉ_j(z)ψ_j,
‖S_n(f_µ) − f_{(z,n)}‖_2^2 = Σ_{j=1}^n |ĉ_j(z) − c_j|^2.
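A brief sketch (ours) of this construction: unbiased coefficient estimates and the resulting projection estimator, with the real trigonometric system on [0, 2π] as a concrete uniformly bounded orthonormal basis (all names below are our own):

import numpy as np

# Sketch (ours): c_hat_j(z) = (1/m) sum_i y_i psi_j(x_i) and the projection
# estimator f_{(z,n)} = sum_{j<n} c_hat_j psi_j, for the real trigonometric
# system, orthonormal in L2([0, 2*pi], dx).

def psi(j, t):
    if j == 0:
        return np.full_like(t, 1.0 / np.sqrt(2 * np.pi))
    k = (j + 1) // 2
    trig = np.cos if j % 2 == 1 else np.sin
    return trig(k * t) / np.sqrt(np.pi)

def coefficient_estimates(x, y, n):
    # unbiased: E c_hat_j = int f_rho psi_j d(rho_X) = int f_mu psi_j dx = c_j
    return np.array([np.mean(y * psi(j, x)) for j in range(n)])

def f_zn(c_hat, t):
    # f_{(z,n)}(t) = sum_j c_hat_j psi_j(t)
    return sum(c * psi(j, t) for j, c in enumerate(c_hat))

rng = np.random.default_rng(2)
m = 2000
x = rng.uniform(0, 2 * np.pi, m)          # rho_X uniform, so mu = 1/(2*pi)
y = np.sign(np.sin(x)) + 0.2 * rng.standard_normal(m)
c_hat = coefficient_estimates(x, y, n=15)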
3.2. In the previous subsection we considered the case of a general uniformly bounded
orthonormal basis. In this subsection we restrict ourselves to the case µ = 1 (f_ρ = f_µ),
impose some additional (or other) assumptions on a basis Ψ, and obtain error
estimates in the L_p-norm. We note that instead of assuming µ = 1 in the arguments that
follow it is sufficient to assume that µ ≤ C with an absolute constant C. Then we obtain
the same results for f_µ instead of f_ρ. In order to illustrate the new technique we consider
the periodic case with the basis T, the trigonometric system {e^{ikx}}. Let V_n(x) denote the
de la Vallée-Poussin kernel. We define an estimator for f_ρ by the formula:
(3.5) f_{z,VP} := (1/m) Σ_{i=1}^m y_i (1/π)V_n(x − x_i).
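Before analyzing this estimator we record a hedged numerical sketch of it (ours). We encode the de la Vallée-Poussin kernel through its Fourier multipliers (1 for |j| ≤ n, decaying linearly to 0 for n < |j| ≤ 2n) and take x_i uniform on [0, 2π); with this design the mean of the estimate is the de la Vallée-Poussin mean of f_ρ. Our normalization absorbs the factor 1/π of (3.5) into the kernel, so it differs from the paper's convention by a constant factor:

import numpy as np

# Sketch (ours) of the kernel estimator (3.5), with V_n written through its
# multipliers: lambda_j = 1 for |j| <= n, lambda_j = (2n - |j|)/n for |j| <= 2n.

def vp_kernel(t, n):
    t = np.atleast_1d(t)
    j_low = np.arange(1, n + 1)
    j_high = np.arange(n + 1, 2 * n + 1)
    lam = (2 * n - j_high) / n                    # linearly decaying multipliers
    return (1.0
            + 2 * np.sum(np.cos(np.outer(t, j_low)), axis=-1)
            + 2 * np.sum(lam * np.cos(np.outer(t, j_high)), axis=-1))

def f_z_vp(x_eval, x, y, n):
    # f_hat(x*) = (1/m) sum_i y_i V_n(x* - x_i); for x_i uniform on [0, 2*pi)
    # its mean is the VP mean of f_rho, reproducing frequencies |j| <= n.
    return np.array([np.mean(y * vp_kernel(xe - x, n)) for xe in x_eval])

rng = np.random.default_rng(3)
m = 5000
x = rng.uniform(0, 2 * np.pi, m)
y = np.sin(x) + 0.3 * np.cos(2 * x) + 0.1 * rng.standard_normal(m)
grid = np.linspace(0, 2 * np.pi, 9)
print(np.round(f_z_vp(grid, x, y, n=4), 3))   # approx sin(x*) + 0.3 cos(2 x*)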
For ξ = ξ_x(z) := (1/π) yV_n(x − u), z = (u, y), we have
E(ξ) = (1/π) ∫_0^{2π} f_ρ(u)V_n(x − u) dρ_X = (1/π) ∫_0^{2π} f_ρ(u)V_n(x − u) du =: V_n(f_ρ)(x).
Let x(l) = πl/(4n), l = 1, . . . , 8n. By Bernstein’s inequality for each l ∈ [1, 8n] we have
Prob_{z∈Z^m}{|V_n(f_ρ)(x(l)) − f_{z,VP}(x(l))| ≥ ǫ} ≤ 2 exp(−mǫ^2/(C(M)n)).
Using the Marcinkiewicz-Zygmund [Z] theorem: for any trigonometric polynomial t of order
N one has
‖t‖_∞ ≤ C_1 max_{1≤k≤4N} |t(kπ/(2N))|,
we obtain
(3.6) Prob_{z∈Z^m}{‖V_n(f_ρ) − f_{z,VP}‖_∞ ≤ C_1ǫ} ≥ 1 − nC_2 exp(−mǫ^2/(C(M)n)).
We define the class W_p^r(T) as the set of f that satisfy the estimate
‖f − V_n(f)‖_p ≤ C_3n^{−r},  1 ≤ p ≤ ∞.
Assume that f_ρ ∈ W_p^r(T). We specify ǫ = (A ln m/m)^{r/(1+2r)}, n = [ǫ^{−1/r}] + 1. Then (3.6)
implies
Prob_{z∈Z^m}{‖V_n(f_ρ) − f_{z,VP}‖_∞ ≤ C_1(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A)
and
Prob_{z∈Z^m}{‖f_ρ − f_{z,VP}‖_p ≤ (C_1 + C_3)(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A).
We point out that in this subsection we have obtained the Lp estimates for 1 ≤ p ≤ ∞.
We conclude this subsection by formulating the result proved above as a theorem.
Theorem 3.2. Assume µ = 1 and f_ρ ∈ W_p^r(T) with some 1 ≤ p ≤ ∞. Then the estimator
f_{z,VP} defined by (3.5) with n = [(m/(A ln m))^{1/(1+2r)}] + 1 provides
Prob_{z∈Z^m}{‖f_ρ − f_{z,VP}‖_p ≤ (C_1 + C_3)(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A).
We note that the estimator f_{z,VP} from Theorem 3.2 does not depend on p but does depend
on r (the choice of n depends on r). We proceed to the construction of an estimator that is
universal in r. We denote
W_p[T] := {W_p^r(T)}.
Theorem 3.3. For a given collection W_p[T] there exists an estimator f_z such that if f_ρ ∈
W_p^r(T) with some r ≤ R then
Prob_{z∈Z^m}{‖f_ρ − f_z‖_p ≤ C(R)A^{1/2}(ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A).
Proof. We define
A_0 := V_1;  A_s := V_{2^s} − V_{2^{s−1}}, s = 1, 2, . . . ;  A_s(f) := A_s ∗ f,
where ∗ means convolution. Using our assumption that f_ρ ∈ W_p^r(T) we get for all s
(3.7) ‖A_s(f_ρ)‖_p ≤ K2^{−rs}.
Similarly to (3.6) with ǫ = (A(2^s/m) ln m)^{1/2} we get, for all s ∈ [0, log m], the corresponding
estimate (3.8) with probability at least 1 − w(m, A). We now consider only those z that satisfy (3.8). We
build an estimator f_z on the basis of the sequence {‖f_{s,z}‖_p}_{s=0}^{[log m]}. First, if
‖f_ρ‖_p ≤ C_1(R)A^{1/2} Σ_{s=0}^∞ min((2^s ln m/m)^{1/2}, 2^{−rs}) ≤ C_2(R)A^{1/2}(ln m/m)^{r/(1+2r)}.
Second, if (3.9) is not satisfied then we let l ∈ [0, log m] be such that for s ∈ (l, log m]
and
We set n = 2^l and
f_z := (1/m) Σ_{i=1}^m y_iV_n(x − x_i).
≤ C_3 2^{−rl_0} + Σ_{s=l+1}^{l_0} (2C_1A^{1/2} + K)((2^s/m) ln m)^{1/2} + Σ_{s=0}^{l} C_1(A(2^s/m) ln m)^{1/2}
≤ C(R)A^{1/2}(ln m/m)^{r/(1+2r)}.
This completes the proof of Theorem 3.3.
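In code, the adaptive mechanism of this proof looks roughly as follows (a heavily hedged sketch, ours: it reuses f_z_vp from the sketch after (3.5), measures block norms by their maximum on a grid as a stand-in for ‖·‖_p, and uses an illustrative threshold constant in place of the exact conditions (3.8)–(3.10)):

import numpy as np

# Hedged sketch (ours) of the universal estimator: estimate the dyadic blocks
# A_s(f_rho) = (V_{2^s} - V_{2^{s-1}})(f_rho) empirically, find the last block
# above the noise level ~ (2^s ln m / m)^{1/2}, and output the VP estimator at
# that resolution. No knowledge of the smoothness r is used.

def block_estimate(x_grid, x, y, s):
    if s == 0:
        return f_z_vp(x_grid, x, y, 1)                  # A_0 := V_1
    return f_z_vp(x_grid, x, y, 2 ** s) - f_z_vp(x_grid, x, y, 2 ** (s - 1))

def universal_estimator(x_grid, x, y, c0=1.0):
    m = len(x)
    s_max = int(np.log2(m))
    noise = lambda s: c0 * np.sqrt(2 ** s * np.log(m) / m)  # illustrative threshold
    significant = [s for s in range(s_max + 1)
                   if np.max(np.abs(block_estimate(x_grid, x, y, s))) > noise(s)]
    if not significant:
        return np.zeros_like(x_grid)                    # all blocks at noise level
    l = max(significant)
    return f_z_vp(x_grid, x, y, 2 ** l)                 # data-driven resolution 2^l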
We now point out that the above method applied to the trigonometric
system also works for more general systems. Let Ψ = {ψ_j}_{j=1}^∞ be an orthonormal basis.
First, we assume that Ψ is a (VP)-system for C. Using the notations from the definition of
the (VP)-system we define
V^Ψ_{n_k}(x, u) := Σ_j λ_{k,j}ψ_j(x)ψ_j(u).
Then
P_k(f) = ∫_X f(u)V^Ψ_{n_k}(x, u) du.
Second, we assume that for all k and x, u ∈ X
(I) |V^Ψ_{n_k}(x, u)| ≤ C′n_k,  ‖V^Ψ_{n_k}(x, ·)‖_2^2 ≤ C′n_k.
Denote
Ψ(n) := span{ψ_1, . . . , ψ_n}.
Third, we assume that for each n there exists a set of points ξ^1, . . . , ξ^{N(n)} ∈ X such that
N(n) ≤ n^c and for any f ∈ Ψ(n)
(II) ‖f‖_∞ ≤ C′′ max_i |f(ξ^i)|.
Then
Prob_{z∈Z^m}{‖f_ρ − f_z‖_p ≤ C(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A)
with constants that may depend on C′, c, C′′, C_1 and the parameters A_1, A_2, A_3 from the
definition of the (VP)-system.
The universality technique can also be extended to the bases Ψ from Theorem 3.2Ψ. We
denote
W_p[Ψ] := {W_p^r(Ψ)}.
Theorem 3.3Ψ. Let Ψ be an orthonormal basis. Suppose Ψ is a (VP)-system for C
satisfying (I) and (II). For a given collection W_p[Ψ] there exists an estimator f_z such that
if f_ρ ∈ W_p^r(Ψ) with some r ≤ R then
Prob_{z∈Z^m}{‖f_ρ − f_z‖_p ≤ CA^{1/2}(ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A).
We define
A^Ψ_0 := V^Ψ_{n_1};  A^Ψ_s := V^Ψ_{n_{s+1}} − V^Ψ_{n_s}, s = 1, 2, . . . ;  A^Ψ_s(f) := ∫_X f(u)A^Ψ_s(x, u) du.
‖A^Ψ_s(f_µ)‖_p ≤ K_1n_s^{−r}.
and
‖f^Ψ_{l,z}‖_p > (C′′A^{1/2} + K_1)((n_l/m) ln m)^{1/2}.
We set
f^Ψ_z := (1/m) Σ_{i=1}^m y_iV^Ψ_{n_l}(x, x_i).
3.3. In this subsection we extend the results of subsection 3.1 to a wider range of function
classes. We impose a weaker assumption than (3.2). It will be formulated in terms of
nonlinear n-term approximations. Let Ψ be an orthonormal uniformly bounded basis for
L_2(X), ‖ψ_j‖_∞ ≤ B. Denote
σ_n(f, Ψ)_2 := inf_{k_1,...,k_n; c_1,...,c_n} ‖f − Σ_{j=1}^n c_jψ_{k_j}‖_2.
Assume fµ satisfies
and
with r ≤ R. Then
Prob_{z∈Z^m}{‖f_µ − f_z‖_2 ≤ C(R)(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A; B, R/a).
Consider those z that satisfy max_{j∈[1,N]} |ĉ_j(z) − c_j| ≤ η, and define Λ(z, η) := {j ∈ [1, N] : |ĉ_j(z)| ≥ 2η}.
For these z and j ∈ Λ(z, η) we get |c_j| ≥ η. Also, if j is such that |c_j| ≥ 3η then j ∈ Λ(z, η).
Moreover, we have
(3.16) ‖Σ_{j∈Λ(z,η)} (ĉ_j(z) − c_j)ψ_j‖_2 ≤ η|Λ(z, η)|^{1/2}
and, therefore,
(3.17) |Λ(z, η)| ≤ C(R)η^{−2/(1+2r)}.
Next,
(3.18) ‖Σ_{j∈[1,N]\Λ(z,η)} c_jψ_j‖_2 ≤ ‖Σ_{j:|c_j|≤3η} c_jψ_j‖_2 ≤ C(R)η^{2r/(1+2r)}.
For all r ≤ R we choose N = [m^{R/a}] and η = (A ln m/m)^{1/2}. Then combining (3.12), (3.13),
(3.15), (3.16), (3.17), and (3.18) we obtain
(3.19) Prob_{z∈Z^m}{‖f_µ − f_{z,η}‖_2 ≤ C(R)(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A; B, R/a).
We stress here that the estimator fz from Theorem 3.4 does not depend on r. In this
sense the estimator fz is universal. It adjusts automatically to the optimal smoothness of
the class that contains fµ .
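A compact sketch (ours) of this universal thresholding rule, reusing psi and coefficient_estimates from the sketch in subsection 3.1; the window size N = [m^{R/a}] requires the a priori parameters R and a, while the threshold itself does not involve r:

import numpy as np

# Sketch (ours): keep the empirical coefficients exceeding 2*eta with
# eta = (A ln m / m)^{1/2}; the retained index set is Lambda(z, eta) and
# f_{z,eta} is the sum of the retained terms. The rule adapts to the
# unknown smoothness r automatically.

def threshold_estimator_coeffs(x, y, N, A=1.0):
    m = len(x)
    eta = np.sqrt(A * np.log(m) / m)
    c_hat = coefficient_estimates(x, y, N)   # c_hat_j, j = 0, ..., N-1
    keep = np.abs(c_hat) >= 2 * eta          # Lambda(z, eta)
    return np.where(keep, c_hat, 0.0)        # coefficients of f_{z,eta}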
3.4. This subsection is a natural continuation of the previous one. We assume
here that {ψ_j} is an orthonormal basis for L_2(X) satisfying the condition
‖ψ_j‖_∞ ≤ Bj^{1/2}, j = 1, 2, . . . , and impose an extra condition µ ≤ C. Then instead of (3.14)
we have by Bernstein's inequality
(3.20) Prob_{z∈Z^m}{|ĉ_j(z) − c_j| ≥ η} ≤ 2 exp(−mη^2/(C(M, B)(1 + ηj^{1/2}))).
We use the same notations and definitions as above in subsection 3.3. Instead of (3.15) we
now have
(3.21) Prob_{z∈Z^m}{max_{j∈[1,N]} |ĉ_j(z) − c_j| ≤ η} ≥ 1 − 2N exp(−mη^2/(C(M, B)(1 + ηN^{1/2}))).
Similarly to the above we get relation (3.16) and relations (3.17), (3.18) for f_µ satisfying
(3.12). We now set
η = (A ln m/m)^{1/2},  N = [η^{−2}] + 1.
Using (3.13) with a = r/(1+2r) in the same way as we got (3.19) we now obtain the similar
estimate
(3.22) Prob_{z∈Z^m}{‖f_µ − f_{z,η}‖_2 ≤ C(R)(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A; B, R).
Assume µ ≤ C, f_µ satisfies (3.10) with r ≤ R and satisfies (3.11) with a = r/(1+2r). Then
Prob_{z∈Z^m}{‖f_µ − f_z‖_2 ≤ C(R)(A ln m/m)^{r/(1+2r)}} ≥ 1 − w(m, A; B, R).
References
[BS] M.Sh. Birman and M.Z. Solomyak, Estimates of singular numbers of integral operators, Uspekhi Mat. Nauk 32 (1977), 17–84; English transl. in Russian Math. Surveys 32 (1977).
[C] B. Carl, Entropy numbers, s-numbers, and eigenvalue problems, J. Funct. Anal. 41 (1981), 290–306.
[CS] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001), 1–49.
[DT] R.A. DeVore and V.N. Temlyakov, Nonlinear approximation by trigonometric sums, J. Fourier Anal. Appl. 2 (1995), 29–48.
[K] B.S. Kashin, Widths of certain finite-dimensional sets and classes of smooth functions, Izv. Akad. Nauk SSSR, Ser. Mat. 41 (1977), 334–351; English transl. in Math. USSR Izv. 11 (1977).
[PS] T. Poggio and S. Smale, The mathematics of learning: dealing with data, manuscript (2003), 1–16.
[T1] V.N. Temlyakov, Approximation by elements of a finite dimensional subspace of functions from various Sobolev or Nikol'skii spaces, Matem. Zametki 43 (1988), 770–786; English transl. in Math. Notes 43 (1988), 444–454.
[T2] V.N. Temlyakov, On universal cubature formulas, Dokl. Akad. Nauk SSSR 316 (1991), no. 1; English transl. in Soviet Math. Dokl. 43 (1991), 39–42.
[T3] V.N. Temlyakov, Approximation of Periodic Functions, Nova Science Publishers, Inc., New York, 1993.
[T4] V.N. Temlyakov, Nonlinear methods of approximation, Found. Comput. Math. 3 (2003), 33–107.
[T5] V.N. Temlyakov, Nonlinear Kolmogorov widths, Matem. Zametki 63 (1998), 891–902.
[Z] A. Zygmund, Trigonometric Series, Cambridge University Press, Cambridge, 1959.