Ridgelets: Theory and Applications
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
I certify that I have read this dissertation and that in my opinion
it is fully adequate, in scope and in quality, as a dissertation for
the degree of Doctor of Philosophy.
David L. Donoho
(Principal Adviser)
Iain M. Johnstone
George C. Papanicolaou
Abstract
Single hidden-layer feedforward neural networks have been proposed as an approach to
bypass the curse of dimensionality and are now becoming widely applied to approximation
or prediction in the applied sciences. In that approach, one approximates a multivariate target
function by a sum of ridge functions; this is similar to projection pursuit in the statistics
literature. This approach poses new and challenging questions at both the practical and
theoretical level, ranging from the construction of neural networks to their efficiency and
capability. The topic of this thesis is to show that ridgelets, a new set of functions, provide
an elegant tool to answer some of these fundamental questions.
In the first part of the thesis, we introduce a special admissibility condition for neural
activation functions. Using an admissible neuron, we develop two linear transforms, namely
the continuous and discrete ridgelet transforms. Both transforms represent quite general
functions f as a superposition of ridge functions in a stable and concrete way. A frame of
"nearly orthogonal" ridgelets underlies the discrete transform.
In the second part, we show how to use the ridgelet transform to derive new approxi-
mation bounds. That is, we introduce a new family of smoothness classes and show how
they model "real-life" signals by exhibiting some specific sorts of high-dimensional spatial
inhomogeneities. Roughly speaking, finite linear combinations of ridgelets are optimal for
approximating functions from these new classes. In addition, we use the ridgelet transform
to study the limitations of neural networks. As a surprising and remarkable example, we
discuss the case of approximating radial functions.
Finally, it is explained in the conclusion why these new ridgelet expansions offer decisive
improvements over traditional neural networks.
Acknowledgements
First, I would like to thank my advisor David Donoho, whose most creative and original
thinking has been a great source of inspiration for me. I admire his deep and penetrating
views on so many areas of the mathematical sciences and feel particularly indebted to him
for sharing his thoughts with me. Beyond the unique scientist, there is the friend whose
kindness and generosity throughout my stay at Stanford have been invaluable. I also extend
my gratitude to his wife, Miki.
I feel privileged to have had so many fantastic teachers and professors who nurtured
my love and interest for science. I owe special thanks to Patrick David and to Professor
Yves Meyer, who shared their enthusiasm with me, a quality that I hope will be a lifetime
companion.
I would also like to thank Professors Jerome Friedman, Iain Johnstone and George
Papanicolaou for serving on my orals committee and for having, together with Professor
Darrell Duffie, written letters of recommendation on my behalf.
I wish to thank all the people of the Department of Statistics for creating such a world-
class scientific environment in which it is so easy to blossom; especially the faculty, who
greatly enriched my scientific experience by exposing me to new areas of research.
A short acknowledgement seems very little to thank my parents for their constant
love and support, and for the never-failing confidence they had in me.
My days at Stanford would not have been the same without Helen, for the countless
little things she did so that I would feel "at home." I praise the courage she found to read
and suggest improvements to this manuscript.
Finally, my deepest gratitude goes to my wife, Chiara, whose encouragement, humor
and love have made these last four years a pure enjoyment.
Contents
Abstract iv
Acknowledgements v
1 Introduction 1
1.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Approximation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Statistical Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Projection Pursuit Regression (PPR) . . . . . . . . . . . . . . . . . . 4
1.3.2 Neural Nets Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Statistical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Harmonic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 A Continuous Representation . . . . . . . . . . . . . . . . . . . . . . 7
1.5.2 Discrete Representation . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.4 Innovations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 The Continuous Ridgelet Transform 13
2.1 A Reproducing Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 A Parseval Relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 A Semi-Continuous Reproducing Formula . . . . . . . . . . . . . . . . . . . 20
3 Discrete Ridgelet Transforms: Frames 23
3.1 Generalities about Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Discretization of Γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Concluding Remarks 98
7.1 Ridgelets and Traditional Neural Networks . . . . . . . . . . . . . . . . . . 98
7.2 What About Barron's Class? . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3 Unsolved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.4.1 Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . 105
7.4.2 Curved Singularities . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A Proofs and Results 107
References 113
List of Figures
2.1 Ridgelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Ridgelet discretization of the frequency plane . . . . . . . . . . . . . . . . . 25
Chapter 1
Introduction
Let f(x): R^d → R be a function of d variables. In this thesis, we are interested in
constructing convenient approximations to f using a system called neural networks. This
problem is of wide interest throughout the mathematical sciences and many fundamental
questions remain open. Because of the extensive use of neural networks, we will address
questions from various perspectives and use these as guidelines for the present work.
From a mathematical point of view, such approximations amount to taking finite linear
combinations of atoms from the dictionary D_Ridge = {σ(k·x − b) : k ∈ R^d, b ∈ R} of
elementary ridge functions. As is known, any function of d variables can be approximated
arbitrarily well by such combinations (Cybenko, 1989; Leshno, Lin, Pinkus, and Schocken,
1993). As far as constructing these combinations, a frequently discussed approach is the
greedy algorithm that, starting from f_0(x) = 0, operates in a stepwise fashion; running
through steps i = 1, ..., m, we inductively define

    f_i = arg min_{0 ≤ α ≤ 1} arg min_{(k,b) ∈ R^d × R} ‖f − (α f_{i−1} + (1 − α) σ(k·x − b))‖_2.    (1.3)
Thus, at the i-th stage, the algorithm substitutes for f_{i−1} a convex combination involving
f_{i−1} and a term from the dictionary D_Ridge that results in the largest decrease in approx-
imation error (1.3). It is known that when f ∈ L²(D) with D a compact set, the greedy
algorithm converges (Jones, 1992b); it is also known that for a relaxed variant of the greedy
algorithm, the convergence rate can be controlled under certain assumptions (Jones, 1992a;
Barron, 1993). There are unfortunately two problems with the conceptual basis of such
results.
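To make the discretization question concrete, the following toy sketch runs one step of a scheme in the spirit of (1.3) over finite grids of directions k, offsets b and mixing weights α. It is an illustration under our own assumptions, not the thesis' construction: the function names and all grid choices are ours.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def greedy_step(f_vals, fit_vals, X, K, B, alphas):
    """One (fully discretized) greedy step: scan finite grids of directions k,
    offsets b and convex weights alpha, and return the combination
    alpha*fit + (1-alpha)*sigmoid(k.x - b) with smallest empirical L2 error."""
    best = None
    for k in K:                      # candidate weight vectors
        proj = X @ k                 # k . x for every sample point
        for b in B:                  # candidate offsets
            atom = sigmoid(proj - b)
            for a in alphas:         # candidate convex weights
                resid = f_vals - (a * fit_vals + (1 - a) * atom)
                err = np.sqrt(np.mean(resid ** 2))
                if best is None or err < best[0]:
                    best = (err, a * fit_vals + (1 - a) * atom)
    return best                      # (error, updated approximant values)

# toy target: a single ridge function in d = 2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
f_vals = sigmoid(X @ np.array([2.0, -1.0]) - 0.3)

K = [np.array([2.0, -1.0]), np.array([1.0, 1.0])]
B = np.linspace(-1, 1, 9)
alphas = np.linspace(0, 1, 11)

err0 = np.sqrt(np.mean(f_vals ** 2))                 # error of f_0 = 0
err1, fit1 = greedy_step(f_vals, np.zeros(len(X)), X, K, B, alphas)
```

Even this crude grid search illustrates the point raised next: how fine the grids K, B must be for the discrete search to track the continuum minimization is exactly the open question.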
First, they lack the constructive character which one ordinarily associates with the
word "algorithm." In any assumed implementation of minimizing (1.3) one would need to
search for a minimum within a discrete collection of k and b. What are the properties of
procedures restricted to such collections? Or, more directly, how finely discretized must the
collection be so that a search over that collection gives results similar to a minimization over
the continuum? In some sense, applying the word "algorithm" to abstract minimization
procedures in the absence of an understanding of this issue is a misnomer.
Second, even if one is willing to forgive the lack of constructivity in such results, one
must still face the lack of stability of the resulting decomposition. An approximant f_N(x) =
Σ_{i=1}^N α_i σ(k_i·x − b_i) has coefficients which are in no way continuous functionals of f and
do not necessarily reflect the size and organization of f (Meyer, 1992).
    ‖f − f_N‖_2 ≤ C N^{−1/2},    (1.4)
where fN is the output of the algorithm at stage N . Their result, however, also raises a set
of challenging questions which we will now discuss.
The greedy algorithm. The work of DeVore and Temlyakov (1996) shows that the greedy
algorithm unfortunately has very weak approximation properties. Even when good approxi-
mations exist, the greedy algorithm cannot be guaranteed to find them, even in the extreme
case where f is just a superposition of a few, say ten, elements of our dictionary D_Ridge.
Neural nets for which functions? It can be shown that for the class Barron considers, a
simple N-term trigonometric approximation would give better rates of convergence, namely
O(N^{−(1/2+1/d)}) (and, of course, there is a real and fast algorithm). So, it would be of interest
to be able to identify functional classes for which neural networks are more efficient than
other methods of approximation, or more ambitiously a class F for which it could be proved
that linear combinations of elements of D_Ridge give the best rate of approximation over F.
    Y_i = f(X_i) + ε_i,    (1.5)

where ε_i is the noise contribution, one wishes to estimate the unknown smooth function f.
It is observed that well-known regression methods such as kernel smoothing, nearest-
neighbor, and spline smoothing (see Härdle, 1990 for details) may perform very badly in high
dimensions because of the so-called curse of dimensionality. The curse comes from the fact
that when dealing with a finite amount of data, the high-dimensional unit ball is mostly
empty, as discussed in the excellent paper of Friedman and Stuetzle (1981). In terms of
estimation bounds, roughly speaking, the curse says that unless you have an enormous
sample size N, you will get a poor mean-squared error, say.
where the u_j's are vectors of unit length, i.e., ‖u_j‖ = 1. The algorithm, the statistical
analogue of (1.2)-(1.3), also operates in a stepwise fashion. At stage m, it augments the fit
f_{m−1} by adding a ridge function g_m(u_m·x) obtained as follows: calculate the residuals of
the (m−1)-th fit, r_i = Y_i − Σ_{j=1}^{m−1} g_j(u_j·X_i); for a fixed direction u, plot the residuals r_i
against u·X_i and fit a smooth curve g; then choose the best direction u, so as to minimize the
residual sum of squares Σ_i (r_i − g(u·X_i))². The algorithm stops when the improvement
is small.
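The PPR step just described can be sketched as follows. This is a deliberately crude stand-in: a bin-average smoother replaces the scatterplot smoother actually used in PPR, and all names, data and grid choices are ours.

```python
import numpy as np

def ppr_step(X, resid, directions, h=0.2):
    """One projection pursuit step (sketch): for each trial direction u, fit a
    crude smoother g to the residuals against the projections u.X (here a
    moving bin average over windows of halfwidth h), and keep the direction
    whose fit leaves the smallest residual sum of squares."""
    best = None
    for u in directions:
        t = X @ u
        # bin-average smoother: g(t_i) = mean of residuals with |t_j - t_i| <= h
        g = np.array([resid[np.abs(t - ti) <= h].mean() for ti in t])
        rss = np.sum((resid - g) ** 2)
        if best is None or rss < best[0]:
            best = (rss, u, g)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
u0 = np.array([0.6, 0.8])               # true direction, unit length
y = np.sin(3 * (X @ u0))                # a single ridge surface
directions = [np.array([0.6, 0.8]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rss, u_hat, g_hat = ppr_step(X, y, directions)
```

For a target that is itself a ridge function, smoothing along the true direction explains essentially all the variation, while any other projection leaves the variance coming from the orthogonal coordinate, so the correct direction wins the residual-sum-of-squares comparison.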
The approach was revolutionary because instead of averaging the data over balls, PPR
performs a local averaging over narrow strips |u·x − t| ≤ h, thus avoiding the problems
related to the sparsity of the high-dimensional unit ball.
where k_j ∈ R^d and b_j ∈ R, so that the fit is exactly like (1.1). Again, the sigmoid is most
commonly used for σ.
Of course, PPR and neural net regression are of the same flavor, as both attempt to
approximate the regression surface by a superposition of ridge functions. One of the main
differences is perhaps that neural networks allow for a non-smooth fit, since σ(k·x − b)
resembles a step function when the norm ‖k‖ of the weights is large. On the other hand,
PPR can make better use of projections since it bears the freedom to choose a different
profile g at each step.
1.5 Achievements
This thesis is about the important issues that have just been raised. Our goal here is
to apply the concepts and methods of modern harmonic analysis to tackle these problems,
starting with the primary one: the problem of constructing neural networks.
Using techniques developed in group representation theory and wavelet analysis, we
develop two concrete and stable representations of functions f as superpositions of ridge
functions. We then use these new expansions to study finite approximations.
increasing, such an admissible activation function is oscillating, taking both positive and
negative values. In fact, our condition requires a number of vanishing moments
proportional to the dimension d, so that an admissible ψ has zero integral, zero "average
slope," zero "average curvature," etc., in high dimensions.
We show that if one is willing to abandon the traditional sigmoidal neural activation
function σ, which typically has no vanishing moments and is not in L², and replace it by an
admissible neural activation function ψ, then any reasonable function f may be represented
exactly as a continuous superposition from the dictionary D_Ridgelet = {ψ_γ : γ ∈ Γ} of
ridgelets ψ_γ(x) = a^{−1/2} ψ((u·x − b)/a), where the ridgelet parameter γ = (a, u, b) runs through the
set Γ = {(a, u, b) : a, b ∈ R, a > 0, u ∈ S^{d−1}}, with S^{d−1} denoting the unit sphere of R^d.
In short, we establish a continuous reproducing formula

    f = c_ψ ∫ ⟨f, ψ_γ⟩ ψ_γ μ(dγ).    (1.6)

These two formulas mean that we have a well-defined continuous ridgelet transform R(f)(γ) =
⟨f, ψ_γ⟩ taking functions on R^d isometrically into functions of the ridgelet parameter γ =
(a, u, b).
with equality in the L²(D) sense. This representation is stable in the sense that the co-
efficients change continuously under perturbations of f which are small in L²(D) norm.
Underlying the construction of such a discrete transform is, of course, a quasi-Parseval
relation, which in this case takes the form

    A ‖f‖²_{L²(D)} ≤ Σ_{γ ∈ Γ_d} |⟨f, ψ_γ⟩_{L²(D)}|² ≤ B ‖f‖²_{L²(D)}.    (1.9)

Equation (1.8) follows by use of the standard machinery of frames (Duffin and Schaeffer,
1952; Daubechies, 1992). Frame machinery also shows that the coefficients are realiz-
able as bounded linear functionals α_γ(f) having Riesz representers ψ̃_γ(x) ∈ L²(D). These
representers are not ridge functions themselves, but by the convergence of the Neumann series
underlying the frame operator, we are entitled to think of them as molecules made up of
linear combinations of ridge atoms, where the linear combinations concentrate on atoms
with parameters γ' "near" γ.
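A minimal finite-dimensional sketch of this frame machinery, under our own toy choices (a three-vector frame for R², not a ridgelet frame): the dual representers are obtained by inverting the frame operator S through the Neumann series alluded to in the text.

```python
import numpy as np

# A toy frame for R^2: three unit vectors (the "Mercedes-Benz" frame), for
# which the frame bounds coincide, A = B = 3/2. In general S is inverted by
# the Neumann series S^{-1} = (2/(A+B)) * sum_k (I - 2S/(A+B))^k.
atoms = np.array([[1.0, 0.0],
                  [-0.5, np.sqrt(3) / 2],
                  [-0.5, -np.sqrt(3) / 2]])

S = atoms.T @ atoms                      # frame operator as a 2x2 matrix
A, B = np.linalg.eigvalsh(S)             # frame bounds = extreme eigenvalues

def inv_neumann(S, A, B, n_terms=30):
    d = S.shape[0]
    T = np.eye(d) - (2.0 / (A + B)) * S  # contraction: ||T|| <= (B-A)/(B+A) < 1
    out, P = np.zeros_like(S), np.eye(d)
    for _ in range(n_terms):             # partial sums of the Neumann series
        out += P
        P = P @ T
    return (2.0 / (A + B)) * out

S_inv = inv_neumann(S, A, B)
duals = atoms @ S_inv                    # dual frame (the "molecules" above)

f = np.array([0.3, -1.2])
coeffs = atoms @ f                       # analysis: <f, psi_gamma>
f_rec = duals.T @ coeffs                 # synthesis with the dual frame
```

The same iteration converges for any frame with 0 < A ≤ B < ∞; the ratio (B − A)/(B + A) controls the convergence rate, which is why sharp frame bounds matter in practice.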
1.5.3 Applications
As a result of Chapters 2 and 3, we are, roughly speaking, in a position to efficiently
construct finite approximations by ridgelets which give good approximations to a given
function f ∈ L²(D). One can see where the tools we have constructed are heading: from
the exact series representation (1.8), one aims to extract a finite linear combination which
is a good approximation to the infinite series; once such a representation is available, one
has a stable, mathematically tractable method of constructing approximate representations
of functions f based on systems of neuron-like elements.
New functional classes. Rephrasing a comment made in Section 1.2, it is natural to ask
for which functional classes ridgelets make sense. That is, what are the classes they
approximate best? To explain further what we mean, suppose we are given a dictionary
D = {g_λ : λ ∈ Λ}. For a function f, we define its approximation error by N elements of the
dictionary D by

    inf_{(λ_i)_{i=1}^N} inf_{(α_i)_{i=1}^N} ‖f − Σ_{i=1}^N α_i g_{λ_i}‖_H ≡ d_N(f, D).    (1.10)
Suppose now that we are interested in the approximation of classes of functions, and
characterize the rate of approximation of a class F by N elements from D accordingly:
extract the N-term approximation f̃_N, where one keeps only the dual-ridgelet terms
corresponding to the N largest ridgelet coefficients ⟨f, ψ_γ⟩; then, the approximant f̃_N
achieves the optimal rate of approximation over our new classes.
In Chapter 4, we give a description of these new spaces in terms of the smoothness of the
Radon transform of f. Furthermore, we explain how these spaces model functions that are
singular across hyperplanes, where there may be an arbitrary number of hyperplanes which
may be located at any spatial positions and may have any orientations.
Specific examples. We study degrees of approximation over some specific examples. For
example, we will show in Chapter 5 that the goals set in Section 1.4 are fulfilled. Although
ridgelets are optimal for representing objects with singularities across hyperplanes, they
fail to represent efficiently singular radial objects (Chapter 6), i.e., when singularities are
associated with spheres and, more generally, with curved hypersurfaces. In some sense, we
cannot curve the singular sets.
Superiority over traditional neural nets. In neural networks, one considers approxima-
tions by finite linear combinations taken from the dictionary D_NN = {σ(k·x − b) : k ∈
R^d, b ∈ R}, where σ is the univariate sigmoid; see Barron (1993) for example. It is shown
that for any function f ∈ L²(D), there is a ridgelet approximation which is at least as good
as, and perhaps much better than, the best ideal approximation using neural networks.
1.5.4 Innovations
Underlying our methods is the inspiration of modern harmonic analysis: ideas like the
Calderón reproducing formula and the theory of frames. We shall briefly describe what is
new here, that which is not merely an "automatic" consequence of existing ideas.
First, there is, of course, a general machinery for getting continuous reproducing formu-
las like (1.6), via the theory of square-integrable group representations (Duflo and Moore,
1976; Daubechies, Grossmann, and Meyer, 1986). Such a theory has been applied to de-
velop wavelet-like representations over groups other than the usual ax + b group on R^d; see
Bernier and Taylor (1996). However, the particular geometry of ridge functions does not
allow the identification of the action of Γ on ψ with a linear group representation (notice
that the argument of ψ is real, while the argument of ψ_γ is a vector in R^d). As a conse-
quence, the possibility of a straightforward application of well-known results is ruled out.
As an example of the difference, our condition for admissibility of a neural activation func-
tion for the continuous ridgelet transform is much stronger, requiring about d/2 vanishing
moments in dimension d, than the usual condition for admissibility of the mother wavelet
for the continuous wavelet transform, which requires only one vanishing moment in any
dimension.
Second, in constructing frames of ridgelets, we have been guided by the theory of
wavelets, which holds that one can turn continuous transforms into discrete expansions
by adopting a strategy of discretizing frequency space into dyadic coronae (Daubechies,
1992; Daubechies, Grossmann, and Meyer, 1986); this goes back to Littlewood-Paley theory (Fra-
zier, Jawerth, and Weiss, 1991). Our approach indeed uses such a strategy for dealing with
the location and scale variables in the Γ_d dictionary. However, in dealing with ridgelets
there is also an issue of discretizing the directional variable u that seems to be a new ele-
ment: u must be discretized more finely as the scale becomes finer. The existence of frame
bounds under our discretization shows that we have achieved, in some sense, the "right"
discretization, and we believe this to be new and of independent interest.
Third, as emphasized in the previous two paragraphs, one has available a new tool
to analyze and synthesize multivariate functions. While wavelets and related methods
work well in the analysis and synthesis of objects with local singularities, ridgelets are
designed to work well with conormal objects: objects that are singular across some family
of hypersurfaces, but smooth along them. This leads to a more general, if superficial,
observation: the association between neural net representations and certain types of spatial
inhomogeneities seems, here, to be a new element.
Next, there is a serious attempt in this thesis to characterize and identify functional
classes that can be approximated by neural nets at a certain rate. Unlike well-grounded areas
of approximation theory, neural network theory does not solve the delicate characterization
issue. In wavelet or spline theory, it is well known that the efficiency of the approximation
is characterized by classical smoothness (Besov spaces). In contrast, in addressing the
characterization issues of neural net approximation it is necessary to abandon the classical
measure of smoothness. Instead, we propose a new one and define a new scale of spaces
based on our new definition. In addition to providing a characterization framework, these
spaces to our knowledge have not been studied in classical analysis, and their study may be of
independent interest.
We conclude this introduction by underlining perhaps the most important aspect of the
present thesis: ridgelet expansion and approximation are both constructive and effective
procedures, as opposed to the existential approximations commonly discussed in the neural
networks literature (see Section 1.1).
Chapter 2
The Continuous Ridgelet
Transform
In this chapter we present results regarding the existence and the properties of the contin-
uous representation (1.6). Recall that we have introduced the parameter space
Γ = {(a, u, b) : a, b ∈ R, a > 0, u ∈ S^{d−1}}
and the notation ψ_γ(x) = a^{−1/2} ψ((u·x − b)/a). Of course, the parameter γ = (a, u, b) has a nat-
ural interpretation: a indexes the scale of the ridgelet, u its orientation and b its location.
The measure μ(dγ) on the neuron parameter space Γ is defined by μ(dγ) = σ_d (da/a^{d+1}) du db, where
σ_d is the surface area of the unit sphere S^{d−1} in dimension d and du the uniform probability
measure on S^{d−1}. As usual, f̂(ξ) = ∫ e^{−iξ·x} f(x) dx denotes the Fourier transform of f,
which we also write F(f). To simplify notation, we will consider only the case of multivariate x ∈ R^d
with d ≥ 2. Finally, we will always assume that ψ: R → R belongs to the Schwartz space
S(R). The results presented here hold under weaker conditions on ψ, but we avoid the study
of various technicalities in this chapter.
We now introduce the key de nition of this chapter.
Definition 1 Let ψ: R → R satisfy the condition

    K_ψ = ∫ |ψ̂(ξ)|² / |ξ|^d dξ < ∞.    (2.1)
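As a numerical illustration of condition (2.1), one can take the Mexican-hat-type profile ψ̂(ξ) = ξ² e^{−ξ²/2} (two vanishing moments; this choice and the numerical grids are ours) and observe that K_ψ stays finite as the inner cutoff shrinks for small d, but blows up once d is too large relative to the number of vanishing moments, in line with the roughly d/2 moments required later.

```python
import numpy as np

def K_psi(d, eps, cutoff=10.0, n=200000):
    """Numerically integrate |psi_hat(xi)|^2 / |xi|^d over eps <= xi <= cutoff
    for psi_hat(xi) = xi^2 * exp(-xi^2 / 2) (two vanishing moments).
    Shrinking eps reveals whether the integral over R converges."""
    xi = np.linspace(eps, cutoff, n)
    integrand = xi ** 4 * np.exp(-xi ** 2) / xi ** d
    # trapezoidal rule, written out by hand; the integrand is even in xi,
    # so integrate over xi > 0 and double
    return 2.0 * float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(xi)) / 2)

# d = 2: the value stabilizes as eps -> 0 (psi is admissible in R^2) ...
vals_d2 = [K_psi(2, eps) for eps in (1e-2, 1e-4, 1e-6)]
# ... while for d = 6 it blows up near the origin (not admissible in R^6)
vals_d6 = [K_psi(6, eps) for eps in (1e-2, 1e-4, 1e-6)]
```

Near ξ = 0 the integrand behaves like |ξ|^{4−d}, so the integral converges exactly when d < 5: two vanishing moments buy admissibility only up to dimension four.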
Then, by Fubini,

    I = ∫∫∫ exp{iξ u·x} f̂(ξu) |ψ̂(aξ)|² (da/a^d) 1_{ξ>0} dξ σ_d du
      = ∫∫ exp{iξ u·x} f̂(ξu) K_ψ |ξ|^{d−1} 1_{ξ>0} dξ σ_d du
      = K_ψ ∫_{R^d} exp{ix·k} f̂(k) dk
      = K_ψ (2π)^d f(x).
Integral representations like (2.2) have been independently discovered by Murata (1996).
w_{a,u} is integrable, being the convolution of two integrable functions, and belongs to
L²(R) since ‖w_{a,u}‖₂ ≤ ‖f‖₁ ‖ψ_a‖₂; its Fourier transform is then well defined and ŵ_{a,u}(λ) =
\overline{ψ̂_a(λ)} f̂(λu). By the usual Plancherel theorem, ∫_R |w_{a,u}(b)|² db = (1/2π) ∫ |ŵ_{a,u}(λ)|² dλ and,
hence,

    I = (1/2π) ∫∫∫ |f̂(λu)|² |ψ̂_a(λ)|² σ_d (da/a^{d+1}) du dλ
      = (2/2π) ∫_{λ>0} ∫∫ |f̂(λu)|² |ψ̂(aλ)|² (da/a^d) σ_d du dλ,

using |ψ̂_a(λ)|² = a |ψ̂(aλ)|² and symmetry in λ. Since ∫ |ψ̂(aλ)|² da/a^d = K_ψ |λ|^{d−1} (admissibility), we have

    I = (σ_d K_ψ/π) ∫_{λ>0} ∫ |f̂(λu)|² λ^{d−1} dλ du = (K_ψ/π) (2π)^d ‖f‖₂².
The assumptions on f in the above two theorems are somewhat restrictive, and the
basic formulas can be extended to an even wider class of objects. It is classical to define
the Fourier transform first for f ∈ L¹(R^d) and only later to extend it to all of L² using the
fact that L¹ ∩ L² is dense in L². By a similar density argument, one obtains

Proposition 1 There is a linear transform R: L²(R^d) → L²(Γ, μ(dγ)) which is an L²-
isometry and whose restriction to L¹ ∩ L² satisfies

    R(f)(γ) = ⟨f, ψ_γ⟩.
For this extension, a generalization of the Parseval relation (1.7) holds.

Proposition 2 (Extended Parseval) For all f, g ∈ L²(R^d),

    ⟨f, g⟩ = c_ψ ∫ R(f)(γ) \overline{R(g)(γ)} μ(dγ).    (2.4)

Proof of Proposition 2. Notice that one needs only to prove the property for a dense
subspace of L²(R^d), i.e., L¹ ∩ L²(R^d). So let f, g ∈ L¹ ∩ L²; we can write

    ∫ R(f)(γ) \overline{R(g)(γ)} μ(dγ) = ∫∫ ⟨ψ_a ⋆ R_u f, ψ_a ⋆ R_u g⟩ σ_d (da/a^{d+1}) du = I.

Applying Plancherel,

    I = (1/2π) ∫∫ ⟨ψ̂_a R̂_u f, ψ̂_a R̂_u g⟩ σ_d (da/a^{d+1}) du
      = (1/2π) ∫∫∫ f̂(λu) \overline{ĝ(λu)} a |ψ̂(aλ)|² (da/a^{d+1}) σ_d du dλ.
is meaningful, since the coefficients ⟨f, ψ_γ⟩, γ ∈ Γ, are uniformly bounded over Γ_ε
(by a bound of the form ‖f‖₁ ε^{−1/2} ‖ψ‖_∞). The next theorem makes
more precise the meaning of the reproducing formula.
Theorem 3 Suppose f ∈ L¹ ∩ L²(R^d) and ψ admissible.
(1) f_ε ∈ L²(R^d), and
Step 1. We start by proving that f_ε ∈ L²(R^d). Notice that R_u(f ⋆ φ) = R_u f ⋆ R_u φ and
R_u φ(t) = (2π)^{−1/2} exp{−t²/2}. Now F(R_u f ⋆ R_u φ)(λ) = f̂(λu) exp{−λ²/2}. Repeating the
argument in the proof of Theorem 1, this allows the interpretation of f_ε as the "conjugate" Fourier
transform of an L² element and therefore the conclusion f_ε ∈ L²(R^d).
Step 2. We aim to prove that f_{ε,δ} → f_ε pointwise and in L²(R^d). The dominated conver-
gence theorem gives the pointwise limit. Then, by the Fourier transform isometry, we have
f_{ε,δ} → (2π)^{−d} \bar{F}(c_ε f̂) in L²(R^d). It remains to be proved that this limit, which we will
abbreviate g_ε, is indeed f_ε:

    |f_{ε,δ}(x) − f_ε(x)| = c_ψ |∫_{Γ_ε} (⟨f ⋆ φ_δ, ψ_γ⟩ − ⟨f, ψ_γ⟩) ψ_γ(x) μ(dγ)|
      ≤ c_ψ sup_x |ψ(x)| ∫_ε^{ε^{−1}} ∫_{S^{d−1}} ‖ψ_a ⋆ (R_u f ⋆ R_u φ_δ − R_u f)‖_∞ σ_d (da/a^{d+1}) du
      ≤ c_ψ ∫_ε^{ε^{−1}} ∫_{S^{d−1}} ‖ψ_a‖_∞ ‖R_u f ⋆ R_u φ_δ − R_u f‖₁ σ_d (da/a^{d+1}) du
      ≤ C(ε) ‖ψ‖₁ ‖ψ‖_∞ ∫_{S^{d−1}} ‖R_u f ⋆ R_u φ_δ − R_u f‖₁ σ_d du,

since ‖ψ_a‖_∞ = a^{−1/2} ‖ψ‖_∞ and ∫_ε^{ε^{−1}} a^{−1/2} da/a^{d+1} < ∞. From this bound, we obtain ‖f_{ε,δ} −
f_ε‖_∞ → 0 as δ → 0. Note that the convergence is in C(R^d) as the functions are continuous.
Finally, we get f_ε = g_ε and, therefore, f_ε is in L²(R^d) by completeness.
To show that kf" ;f k2 ! 0 as " ! 0, it is necessary and su cient to show that kfb" ;fbk2 ! 0,
Z
kf" ; f k2 = jfb(k)j2 (1 ; c"(kkk)2 dk:
b b 2
Recalling that 0 c" 1 and that c" " 1 as " ! 0, the convergence follows.
and the sense in which the above equation holds. Now, one can obtain a semi-continuous
version of (2.5) by replacing the continuous scale by a dyadic lattice. The motivation for
doing so will appear in the later chapters. Let us choose ψ such that

    Σ_{j∈Z} |ψ̂(2^{−j}ξ)|² / |2^{−j}ξ|^{d−1} = 1.    (2.6)
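Condition (2.6) is easy to check numerically for a concrete, if non-smooth, choice: take a triangular partition of unity w in the log₂-frequency scale and set |ψ̂(ξ)|² = |ξ|^{d−1} w(ξ). The window w and the grid below are our own stand-ins for a properly smooth profile and serve only as an illustration.

```python
import numpy as np

d = 3

def w(xi):
    """Triangular bump in the log2 scale: sum_j w(2**-j * xi) = 1 for xi != 0,
    because the hats centered at integer log2-frequencies partition unity."""
    return np.maximum(0.0, 1.0 - np.abs(np.log2(np.abs(xi))))

def psi_hat_sq(xi):
    # |psi_hat(xi)|^2 = |xi|**(d-1) * w(xi), so each term of (2.6)
    # reduces to w(2**-j * xi)
    return np.abs(xi) ** (d - 1) * w(xi)

xi = np.geomspace(0.1, 10.0, 50)            # test frequencies
total = sum(psi_hat_sq(2.0 ** -j * xi) / np.abs(2.0 ** -j * xi) ** (d - 1)
            for j in range(-10, 11))        # partial sum over dyadic scales
```

The |ξ|^{d−1} prefactor is exactly the device used in the text: it converts any square partition of unity on the dyadic scales into a profile satisfying (2.6).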
Of course, this condition greatly resembles the admissibility condition (2.1) introduced
earlier. If one is given a function ζ such that

    Σ_{j∈Z} |ζ̂(2^{−j}ξ)|² = 1,

it is immediate to see that ψ defined by ψ̂(ξ) = |ξ|^{(d−1)/2} ζ̂(ξ) will verify (2.6). Now, using
the same argument as for Theorems 1 and 3, the property (2.6) implies

    f ≐ Σ_{j∈Z} 2^{j(d−1)} ∫∫ ⟨f(x), 2^j ψ(2^j(u·x − b))⟩ 2^j ψ(2^j(u·x − b)) du db,

where again if f ∈ S(R^d), the equality holds in a pointwise way and, more generally, if
f ∈ L¹ ∩ L²(R^d), the partial sums of the right-hand side are square integrable and converge
to f in L². Finally, as in wavelet theory, it will be rather useful to introduce some special
coarse-scale ridgelets. We choose a profile φ so that
    |φ̂(ξ)|² / |ξ|^{d−1} = Σ_{j<0} |ψ̂(2^{−j}ξ)|² / |2^{−j}ξ|^{d−1}.    (2.7)

Notice, the above equality implies |φ̂(ξ)|² ≲ |ξ|^{d−1}, which is very much unlike Littlewood-
Paley or wavelet theory: our coarse-scale ridgelets are also oscillating, since φ̂ must have
some decay near the origin; that is, φ itself must have some vanishing moments. (In fact,
φ is "almost" an admissible neural activation function: compare with (2.1).)
For a pair (φ, ψ) satisfying (2.7), we have the following semi-continuous reproducing
formula:

    f ≐ ∫∫ ⟨f(x), φ(u·x − b)⟩ φ(u·x − b) du db
        + Σ_{j≥0} 2^{j(d−1)} ∫∫ ⟨f(x), ψ_j(u·x − b)⟩ ψ_j(u·x − b) du db,    (2.8)

where, as in Littlewood-Paley theory, ψ_j stands for 2^j ψ(2^j ·). At this point, the reader knows
in which sense (2.8) must be interpreted.
Chapter 3
Discrete Ridgelet Transforms:
Frames
The previous chapter described a class of neurons, the ridgelets {ψ_γ}_{γ∈Γ}, such that
(i) any function f can be reconstructed from the continuous collection of its coefficients
⟨f, ψ_γ⟩, and
(ii) any function can be decomposed into a continuous superposition of neurons ψ_γ.
The purpose of this chapter is to achieve similar properties using only a discrete set of
neurons Γ_d ⊂ Γ.
3.2 Discretization of Γ
The special geometry of ridgelets imposes differences between the organization of ridgelet
coefficients and the organization of traditional wavelet coefficients.
With a slight change of notation, we recall that ψ_γ = a^{1/2} ψ(a(u·x − b)). We are looking
for a countable set Γ_d and some conditions on ψ such that the quasi-Parseval relation (1.9)
holds. Let R(f)(γ) = ⟨f, ψ_γ⟩; then R(f)(γ) = ⟨R_u f, ψ_{a,b}⟩ with ψ_{a,b}(t) = a^{1/2} ψ(a(t − b)).
Thus, the information provided by a ridgelet coefficient R(f)(γ) is the one-dimensional
wavelet coefficient of R_u f, the Radon transform of f. Applying Plancherel, R(f)(γ) may
be expressed as

    R(f)(γ) = (1/2π) ⟨R̂_u f, ψ̂_{a,b}⟩ = (a^{−1/2}/2π) ∫ f̂(ξu) \overline{ψ̂(ξ/a)} exp{ibξ} dξ,    (3.3)

which corresponds to a one-dimensional integral in the frequency domain (see Figure 3.1).
In fact, it is the line integral of f̂ \overline{ψ̂_{a,0}}, modulated by exp{ibξ}, along the line {tu : t ∈
R}. If, as in the Littlewood-Paley theory (Frazier, Jawerth, and Weiss, 1991), a = 2^j and
supp(ψ̂) ⊂ [1/2, 2], it emphasizes a certain dyadic segment {ξ : 2^j ≤ ξ ≤ 2^{j+1}}. In contrast,
in the multidimensional wavelet case, where the wavelet ψ_{a,b} = a^{−d/2} ψ((x − b)/a) with a > 0 and
b ∈ R^d, the analogous inner product ⟨f, ψ_{a,b}⟩ corresponds to the average of f̂ \overline{ψ̂_a} over the
whole frequency domain, emphasizing the dyadic corona {ξ : 2^j ≤ |ξ| ≤ 2^{j+1}}.
[Figure 3.1: Ridgelet discretization of the frequency plane. Dyadic segments at radii 2^j,
2^{j+1}, 2^{j+2} correspond to different coefficient functionals; there are more segments at finer
scales.]
Now, the underlying object f̂ must certainly satisfy specific smoothness conditions in
order for its integrals on dyadic segments to make sense. Equivalently, in the original domain,
f must decay sufficiently rapidly at ∞. In this chapter, we take for our decay condition that
f be compactly supported, so that f̂ is band-limited. From now on, we will only consider
functions supported on the unit cube Q = {x ∈ R^d : ‖x‖_∞ ≤ 1} with ‖x‖_∞ = max_i |x_i|; thus
H = L²(Q).
Guided by the Littlewood-Paley theory, we choose to discretize the scale parameter a as
{a_0^j}_{j≥j_0} (a_0 > 1, j_0 being the coarsest scale) and the location parameter b as {k b_0 a_0^{−j}}_{k∈Z, j≥j_0}.
Our discretization of the sphere will also depend on the scale: the finer the scale, the finer
the sampling over S^{d−1}. At scale a_0^j, our discretization of the sphere, denoted Σ_j, is an
ε_j-net of S^{d−1} with ε_j = ε_0 a_0^{−(j−j_0)} for some ε_0 > 0. We assume that for any j ≥ j_0, the
sets Σ_j satisfy the following Equidistribution Property: two constants k_d, K_d > 0 must exist
s.t. for any u ∈ S^{d−1} and r such that ε_j ≤ r ≤ 2,

    k_d (r/ε_j)^{d−1} ≤ |{B_u(r) ∩ Σ_j}| ≤ K_d (r/ε_j)^{d−1}.    (3.4)

On the other hand, if r ≤ ε_j, then from B_u(r) ⊂ B_u(ε_j) and the above display, |{B_u(r) ∩
Σ_j}| ≤ K_d. Furthermore, the number of points N_j satisfies k_d ε_j^{−(d−1)} ≤ N_j ≤ K_d ε_j^{−(d−1)}.
Essentially, our condition guarantees that Σ_j is a collection of N_j almost equispaced points
on the sphere S^{d−1}, N_j being of order a_0^{(j−j_0)(d−1)}. The discrete collection of ridgelets is
then given by

    ψ_γ(x) = a_0^{j/2} ψ(a_0^j u·x − k b_0),  γ ∈ Γ_d = {(a_0^j, u, k b_0 a_0^{−j}) : j ≥ j_0, u ∈ Σ_j, k ∈ Z}.    (3.5)
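For d = 2 the discretization of the sphere is just a set of roughly equispaced points on the circle, and the cap-counting property (3.4) can be checked directly. The helper names and parameter values below are our own illustrative choices.

```python
import numpy as np

def sphere_net(j, j0=0, a0=2.0, eps0=0.5):
    """epsilon_j-net of the circle S^1 (d = 2): N_j roughly equispaced points
    with spacing ~ eps_j = eps0 * a0**-(j - j0); N_j grows like a0**(j - j0),
    i.e. the sampling of directions refines with the scale."""
    eps_j = eps0 * a0 ** -(j - j0)
    N_j = int(np.ceil(2 * np.pi / eps_j))
    theta = 2 * np.pi * np.arange(N_j) / N_j
    return np.column_stack([np.cos(theta), np.sin(theta)])

def cap_count(net, u, r):
    """Number of net points in the cap B_u(r) = {v : ||v - u|| <= r}."""
    return int(np.sum(np.linalg.norm(net - u, axis=1) <= r))

# the count in a cap of radius r scales like r / eps_j, as in (3.4) with d = 2
net = sphere_net(j=4)                    # eps_4 = 1/32, so N_4 = 202 points
u = net[0]
c1 = cap_count(net, u, 0.1)
c2 = cap_count(net, u, 0.4)
```

Doubling the scale index roughly doubles N_j here (d − 1 = 1); in dimension d the same construction multiplies the number of directions by a_0^{d−1} per scale, which is the directional refinement the text identifies as the new element.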
In our construction, the coarsest scale is determined by the dimension of the space R^d.
Defining ℓ as sup{2^{−k} : k ∈ N and 2^{−k} < (log 2)/(2d)}, we choose j_0 s.t. a_0^{j_0+1} ≤ ℓ < a_0^{j_0+2}. Finally,
we will set ε_0 = 1/2 so that ε_j = a_0^{−(j−j_0)}/2.
Remark. Here, we want to be as general as possible, and that is the reason why we do
not restrict the choice of a_0. However, in Littlewood-Paley or wavelet theory, a standard
choice corresponds to a_0 = 2 (dyadic frames). Likewise, and although we will prove that
there are frames for any choice of a_0, we will always take a_0 = 2 in the analysis we develop
in the forthcoming chapters.
Up to the normalizing factor (2π)^{-d} in each term, the key estimate is

    | Σ_{γ∈Γ_d} |⟨f, ψ_γ⟩|² - b_0^{-1} Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_R |f̂(ξu)|² |ψ̂(a_0^{-j}ξ)|² dξ |
      ≤ b_0 √( ∫_R Σ_{j≥j_0} Σ_{u∈Σ_j} |f̂(ξu)|² |ψ̂(a_0^{-j}ξ)|² dξ ) √( ∫_R Σ_{j≥j_0} Σ_{u∈Σ_j} |f̂(ξu)|² |a_0^{-j}ξ|² |ψ̂′(a_0^{-j}ξ)|² dξ ).    (3.7)
The argument is a simple application of the analytic principle of the large sieve (Mont-
gomery, 1978). Note that it presents an alternative to Daubechies' proof for one-dimensional
dyadic affine frames (Daubechies, 1992). We first recall an elementary lemma that we state
without proof.
Lemma 2 Let g be a real-valued function in C¹[0, δ] for some δ > 0; then

    |g(δ/2) - (1/δ) ∫_0^δ g(x) dx| ≤ (1/2) ∫_0^δ |g′(x)| dx.
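Lemma 2 is the familiar bound on the gap between a midpoint value and an average. A quick numerical sanity check, with g = sin chosen purely for illustration:

```python
import math

def midpoint_gap(g, dg, delta, n=20000):
    """Return (|g(delta/2) - average of g on [0, delta]|,
    0.5 * integral of |g'| on [0, delta]), the two sides of Lemma 2,
    approximated by midpoint Riemann sums."""
    xs = [(i + 0.5) * delta / n for i in range(n)]
    avg = sum(g(x) for x in xs) / n
    lhs = abs(g(delta / 2) - avg)
    rhs = 0.5 * sum(abs(dg(x)) for x in xs) * delta / n
    return lhs, rhs

lhs, rhs = midpoint_gap(math.sin, math.cos, delta=1.0)
# the midpoint value sin(1/2) differs from the mean 1 - cos(1) by far
# less than half the total variation of sin on [0, 1]
```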
Again, let ψ_j(x) be a_0^{j/2} ψ(a_0^j x). The ridgelet coefficient is then ⟨f, ψ_γ⟩ = (R_u f ⋆ ψ_j)(k b_0 a_0^{-j}).
Writing F_j = R_u f ⋆ ψ_j and applying Lemma 2 on each interval of length b_0 a_0^{-j},

    |F_j(k b_0 a_0^{-j}) - (a_0^j/b_0) ∫_{(k-1/2)b_0 a_0^{-j}}^{(k+1/2)b_0 a_0^{-j}} F_j(b) db| ≤ (1/2) ∫_{(k-1/2)b_0 a_0^{-j}}^{(k+1/2)b_0 a_0^{-j}} |F_j′(b)| db.

Hence, if we sum the above expression over u ∈ Σ_j and j and apply the Cauchy-Schwarz
inequality to the right-hand side, we get the desired result.
We then show that there exist A′, B′ > 0 s.t. for any f ∈ L²(Q) we have

    A′ ‖f̂‖₂² ≤ Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_{-∞}^{∞} |f̂(ξu)|² |ψ̂(a_0^{-j}ξ)|² dξ ≤ B′ ‖f̂‖₂²,    (3.8)

and that a similar upper bound, say B″‖f̂‖₂², holds for
Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_{-∞}^{∞} |f̂(ξu)|² |a_0^{-j}ξ|² |ψ̂′(a_0^{-j}ξ)|² dξ.
Lemma 3 Let F ∈ B²_π(R^d) (the space of L² functions of exponential type π) and let
{λ_m}_{m∈Z^d} be a sequence of points of R^d such that sup_{m∈Z^d} ‖λ_m - m‖ ≤ ℓ < (log 2)/(πd); then

    (1 - δ_ℓ)² π^{-d} ‖F‖₂² ≤ Σ_{m∈Z^d} |F(λ_m)|² ≤ (1 + δ_ℓ)² π^{-d} ‖F‖₂²,    (3.11)

where δ_ℓ = e^{πℓd} - 1.

Proof of Lemma 3. The Plancherel-Pólya theorem (see Plancherel and Pólya, 1938, page
116) gives

    Σ_{m∈Z^d} |F(m)|² = π^{-d} ‖F‖₂².
Let k denote the usual multi-index (k₁, …, k_d) and let |k| = k₁ + ⋯ + k_d, k! = k₁! ⋯ k_d!
and x^k = x₁^{k₁} ⋯ x_d^{k_d}. For any k, ∂^k F is an entire function of type π. Moreover, Bernstein's
inequality gives ‖∂^k F‖₂ ≤ π^{|k|} ‖F‖₂; see Boas (1952, page 211) for a proof. Since F is an entire
function of exponential type, F is equal to its absolutely convergent Taylor expansion.
Letting s be a constant to be specified below, we have

    F(λ_m) - F(m) = Σ_{|k|≥1} (∂^k F(m)/k!) (λ_m - m)^k.

By the Cauchy-Schwarz inequality, splitting each term with weights s^{|k|},

    |F(λ_m) - F(m)|² ≤ ( Σ_{|k|≥1} |∂^k F(m)|² s^{2|k|}/k! ) ( Σ_{|k|≥1} ℓ^{2|k|}/(k! s^{2|k|}) ).

Summing over m, and applying the Plancherel-Pólya identity to each ∂^k F together with
Bernstein's inequality, gives

    Σ_{m∈Z^d} |F(λ_m) - F(m)|² ≤ π^{-d} ‖F‖₂² (e^{dπ²s²} - 1)(e^{dℓ²/s²} - 1).

We choose s² = ℓ/π. If δ_ℓ = e^{πℓd} - 1 < 1 then

    Σ_{m∈Z^d} |F(λ_m) - F(m)|² ≤ δ_ℓ² π^{-d} ‖F‖₂²,

and, by the triangle inequality, the expected result follows.
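The stability asserted by Lemma 3 can be illustrated numerically in dimension d = 1 with the band-limited function F(x) = sin(πx)/(πx), for which the sum of |F(m)|² over the integers equals 1. The jitter below is an arbitrary choice; the sketch only checks the frame-style stability of the perturbed sampling sums, not the exact constants.

```python
import math, random

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

# F(x) = sinc(x) is of exponential type pi with sum_m |F(m)|^2 = 1.
# Perturb each integer sample point by at most ell, in the spirit of
# Lemma 3 (d = 1, delta_ell = e^{pi * ell} - 1).
random.seed(0)
ell = 0.1
delta_ell = math.exp(math.pi * ell) - 1
S0 = sum(sinc(m) ** 2 for m in range(-200, 201))
S1 = sum(sinc(m + random.uniform(-ell, ell)) ** 2 for m in range(-200, 201))
# the perturbed sum stays within the (1 +- delta_ell) band
stable = abs(math.sqrt(S1) - math.sqrt(S0)) <= delta_ell * math.sqrt(S0)
```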
A measure μ on R^d will be called ℓ-uniform if μ(Q_{z_m}(ℓ)) = (2ℓ)^d for each m, where
Q_{z_m}(ℓ) denotes the cube of sidelength 2ℓ centered at the grid point z_m. The following result
is completely equivalent to the previous theorem.

Corollary 1 Fix ℓ < (log 2)/(πd) with 1/(2ℓ) an integer. Let F ∈ B²_π(R^d) and let μ be an
ℓ-uniform measure. Then

    (1 - δ_ℓ)² π^{-d} ‖F‖₂² ≤ ∫ |F|² dμ ≤ (1 + δ_ℓ)² π^{-d} ‖F‖₂².
We have

    μ(Δ_x(ℓ, ℓ/2)) = Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_R |ψ̂(a_0^{-j}ξ)|² 1_{Δ_x(ℓ,ℓ/2)}(ξu) dξ.

For the scales j ≥ j_x, the equidistribution property (3.4) bounds the number of directions
of Σ_j falling in the sector, and a change of variables reduces the resulting sums to dilates
of ψ̂. Now, since by assumption ℓ ≤ λ, we have for all |ξ| ∈ [λ, λ + 2ℓ] that
ℓ a_0^{-(j_0+1)} ≤ a_0^{-j_x}|ξ| ≤ 2ℓ a_0^{-j_0}. We recall that ℓ a_0^{-(j_0+1)} ≤ 1. Therefore,

    μ(Δ_x(ℓ, ℓ/2)) ≥ k_d (a_0^{-j_0}ℓ)^{d-1} 2ℓ  inf_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ 2ℓ a_0^{-j_0}} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}
                  ≥ k_d (a_0^{-j_0}ℓ)^{d-1} 2ℓ  inf_{1 ≤ |ξ| ≤ a_0} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}.
Similarly, we have

    Σ_{j≥j_x} Σ_{u∈Σ_j} ∫_R |ψ̂(a_0^{-j}ξ)|² 1_{Δ_x(ℓ,ℓ/2)}(ξu) dξ
      ≤ K_d (a_0^{-j_0}ℓ)^{d-1} 2^{d-1} 2ℓ  sup_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ 2ℓ a_0^{-j_0}} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}
      ≤ K_d (a_0^{-j_0}ℓ)^{d-1} 2^{d-1} 2ℓ  sup_{1 ≤ |ξ| ≤ a_0} Σ_{j′∈Z} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}.
We finally consider the case of the j's s.t. j_0 ≤ j < j_x. We recall that in this case, we have
|{B_u(ℓ/2) ∩ Σ_j}| ≤ K_d, and thus

    Σ_{j_0≤j<j_x} Σ_{u∈Σ_j} ∫_R |ψ̂(a_0^{-j}ξ)|² 1_{Δ_x(ℓ,ℓ/2)}(ξu) dξ ≤ K_d Σ_{j_0≤j<j_x} ∫_{λ}^{λ+2ℓ} |ψ̂(a_0^{j_x-j} a_0^{-j_x} ξ)|² dξ
      ≤ K_d 2ℓ  sup_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ 2ℓ a_0^{-j_0}} Σ_{j′>0} |ψ̂(a_0^{j′}ξ)|²
      ≤ K_d 2ℓ  sup_{1 ≤ |ξ| ≤ a_0} Σ_{j′>0} |ψ̂(a_0^{j′}ξ)|².
Finally, we need to prove the result for the cube Q_0(ℓ). In order to do so, we need to
establish two last estimates:

    μ(B_0(ℓ)) = Σ_{j≥j_0} |Σ_j| ∫_{|ξ|≤ℓ} |ψ̂(a_0^{-j}ξ)|² dξ
      ≥ k_d Σ_{j≥j_0} a_0^{(j-j_0)(d-1)} ∫_{|ξ|≤ℓ} |ψ̂(a_0^{-j}ξ)|² dξ
      = k_d ∫_{|ξ|≤ℓ} |ξ|^{d-1} Σ_{j′≥0} |ψ̂(a_0^{-j′} a_0^{-j_0} ξ)|² / |a_0^{-j′} a_0^{-j_0} ξ|^{d-1} dξ
      ≥ k_d ∫_{ℓ/a_0 ≤ |ξ| ≤ ℓ} |ξ|^{d-1} Σ_{j′≥0} |ψ̂(a_0^{-j′} a_0^{-j_0} ξ)|² / |a_0^{-j′} a_0^{-j_0} ξ|^{d-1} dξ
      ≥ k_d 2ℓ(1 - 1/a_0) (ℓ a_0^{-(j_0+1)})^{d-1}  inf_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ ℓ a_0^{-j_0}} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}.

Again, let {x_j}_{1≤j≤J} with ‖x_j‖ = ℓ be s.t. Q_0(ℓ) ⊂ ∪_{1≤j≤J} Δ_{x_j}(ℓ, ℓ/(2‖x_j‖)) ∪ B_0(ℓ), and let
T_d′ be the minimum number of x_j's needed. We then have

    0 < μ(B_0(ℓ)) ≤ μ(Q_0(ℓ)) ≤ μ(B_0(ℓ)) + T_d′ sup_{‖x‖≥ℓ} μ(Δ_x(ℓ, ℓ/(2‖x‖))) < ∞.
3.6 Discussion
3.6.1 Coarse Scale Refinements
In Neural Networks, the goal is to synthesize or represent a function as a superposition of
neurons from the dictionary D_Ridge = {σ(k·x - b), k ∈ R^d, b ∈ R}, the activation function
σ being fixed. That is, all the elements of D_Ridge have the same profile σ. Likewise, as we
wanted to keep this property, there is a unique profile ψ for all the elements of our ridgelet
frame. However, it will be rather useful to introduce a different profile φ for the coarse-
scale elements. For instance, following section 2.3, let us consider a function φ satisfying
the following assumptions:

    φ̂(ξ)/|ξ|^{(d-1)/2} = O(1) and |φ̂(ξ)|/|ξ|^{(d-1)/2} ≥ c if |ξ| ≤ 1,
    φ̂(ξ) = O((1 + |ξ|)^{-2}).
(where again Σ_j is a set of "quasi-equidistant" points on the sphere, the resolution being
ε_0 2^{-j}) is a frame for L²(Q). The advantage of this description over the other (3.5) is the
fact that the coarse scale corresponds to j = 0 (and not to some awkward index j_0 which
depends on the dimension). In our applications, we shall generally use (3.13) for its greater
convenience. As we will see, in addition to the frameability condition, we often require φ and ψ
to have some regularity and a few vanishing moments.
We close this section by introducing a few notations that we will use throughout the
rest of the text. It will be helpful to have a notation for φ(u_i·x - k b_0); we allow this
abuse by saying that φ(u_i·x - k b_0) corresponds to the scale j = -1. For j ≥ -1 then,
denote by Γ_j the index set for the j-th scale,

    Γ_j = {(j, u_i^j, k) : u_i^j ∈ Σ_j, k ∈ Z}.    (3.14)
    {φ(x₁ cos θ_i + x₂ sin θ_i - k b_0),  2^{j/2} ψ(2^j(x₁ cos θ_i^j + x₂ sin θ_i^j) - k b_0) :
        j ≥ 0, θ_i^j = 2π ε_0 i 2^{-j}, k ∈ Z}    (3.15)

where at scale j the θ_i^j's are equispaced points on the circle, with step 2π ε_0 2^{-j} (take ε_0^{-1}
to be an integer).
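For concreteness, the angular grid of (3.15) can be generated as follows; the function name is ours and ε_0 = 1/2 is an assumed (legitimate) choice.

```python
import math

def angles_at_scale(j, eps0=0.5):
    """Equispaced directions theta_i^j = 2*pi*eps0*i*2**-j of the
    two-dimensional frame (3.15); there are 2**j / eps0 of them."""
    n = int(round(2 ** j / eps0))  # eps0^{-1} assumed to be an integer
    return [2 * math.pi * eps0 * i * 2 ** (-j) for i in range(n)]

counts = [len(angles_at_scale(j)) for j in range(5)]
# counts == [2, 4, 8, 16, 32]: the angular resolution doubles with the scale
```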
Proposition 5 Let (ψ_γ) be the frame defined by (3.15) and suppose that the pair (φ, ψ)
satisfies (2.7), say. Then,

    | Σ_γ |⟨f, ψ_γ⟩|² - (2π b_0 ε_0)^{-1} ‖f̂‖₂² | ≤ C (ε_0^{-1} + b_0^{-1}) ‖f̂‖₂²    (3.16)

where the constant C depends at most upon φ and ψ. It follows from (3.16) that the frame
bounds ratio deviates from 1 by at most 2C(ε_0 + b_0) for the same constant C.
The result links the decay of the frame bounds ratio to the oversampling factor (ε_0 b_0)^{-1}.
The proof of the proposition may be found in the Appendix.
The oversampling. The redundancy of the frame that one can construct by this
strategy depends heavily on the quality of the underlying "quasi-uniform" sampling of the
sphere at each scale j. The construction of quasi-uniform discrete point sets on spheres
has received considerable attention in the literature; see Conway and Sloane (1988) and
additional references given in the bibliography. Quantitative improvements of our results
would follow from applying some of the known results obtained in that field.

Computations. Another area for investigation has to do with rapid calculation of
groups of coefficients. Note that if the sets Σ_j for j ≥ j_0 present some symmetries, it
may not be necessary to compute ψ̃_γ for all γ ∈ Γ_d; many dual elements would simply
be translations, rotations and rescalings of each other. This type of relationship would be
important to pursue for practical applications.
Taking the coarse scale into account amounts to adding (in the previous display) |f̂(ξu)|² |φ̂(ξ)|²
to the quantity Σ_{j≥0} Σ_{u∈Σ_j} |2^{-j}ξ|^{2s} |f̂(ξu)|² |ψ̂(2^{-j}ξ)|² at each of its occurrences. Now, the
rest of the proof is absolutely identical to the one of Theorem 4. Our irregular sampling result
applies and gives the desired conclusion provided that ψ̂(ξ)/|ξ|^s satisfies the conditions of Definition
3.

Remark. Anticipating the next chapter, suppose for instance that φ and ψ satisfy the
conditions listed at the very beginning of Chapter 4, section 4.1; then our constructed L²
frame (Theorem 4) is also a frame for every H_0^s with s ≥ 0.
    ‖f - f_N‖₂ ≤ C N^{-r}.    (3.18)
CHAPTER 4. RIDGELET SPACES 40
These conditions are standard in Littlewood-Paley theory except for the first part of (iv).
Instead of requiring |φ̂| to be 1 in a neighborhood of the origin, we want |φ̂(ξ)|/|ξ|^{(d-1)/2} to
be 1 near the origin. As we now recall, the reason for this modification originates in section
1.3. At times, we will choose pairs (φ, ψ) satisfying the additional condition (2.7), namely

    (v)  |φ̂(ξ)|²/|ξ|^{d-1} + Σ_{j≥0} |ψ̂(2^{-j}ξ)|²/|2^{-j}ξ|^{d-1} = 1.
Recall that (v) allows us to represent any function in L¹ ∩ L²(R^d) as a semi-continuous super-
position of ridgelets (2.8), i.e.

    f = ∫∫ ⟨f(x), φ(u·x - b)⟩ φ(u·x - b) du db + Σ_{j≥0} 2^{j(d-1)} ∫∫ ⟨f(x), ψ_j(u·x - b)⟩ ψ_j(u·x - b) du db    (4.1)

where again the notation ψ_j stands for 2^j ψ(2^j ·).

Keeping in mind that R_u f denotes the Radon transform of f defined by

    R_u f(t) = ∫_{u·x=t} f(x) dx,

we now turn to the main definition of this section.
Definition 4 For s ≥ 0 and p, q > 0, we say that f ∈ R^s_{pq} if f ∈ L¹ and

    Ave_u ‖R_u f ⋆ φ‖_p < ∞  and  ( 2^{js} 2^{j(d-1)/2} (Ave_u ‖R_u f ⋆ ψ_j‖_p^p)^{1/p} )_{j≥0} ∈ ℓ^q(N).    (4.2)
Note that we have defined our space as a subset of L¹. The major reason is that we truly
understand the ridgelet transform for elements of L¹; it is beyond the scope of this thesis
to extend the transform to distributions. In our definition, it would be possible to drop the
membership in L¹, but the discussion that follows would have been more complicated and
even more technical without shedding further light on our business.
Next, we define

    ‖f‖_{R^s_{pq}} = Ave_u ‖R_u f ⋆ φ‖_p + { Σ_{j≥0} [ 2^{js} 2^{j(d-1)/2} (Ave_u ‖R_u f ⋆ ψ_j‖_p^p)^{1/p} ]^q }^{1/q}    (4.3)

and we will prove later that this is a norm for a suitable range of parameters s, p and q.
There is a homogeneous version of our definition where the norm (4.3) is replaced by

    ‖f‖_{Ṙ^s_{pq}} = { Σ_{j∈Z} [ 2^{js} 2^{j(d-1)/2} (Ave_u ‖R_u f ⋆ ψ_j‖_p^p)^{1/p} ]^q }^{1/q}.    (4.4)

To be consistent with the literature of Functional Analysis, we will call these classes Ṙ^s_{pq}.
Remark 1. It is not hard to show that the definition is independent of the choice of the
pair (φ, ψ) provided it satisfies (i)-(iv). Furthermore, it is not necessary to assume (ii)-(iii),
i.e., to assume that φ̂ and ψ̂ are compactly supported. As we will show later, one can
get equivalent definitions to (4.2)-(4.3) under the relaxed assumption that ψ has a sufficient
number of vanishing moments.
Remark 2. Like the Sobolev scales (respectively Besov scales), where a norm is defined
based on the Fourier transform (respectively wavelet transform), what we have done here is
merely to define a norm based on the ridgelet transform. Let the ridgelet coefficients be
defined by w_j(u, b)(f) = ⟨f(x), ψ_j(u·x - b)⟩ = R_u f ⋆ ψ_j(b) for j ≥ 0 (and v(u, b)(f) = ⟨f(x), φ(u·x - b)⟩).
The quantity ‖f‖_{R^s_{pq}} may simply be rewritten as

    ‖f‖_{R^s_{pq}} = ( ∫∫ |v(u,b)(f)|^p du db )^{1/p} + { Σ_{j≥0} [ 2^{js} 2^{j(d-1)/2} ( ∫∫ |w_j(u,b)(f)|^p du db )^{1/p} ]^q }^{1/q},

and

    ‖f‖^p_{R^s_{pp}} ≍ Ave_u ‖R_u f‖^p_{B^{s+(d-1)/2}_{pp}}.    (4.5)
We explain further what we mean by (4.5). In fact, one can easily be convinced that the
R^s_{pq} norm (4.3) dominates its homogeneous version (4.4). Moreover, it is clear that

    Ave_u ‖R_u f‖^p_{Ḃ^{s+(d-1)/2}_{pp}} ≤ C ‖f‖^p_{R^s_{pp}}

(Ḃ^{s+(d-1)/2}_{pp} is the corresponding homogeneous Besov space). On the other hand, it is trivial to see that

    ‖f‖^p_{R^s_{pp}} ≤ C Ave_u ‖R_u f‖^p_{B^{s+(d-1)/2}_{pp}}.
That is the sense of our sloppy equivalence (4.5). In this thesis, we are mainly concerned
with the representation, the approximation or the estimation of functions of d variables that
are compactly supported, say, in the unit ball 𝔹_d. We should note that in this case the
distinction we have just made is irrelevant, as both norms (the homogeneous one and the
non-homogeneous one) are equivalent and, therefore, (4.5) holds in the usual sense of norm
equivalence.

Along the same lines, one could quickly remark that the space R^s_{22}(R^d)
is, indeed, the classical Sobolev space H^s(R^d) = B^s_{22}(R^d), since it easily follows from our
definition that ‖f‖_{R^s_{22}(R^d)} is an equivalent norm for H^s. One could go on at length about the
properties of the new spaces R^s_{pq}, investigating for instance embedding relationships or
interpolation properties, etc. However, the goal of this chapter is to show that the spaces
R^s_{pq} characterize objects with a special kind of inhomogeneities rather than to explore these
issues.
Our characterization highlights an interesting aspect: the condition does not require
any particular smoothness of the Radon transform along the directional variable u.
Example. We now consider a simple example to illustrate the difference between
traditional measures of smoothness and our new criteria. Let f(x) = 1_{x₁>0} (2π)^{-d/2} e^{-‖x‖²/2}.
From a classical point of view, this function has barely one derivative (the first derivative
is a singular measure). To quantify this idea, one can show that f ∈ B^s_{11} if, and only if,
s < 1. On the other hand, this object is quite smooth with regard to (4.3). In fact, f ∈ R^s_{11}
as long as s < 1 + (d - 1)/2. We now give a detailed proof of this claim, for we will make
extensive use of some details appearing in the argument.
Proposition 6

    f(x) = 1_{x₁>0} (2π)^{-d/2} e^{-‖x‖²/2} ∈ R^s_{11}  iff  s < 1 + (d - 1)/2.

Proof of Proposition. Using the external characterization that we developed (Remark
4) and writing σ = s + (d - 1)/2, it is sufficient to show that

    σ < d ⟹ Ave_u ‖R_u f‖_{B^σ_{11}} < ∞  and  σ ≥ d ⟹ Ave_u ‖R_u f‖_{Ḃ^σ_{11}} = ∞.
Let u be a unit vector. Next, let Q be an orthogonal matrix whose first two columns are u
and u′, where u′ is a unit vector, orthogonal to u and belonging to the span of {e₁, u} (e₁
being the first element of the canonical basis of R^d). We will choose u′ so that u′·e₁ ≥ 0.
Now, consider the change of coordinates defined by x′ = Q^T x. We then rewrite the Radon
transform as

    R_u f(t) = ∫_{u·x=t} f(x) dx = ∫_{x′₁=t} f(Qx′) dx′.

To simplify the notations, let γ denote the Gaussian density (γ(t) = (2π)^{-1/2} e^{-t²/2}) and Φ
be the cumulative distribution function, i.e. Φ(t) = ∫_{-∞}^t γ(t′) dt′. With these notations, we
have

    R_u f(t) = γ(t) Φ( (u·e₁) t / √(1 - (u·e₁)²) ).
(In case u = e₁, the RHS is replaced by γ(t) 1_{t>0}.) We now use the following lemma.
Lemma 5 Let u be uniformly distributed on the unit sphere. Then the density of u1 = u e1
is given by
Now, the following simple proposition will serve as a preliminary for a subsequent result.

Proposition 8 The space R^s_{pq}(Ω) equipped with the norm

(infimum taken over all g ∈ R^s_{pq}(R^d) in the sense of (4.7)) is a Banach space.

Proof of Proposition 8. It is clear that R^s_{pq}(Ω) is a linear space and that ‖f‖_{R^s_{pq}(Ω)} is a
norm. We need to check completeness. Let {f_n}_{n≥1} be a Cauchy sequence and assume
without loss of generality that

Then we have

    g = lim_{n→∞} g_n = Σ_{n=1}^∞ (g_{n+1} - g_n)

with
Remark. Let Ω₁ be a bounded domain such that Ω̄ ⊂ Ω₁. Suppose moreover that both
Ω and Ω₁ are nice in the sense that, say, their boundary is C^∞. In our definition (4.7)-(4.8),
one can restrict g ∈ R^s_{pq}(R^d) to be such that

Doing so, we obtain the same space R^s_{pq}(Ω) and an equivalent quasi-norm. The reason is
that the multiplication of an element g ∈ R^s_{pq}(R^d) by a fixed window w ∈ C_0^∞ is a bounded
operation from R^s_{pq}(R^d) to itself. We will prove this later. In the application we have in
mind, we shall consider the unit ball 𝔹_d as our domain Ω and 2𝔹_d as Ω₁.
In the previous section, we introduced a new scale of functional classes and presented some
basic properties. In this section, we will give a more intuitive characterization of these
spaces: we show how they model objects having a very special kind of singularities. To
describe these singularities, we introduce a definition due to Donoho (1993).

Definition 6 Let α > -1/2. We say that σ : R → R is a normalized singularity of degree
α if σ(t) is C^R on R∖{0}, where R = max(2, 2 + ⌈α⌉), and

    |σ(t)| ≤ |t|^α for all t, and |d^m σ/dt^m (t)| ≤ (m + ⌈α⌉)! |t|^{α-m},  t ≠ 0,  m = 1, 2, …, R.
A normalized singularity is a C^R function away from the origin which may or may not be singular at the
origin. Following Donoho (1993), we list a few examples of normalized singularities: |t|^α,
|t|^α 1_{t>0}, |t|^α w(t) where w(t) is a "nice" smooth function properly normalized, etc.
Additional examples are given in the reference cited above.

We make use of the definition to construct a class of functions whose typical elements
are of the form σ(u·x - t).

Definition 7

    SH^α = { Σ_i a_i σ_i(u_i·x - t_i) : Σ_i |a_i| ≤ 1, each σ_i a normalized singularity of degree α }.    (4.9)
The proof of the theorem is fairly involved; it requires several steps and composes the
remainder of this section. At the heart of the argument is the atomic decomposition
of the space R^s_{11}.

Moreover, the result is sharp in the sense that for s < d/2, there are elements in R^s_{11}(𝔹_d)
that are not square integrable.

Proof of Lemma. It is sufficient to prove the desired norm inequality (4.11) for g ∈ R^s_{11}(R^d)
such that supp g ⊂ 2𝔹_d. Indeed, let f be in R^s_{11}(𝔹_d); then for any g ∈ R^s_{11}(R^d) such that
g|_{𝔹_d} = f and supp g ⊂ 2𝔹_d, we have

    ‖f‖₂ ≤ ‖g‖₂ ≤ C ‖g‖_{R^s_{11}(R^d)},

and taking the infimum over the g's would give the desired result.
Step 1. We first prove the result in the case where g ∈ L². We know that

Consequently,

    ‖g‖₂² ≤ C ∫ { ‖R_u g ⋆ φ‖₁ ‖R_u g ⋆ φ‖₂ + Σ_{j≥0} 2^{j(d-1)} 2^{j/2} ‖R_u g ⋆ ψ_j‖₁ ‖R_u g ⋆ ψ_j‖₂ } du.

But for any u ∈ S^{d-1}, we have ‖R_u g ⋆ φ‖₂ ≤ C‖g‖₂ and ‖R_u g ⋆ ψ_j‖₂ ≤ C‖g‖₂ for some
positive constant C; see for example Candès (1996). Thus the inequality becomes

    ‖g‖₂ ≤ C ∫ { ‖R_u g ⋆ φ‖₁ + Σ_{j≥0} 2^{j(d-1)} 2^{j/2} ‖R_u g ⋆ ψ_j‖₁ } du.

The last inequality holds for any g satisfying the conditions specified above and, therefore,
we can conclude to the validity of (4.11).
Step 2. We finish the proof by density. Suppose g is in R^s_{11}(R^d) and that supp g ⊂ 2𝔹_d;
we show that

So, let Δ be a non-negative, radial, infinitely differentiable and compactly supported func-
tion on R^d such that ∫Δ = 1. Further, let τ be its Radon transform (because of spherical
symmetry, the Radon transform does not depend on u). We define Δ_ε to be ε^{-d} Δ(·/ε) and
similarly τ_ε = ε^{-1} τ(·/ε). We first show that

We have R_u(g ⋆ Δ_ε) ⋆ ψ_j = R_u g ⋆ τ_ε ⋆ ψ_j (and similarly for the coarse scale, i.e. the convolution
with φ). Now, it is immediate to check that for any j, u and b, (R_u g ⋆ τ_ε ⋆ ψ_j)(b) →
(R_u g ⋆ ψ_j)(b) as ε → 0 (and similarly when ψ_j is replaced by φ). Further, we show that
lim_{ε→0} ‖g ⋆ Δ_ε‖_{R^s_{11}(R^d)} = ‖g‖_{R^s_{11}(R^d)}; in other words
    lim_{ε→0} Ave_u { ‖R_u g ⋆ τ_ε ⋆ φ‖₁ + Σ_{j≥0} 2^{js} 2^{j(d-1)/2} ‖R_u g ⋆ τ_ε ⋆ ψ_j‖₁ }
      = Ave_u { ‖R_u g ⋆ φ‖₁ + Σ_{j≥0} 2^{js} 2^{j(d-1)/2} ‖R_u g ⋆ ψ_j‖₁ }.    (4.14)
By Fatou's lemma, we get that lim inf_{ε→0} ‖g ⋆ Δ_ε‖_{R^s_{11}(R^d)} ≥ ‖g‖_{R^s_{11}(R^d)}. Conversely, since

    ‖g ⋆ Δ_ε‖₂ ≤ C ‖g ⋆ Δ_ε‖_{R^s_{11}},

we deduce that {g ⋆ Δ_ε} is a Cauchy sequence in L² and therefore converges in
L². On the other hand, it is trivial that g ⋆ Δ_ε converges to g in L¹. Then g ⋆ Δ_ε converges
to g in L² and, therefore, we have proved (4.12). The lemma follows.
This lemma has a rather useful corollary.
Corollary 3 The space R^s_{11}(𝔹_d) equipped with the norm

    ‖f‖_{R^s_{11}(𝔹_d)} = inf{ ‖g‖_{R^s_{11}(R^d)} : g ∈ R^s_{11}(R^d) with g|_{𝔹_d} = f }    (4.15)

is a Banach space.
Proof of Corollary. For compactly supported functions, the L² norm dominates the L¹ norm.
Now, from the previous lemma it is clear that the norm (4.15) and the one of Proposition 8
are equivalent. This proves the claim.

Actually, it is not hard to see that one can extend Lemma 6 and its corollary. For p ≤ 2,
we have R^s_{pq}(𝔹_d) ⊂ L²(𝔹_d) as long as s > 1/p - 1/2 (the argument being essentially the
same as the one spelled out in the proof of Lemma 6), with continuous injection. Moreover,
for p > 2, the same is true as long as, this time, s > 0. Although we do not prove these
results here, we will use them in Chapter 5.
Proof of Lemma. Let w be in C_0^∞(2𝔹_d) such that its restriction to the unit ball is 1
(w|_{𝔹_d} = 1). It is enough to show that, say, the set {w(x)φ(u·x - b), w(x) 2^{-js} 2^{-j(d-1)/2} ψ_j(u·x - b)}
is bounded in R^s_{11}(R^d). The analysis is greatly simplified if one replaces the window w by
the d-dimensional Gaussian density γ_d; in fact, it is sufficient to prove the lemma with the
Gaussian window. Again, the reason is that the multiplication of an element of R^s_{11}(R^d)
by a fixed C_0^∞ function is a bounded operation.

Lemma 8 Suppose that f ∈ R^s_{11} and let ω be a C^∞ function which is bounded together with its
derivatives. Then there exists a constant c such that

We postpone the proof of this intuitive lemma. As we shall see in the next chapter, we can
prove Lemma 7 directly without the help of the intermediate Lemma 8.
End of proof of Lemma 7. The norm ‖γ_d(x) ψ_j(u·x - b)‖_{R^s_{11}} being clearly invariant under
rotation, we will without loss of generality assume that u = e₁. Finally, g will denote
R_{u′}{γ_d(·) ψ_j(u·x - b)}. Following the discussion presented in the last chapter, it is sufficient to show that
there exists a positive constant C such that

In the remainder of the proof, σ will stand for the quantity s + (d - 1)/2. Reproducing the
calculations from the last chapter, we rapidly get

    g(t) = γ(t) ∫ γ(y) 2^j ψ(2^j(u₁ t - √(1 - u₁²) y - b)) dy.

We recall that v denotes the ratio u₁/√(1 - u₁²), so that √(1 - u₁²) = (1 + v²)^{-1/2}.

Case 1. a = 2^j(1 + v²)^{-1/2} ≤ 1. Then g can be written as a convolution, and Young's
inequality gives

    ‖f ⋆ g‖_{B^σ_{11}} ≤ ‖f‖_{B^σ_{11}} ‖g‖₁.
This yields ‖g‖_{B^σ_{11}} ≤ C ‖ψ‖_{B^σ_{11}} ≤ C, and ‖g‖_{B^σ_{11}} ≤ C (1 + v²)^{1/2} (1 + v^{-1}).
We have

    |g^{(k)}(t)| ≤ C ( ∫_{|t-y|>|t|/2} γ(y)(1 + |t - y|)^{-m} dy + ∫_{|t-y|≤|t|/2} γ(y)(1 + |t - y|)^{-m} dy )
      ≤ C ( ∫_{|t-y|>|t|/2} γ(y)(1 + |t|/2)^{-m} dy + ∫_{|t-y|≤|t|/2} γ(y) dy )
      ≤ C ( (1 + |t|/2)^{-m} + ∫_{|y|>|t|/2} γ(y) dy ).

In the case where a ≤ 1, the last expression implies that |g^{(k)}(t)| ≤ C(1 + |t|)^{-m}. Therefore,
g ∈ S(R) and is bounded in almost every known space, and in particular in B^σ_{11}.
Now, the rescaling properties of Besov spaces give that for σ > 0, there exists some
positive constant C such that for |λ| ≥ 1,

    ‖f(λ·)‖_{B^σ_{11}} ≤ C |λ|^σ ‖f‖_{B^σ_{11}}

for any f ∈ B^σ_{11}. It is then obvious that, from g(t) = 2^j γ(t) g̃(2^j u₁ t), we can deduce
‖g‖_{B^σ_{11}} ≤ C 2^{j(σ+1)}. Hence,

    ≤ C ( ∫_{|v|≤2^j} ‖g‖_{B^σ_{11}} dv/(1 + v²)^{d/2} + ∫_{|v|>2^j} ‖g‖_{B^σ_{11}} dv/(1 + v²)^{d/2} )
    ≤ C ( ∫_{|v|≤2^j} (1 + |v|)^{σ-1} dv/(1 + v²)^{(d-1)/2} + 2^{jσ} ∫_{|v|>2^j} dv/(1 + v²)^{d/2} )
    ≤ C ( 2^{jσ} 2^{-j(d-1)} + 2^{jσ} 2^{-j(d-1)} ).

Now if we recall that σ = s + (d - 1)/2, we have proven (4.16). The story is the same for
the coarse-scale ridgelets φ(u·x - b), and thus the proof of the lemma is complete.
Theorem 8 (Atomic Decomposition of R^s_{11}(𝔹_d)) Suppose that f ∈ R^s_{11}(𝔹_d) for s ≥
d/2. Then there exist numerical sequences (λ_k), (j_k), (u_k), (b_k) such that

    f(x) = Σ_{k=1}^∞ λ_k 2^{-j_k s} 2^{-j_k (d-1)/2} ψ_{j_k}(u_k·x - b_k)    (4.17)

with convergence in R^s_{11}. Moreover,

    Σ_{k=1}^∞ |λ_k| ≤ 2(2π)^{-d} ‖f‖_{R^s_{11}}.

(We recall the convention we adopted in section 3.6.1: for j_k = -1, ψ_{j_k}(u_k·x - b_k) stands
for φ(u_k·x - b_k).)
We follow Frazier, Jawerth, and Weiss (1991), since our proof is a minor adaptation of
their argument. In order to prove our claim, we shall use a lemma from functional analysis
that can be obtained from the Hahn-Banach theorem:

Lemma 9 Let K be a closed, convex and bounded subset of a Banach space B over the
reals. If x does not belong to K, then there exists a continuous real-valued linear functional l
such that sup_{w∈K} l(w) < l(x).
Proof of the Theorem. So, let f be in R^s_{11}(𝔹_d) for s ≥ d/2. We then know that
f ∈ L¹ ∩ L²(𝔹_d) and that we have

    f = ∫∫ ⟨f(x), φ(u·x - b)⟩ φ(u·x - b) du db + Σ_{j≥0} 2^{j(d-1)} ∫∫ ⟨f(x), ψ_j(u·x - b)⟩ ψ_j(u·x - b) du db    (4.18)

where the equality has the following sense: S_0(f)(x) = ∫∫ ⟨f(x), φ(u·x - b)⟩ φ(u·x - b) du db ∈ L²;
for any j ≥ 0, Δ_j(f)(x) = 2^{j(d-1)} ∫∫ ⟨f(x), ψ_j(u·x - b)⟩ ψ_j(u·x - b) du db ∈ L²;
and

    S_J(f) = S_0(f) + Σ_{0≤j≤J-1} Δ_j(f) → f in L² as J → ∞.
In addition, the convergence also takes place in R^s_{11}(𝔹_d). To see that, suppose g ∈
R^s_{11}(R^d) is such that g|_{𝔹_d} = f. With the same notation as above, we have

    g = S_0(g) + Σ_{j≥0} Δ_j(g),

and

    ‖Δ_j(g)‖_{R^s_{11}(𝔹_d)} ≤ C 2^{js} 2^{j(d-1)/2} ∫∫ |w_j(g)(u, b)| du db,
where the last inequality is a consequence of Lemma 7. We then immediately have that

    ‖g - S_J(g)‖_{R^s_{11}(𝔹_d)} ≤ C Σ_{j>J} 2^{js} 2^{j(d-1)/2} ∫∫ |w_j(g)(u, b)| du db → 0 as J → ∞

by definition of the norm ‖·‖_{R^s_{11}(R^d)}. Since g (resp. S_J(g)) coincides with f (resp. S_J(f))
on the unit ball, the convergence is established.
We are now in position to finish up the proof of the theorem. Let A be the set defined
by

Now that we have been able to establish the atomic decomposition formula, the rest of the
proof is almost identical to the one of Donoho (1993). We reproduce it here for the sake of
completeness. We first suppose -1/2 < α ≤ 0. Define

    η = max( sup_{t≠0} sup_{0≤m≤R} |d^m ψ/dt^m (t)| ((m + ⌈α⌉)!)^{-1} |t|^{m-α}, ‖ψ‖₁ ).    (4.20)
Similarly, write φ(t) = η₀ σ₀(t), where again σ₀ is a normalized singularity of degree α so
that ‖σ₀‖ ≤ 1. From (4.19), one can now check that we may write

    f = Σ_k a_k σ_k(u_k·x - b_k)

where a_k = η λ_k if j_k ≥ 0 and a_k = η₀ λ_k if j_k = -1. Now

    Σ_k |a_k| = η₀ Σ_{k: j_k=-1} |λ_k| + η Σ_{k: j_k≥0} |λ_k| ≤ max(η, η₀) 2(2π)^{-d} ‖f‖_{R^{1+α+(d-1)/2}_{11}}.
    Ave_u ‖R_u(σγ_d) ⋆ φ‖₁ ≤ C  and  sup_{j≥0} 2^{j(α+d)} Ave_u ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C,

where σγ_d denotes the function σ(x₁ - b) γ_d(x).
The basic calculation that we have already used gives

    R_u(σγ_d)(t) = γ(t) ∫ γ(y) σ(u₁ t - (1 - u₁²)^{1/2} y - b) dy
                 = γ(t)(1 - u₁²)^{α/2} ∫ γ(y) σ̃( u₁ t/(1 - u₁²)^{1/2} - y - b/(1 - u₁²)^{1/2} ) dy
                 = γ(t)(1 + v²)^{-α/2} (γ ⋆ σ̃)( v t - (1 + v²)^{1/2} b ),

where we recall that v = u₁/(1 - u₁²)^{1/2}. Here, σ̃ is of course a normalized singularity of
degree α. Now, a calculation in Donoho (1993) shows that the homogeneous Besov norm ‖·‖_{Ḃ^{1+α}_{11}} is
uniformly bounded over all the normalized singularities of degree α. Hence, for |v| ≤ 2^j
we have

    ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C 2^{-j(1+α)}.

For |v| ≥ 2^j, the picture is slightly different. Let s be greater than α + d, e.g. take s = 1 + α + d,
which implies

    ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C 2^{-j(α+1+d)} (1 + v²)^{d/2}
because of a previous observation. Averaging our two estimates yields (recall that the
density of v is proportional to (1 + v²)^{-d/2})

    Ave_u ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C ( 2^{-j(1+α)} 2^{-j(d-1)} + 2^{-j(α+1+d)} 2^j ) ≤ C 2^{-j(α+d)},
which is what needed to be shown. Of course, we also have for the coarse scale (γ denoting
the one-dimensional Gaussian density)

    Ave_u ‖R_u(σγ_d) ⋆ φ‖₁ ≤ Ave_u ‖R_u(σγ_d)‖₁ ‖φ‖₁ ≤ Ave_u ‖σγ_d‖₁ ‖φ‖₁
                            = Ave_u ‖σ(· - b)γ‖₁ ‖φ‖₁ ≤ C ‖σ(· - b)γ‖₁ ≤ C.
Theorem 7 is now completely proved.
Chapter 5
Approximation
This chapter shows how we can apply the ridgelet transform to derive new approximation
bounds: that is, we derive both a lower bound and an upper bound of approximation for
our new functional classes. These two bounds essentially match and show that, in a
sense, the ridgelet dictionary is optimal for approximating functions from these classes.
might be represented as

    f = Σ_{γ∈Γ_d} ⟨f, ψ_γ⟩ ψ̃_γ = Σ_{γ∈Γ_d} ⟨f, ψ̃_γ⟩ ψ_γ.    (5.3)

To simplify, α_γ will denote the ridgelet coefficient ⟨f, ψ_γ⟩. Finally, f̃_N will denote the N-
term dual-ridgelet expansion where one only keeps the dual ridgelets corresponding to the
N largest ridgelet coefficients. That is,

    f̃_N = Σ_{γ∈Γ_d} α_γ 1_{|α_γ| ≥ |α|_{(N)}} ψ̃_γ.    (5.4)
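The thresholding rule (5.4) acts purely on the coefficient sequence. The sketch below (a hypothetical helper with generic, made-up coefficients) illustrates it, together with the way a power-law decay of the sorted coefficients translates into decay of the squared L² error of the N-term approximation.

```python
def n_term_threshold(coeffs, N):
    """Keep the N largest coefficients in magnitude and zero the rest,
    as in the N-term dual-ridgelet approximation (5.4)."""
    order = sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))
    keep = set(order[:N])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

# coefficients with power-law decay |alpha|_(k) ~ 1/k: the squared error
# of the N-term approximation is then the tail sum of 1/k^2 beyond N
coeffs = [(-1) ** k / (k + 1) for k in range(1000)]
approx = n_term_threshold(coeffs, 10)
err2 = sum((a - b) ** 2 for a, b in zip(coeffs, approx))
```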
f with a singleton: d₁(f, D) = 0." Thus when we say "reasonable dictionary," we have in
mind that one considers only sequences of dictionaries whose size grows polynomially in the
number of terms to be kept in the approximation (5.1)-(5.2).
Remark. In Chapter 4, we considered a very natural class SH^α of objects having a special
kind of singularities (Definition 7) and showed that they almost correspond to balls in R^s_{pq}
(Theorem 7); it follows from Theorems 7 and 9 that these classes have the same lower and
upper bounds, as the L² approximation rate (5.5)-(5.6) does not depend on the parameters
p and q. Further, our result implies that thresholding the dual-ridgelet expansion is optimal
for approximating objects from these classes.
    d_N(F, D^{(N)}) ≥ K′ N^{-r}.
In our situation, the construction of embedded hypercubes involves properties of the
frame. Roughly speaking, for a fixed scale j, a subsequence of the frame (ψ_γ)_{γ∈Γ_j}, properly
rescaled, is a hypercube of the desired dimension. In order to prove this fact, we first need
to establish a number of key estimates about the decay of the kernel ⟨ψ_γ, ψ_{γ′}⟩. We will
notice that some of these key estimates support a number of claims that were made
in Chapter 4.
(Recall the convention about our notations introduced in section 3.6.1: ψ_{-1}(u·x - b) stands
for φ(u·x - b).) Let Q be an orthogonal change of coordinates (x = Qx′) such that

    u₀·x = x′₁ and u·x = (u·u₀) x′₁ - √(1 - (u·u₀)²) x′₂.

Finally, we set v to be u·u₀ / √(1 - (u·u₀)²). (We recall that for a fixed u₀, and a uniform
distribution of u on the sphere, the density of v is proportional to (1 + v²)^{-d/2}.) With our
notations, the kernel (5.7) can be rewritten as

    K_w(γ, γ′) = ∫ ψ_j((1 + v²)^{-1/2}(v x′₁ - x′₂) - b) ψ_{j′}(x′₁ - b′) w(Qx′) dx′
               = ∫ ψ_j((1 + v²)^{-1/2}(v x′₁ - x′₂) - b) ψ_{j′}(x′₁ - b′) w_Q(x′₁, x′₂) dx′₁ dx′₂,

where w_Q(x′₁, x′₂) is of course ∫ w(Qx′) dx′₃ … dx′_d. It is trivial to see that w_Q belongs to
S(R²) and that for all Q, and any n₁, n₂, there is a constant C (depending on w, n₁, n₂,
and m) with

    |∂^{n₁+n₂} w_Q(x₁, x₂) / ∂x₁^{n₁} ∂x₂^{n₂}| ≤ C (1 + |x₁| + |x₂|)^{-2m} ≤ C (1 + |x₁|)^{-m} (1 + |x₂|)^{-m}.
In this section we will assume that φ and ψ satisfy a few standard conditions: namely,
both φ and ψ are R times differentiable, and for every nonnegative integer m and every
n ≤ R there is a constant C (depending only upon m and n) so that

    |d^n ψ(t)/dt^n| ≤ C (1 + |t|)^{-m},

and similarly for φ. Further, we will suppose that ψ has vanishing moments through order D.
Lemma 10 Assume j′ ≥ j and suppose n < min(R, D). Then, for each n, m ≥ 0, there is
a constant C (depending on n and m) so that:

(i) Suppose 2^j(1 + v²)^{-1/2} ≤ 1; then

    |K_w(γ, γ′)| ≤ C 2^{-j′n} 2^{-jn} (1 + v²)^{n+1/2} (1 + |b′|)^{-m} (1 + |v b′ - (1 + v²)^{1/2} b|)^{-m}.
Proof of Lemma.

Case (i). The change of variables x′₁ = x₁, x′₂ = u₁x₁ - u₂x₂ allows rewriting the kernel
as

    K_w(γ, γ′) = ∫_{R²} ψ_j(x′₂ - b) ψ_{j′}(x′₁ - b′) w_Q(x′₁, v x′₁ - (1 + v²)^{1/2} x′₂) dx′₁ dx′₂.

Now let w̃_Q be the function defined by w̃_Q(x′₁, x′₂) = w_Q(x′₁, v x′₁ - (1 + v²)^{1/2} x′₂). It
is fairly clear that

    |∂^{n₁+n₂} w̃_Q(x′₁, x′₂) / ∂x′₁^{n₁} ∂x′₂^{n₂}| ≤ C (1 + v²)^{(n₁+n₂)/2} (1 + |x′₁| + |v x′₁ - (1 + v²)^{1/2} x′₂|)^{-2m},
where again, the constant C does not depend on Q. Let n be an integer. By assumption,
the wavelet ψ is of class C^R for R > n and has at least n vanishing moments. Then, from
standard wavelet estimates, it follows that one can find a constant C such that

    |K_w(γ, γ′)| ≤ C 2^{-j′n} 2^{-jn} (1 + v²)^{n+1/2} (1 + |b′| + |v b′ - (1 + v²)^{1/2} b|)^{-2m},

so that K_w(γ, γ′) = ∫ ψ(2^{j′}(x₁ - b′)) g(x₁) dx₁. For any n, there is a constant C such that
for any ℓ ≤ n
    |d^ℓ/dx₁^ℓ ψ_j(u₁x₁ - u₂x₂ - b)| ≤ C 2^{j(ℓ+1)} (1 + 2^j |u₁x₁ - u₂x₂ - b|)^{-m}

and

    |∂^{n-ℓ} w(x₁, x₂) / ∂x₁^{n-ℓ}| ≤ C (1 + |x₁| + |x₂|)^{-2m} ≤ C (1 + |x₁|)^{-m} (1 + |x₂|)^{-m}.

Now the result from Chapter 4 guarantees that

    |d^n g(x₁)/dx₁^n| ≤ C 2^{j(n+1)} (1 + |x₁|)^{-m} (1 + 2^j |u₁x₁ - b|)^{-m}.
And again standard wavelet estimates give in this case:

    |K_w(γ, γ′)| = | ∫ ψ_{j′}(x₁ - b′) g(x₁) dx₁ |.
Proof of Corollary. The proof is absolutely straightforward, as one just needs to integrate
the upper estimates obtained in Lemma 10.

Corollary 5 Let j′ ≥ j and suppose n < min(R, D). For 2n > d/p - 1, there is a constant
C (depending on n and p) so that

    ( ∫∫ |K(γ, γ_{j′,u′,b′})|^p db′ du′ )^{1/p} ≤ C 2^{-(j′-j)n} 2^j 2^{-jd/p}  and
    ( ∫∫ |K(γ_{j′,u′,b′}, γ)|^p db′ du′ )^{1/p} ≤ C 2^{-(j′-j)n} 2^j 2^{-jd/p},

which establishes the first result. The second result is similar in every point.
From Corollary 5, one can deduce the result announced in Chapter 4 (Lemma 7).

Proposition 9 Let s > 0 and 0 < p, q ≤ ∞. For any j ≥ 0, u ∈ S^{d-1}, b ∈ R, if min(R, D)
is large enough, then

    ‖w(x) ψ_j(u·x - b)‖_{R^s_{pq}} = ( ∫∫ |K_w(γ, γ_{-1,u′,b′})|^p db′ du′ )^{1/p}
        + ‖ { 2^{j′s} 2^{j′(d-1)/2} ( ∫∫ |K_w(γ, γ_{j′,u′,b′})|^p db′ du′ )^{1/p} }_{j′≥0} ‖_{ℓ^q}
at these points don't overlap? The maximum number of points we can distribute is called
the packing number. Again, there is a considerable literature on this matter (Conway and
Sloane, 1988) to which the reader can refer. In the sequel, we shall only make use of trivial
facts about this packing problem.

The purpose of this section is to establish a technical fact guaranteeing that
a certain functional class cannot be approximated by finite linear combinations of a given
dictionary D faster than a certain rate. It is important to note that the proof of this fact
(or, in other words, of the existence of lower bounds of approximation) does not need to be
constructive. This observation greatly simplifies our argument.
For a fixed $j$, let $S_j$ be a set of points $\{u_i\}$ on the sphere satisfying the following
properties.

(i) $\forall\, u_i, u_{i'} \in S_j$, $\|u_i - u_{i'}\| \ge 2^{-(j-j_0)}$.

(ii) $|S_j| \ge B_1\, 2^{(j-j_0)(d-1)}$.

(iii) For any $u \in S^{d-1}$, and all $0 \le \ell \le j - j_0$,
\[
\bigl| \{ u_i : 2^{\ell-1} \le |u \cdot u_i|\, (1 - (u \cdot u_i)^2)^{-1/2} \le 2^{\ell} \} \bigr| \le B_2\, 2^{(j-j_0)(d-1)}.
\]
In the above expression, the constants $B_1$ and $B_2$ can be chosen not to depend on $j$ and $j_0$.
We remark that the first property (i) implies that $|v_{ii'}| \ge 2^{j-j_0}$ only if $i = i'$.
Indeed, let $v_{ii'} = u_i \cdot u_{i'}\, (1 - (u_i \cdot u_{i'})^2)^{-1/2}$. Suppose for instance that $v_{ii'} \ge 2^{j-j_0}$. We have
\[
\|u_{i'} - u_i\|^2 = 2 \Bigl( 1 - \frac{v_{ii'}}{(1 + v_{ii'}^2)^{1/2}} \Bigr)
= 2\, \frac{1}{(1 + v_{ii'}^2)^{1/2}\, \bigl(v_{ii'} + (1 + v_{ii'}^2)^{1/2}\bigr)}.
\]
Therefore, $v_{ii'} \ge 2^{j-j_0}$ implies $\|u_{i'} - u_i\| < 2^{-(j-j_0)}$. From (i), it follows that this is
equivalent to $i = i'$. The argument is identical in the case $v_{ii'} \le -2^{j-j_0}$.
To simplify the analysis, suppose $\psi \in \mathcal{S}(\mathbf{R})$ is compactly supported, $\operatorname{supp} \psi \subset [-1/2, 1/2]$,
and has a sufficiently large number of vanishing moments. We normalize $\psi$ such that
$\|\psi\|_2 = 1$. Further, let $w \in C_0^\infty$ be a radial window such that $0 \le w \le 1$ and $w(x) = 1$
for any $x$ with $\|x\| \le \sqrt{3}/2$. We now consider the set $A_j$ of windowed ridgelets at scale $j$
Proof of Lemma. The norm of $f_{ik}$ being clearly invariant under rotation ($w$ radial), one can
CHAPTER 5. APPROXIMATION 71
\[
\|f_{ik}\|_2^2 \ge \int_{|x_1| \le \sqrt{2}/2} 2^{j}\, \psi^2\bigl(2^{j} (x_1 - k 2^{-j})\bigr)\, dx_1 \int_{x_2^2 + \cdots + x_d^2 \le (1/2)^2} dx_2 \cdots dx_d
= \|\psi\|_2^2\, c_d = c_d,
\]
where $c_d$ might be chosen to be the volume of a $d-1$ dimensional ball of radius $1/2$. This
proves (i).
Before proceeding further, observe that if $0 < \delta \le 1$, $x \in \mathbf{R}$, $y \in \mathbf{R}$, and $\epsilon > 0$, we
have
\[
\sum_{k \in \mathbf{Z}} (1 + |x - \delta k|)^{-1-\epsilon}\, (1 + |y - \delta k|)^{-1-\epsilon} \le C\, \delta^{-1}\, (1 + |y - x|)^{-1-\epsilon}. \tag{5.11}
\]
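The discrete convolution inequality (5.11) can be checked numerically. The sketch below (with $\epsilon = 1/2$ and an arbitrary truncation of the sum over $k$) verifies that the ratio of the two sides stays bounded uniformly over a few values of $x$, $y$ and $\delta$:

```python
import numpy as np

def conv_sum(x, y, delta, eps=0.5, K=200000):
    """Left-hand side of (5.11): sum over k of the two localized factors."""
    t = delta * np.arange(-K, K + 1)
    return np.sum((1 + np.abs(x - t)) ** (-1 - eps) * (1 + np.abs(y - t)) ** (-1 - eps))

# LHS * delta * (1 + |y - x|)^{1+eps} should stay bounded, uniformly in
# x, y and delta, exactly as the inequality predicts.
ratios = [conv_sum(x, y, delta) * delta * (1 + abs(y - x)) ** 1.5
          for x in (-7.3, 0.0, 4.1)
          for y in (-2.0, 5.5)
          for delta in (1.0, 0.1, 0.01)]
print(max(ratios))
```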
for some new constant $C_2(d)$, depending only on $d$ and $w$. Summing over $u_{i'}$ ($u_{i'} \ne u_i$)
where again $C_4(d)$ is a new constant $C(d, w)$. (Notice that we have sacrificed exactness
for synthetic notation: in the second line of the array, read $|\{u_{i'} : 0 \le |v_{ii'}| \le 1\}|$ instead
of $|\{u_{i'} : 2^{\ell-1} \le |v_{ii'}| \le 2^{\ell}\}|$ when the index $\ell$ equals $0$.) Therefore, by choosing $j_0$ large
enough, one can make sure that the quantity $C_d\, 2^{-j_0}\, 2^{d}$ is dominated by $c_d$, which proves
(ii).
Lemma 12 Let $\mathcal{C}$ be the parallelepiped defined by
\[
\mathcal{C} = \Bigl\{ f : f = \sum_{ik} \epsilon_{ik} f_{ik}, \quad |\epsilon_{ik}| \le 1 \Bigr\}. \tag{5.12}
\]
Finally, so as to simplify notation, we show the result for $p = 1$, as the argument for $p \ne 1$ is
absolutely parallel; we will just indicate where to modify the proof to handle this case. We
have
\[
|R_{j'}(u', b')(f w)| = \Bigl| \sum_{ik} \epsilon_{ik} \int f_{ik}(x)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, dx \Bigr|
\le \sum_{ik} |\epsilon_{ik}|\, \Bigl| \int 2^{j/2} \psi\bigl(2^{j} (u_i \cdot x - k 2^{-j})\bigr)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, w(x)\, dx \Bigr|
\]
\[
\le \sum_{ik} \Bigl| \int 2^{j/2} \psi\bigl(2^{j} (u_i \cdot x - k 2^{-j})\bigr)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, w(x)\, dx \Bigr|.
\]
To be consistent with the preceding section and to simplify the notation, let $K_w(\lambda_{ik}, \lambda')$
be defined as
\[
K_w(\lambda_{ik}, \lambda') = \int 2^{j} \psi\bigl(2^{j} (u_i \cdot x - k 2^{-j})\bigr)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, w(x)\, dx.
\]
(For consistency, we want to bring to the reader's attention that the
normalization has changed, i.e. $2^{j/2}$ has been replaced by $2^{j}$.) Once again we apply our
fundamental estimates. Let $v_i$ be $u_i \cdot u' / (1 - (u_i \cdot u')^2)^{1/2}$. Suppose for now that $j' \ge j$; we
have, for $v_i \le 2^{j}$,
\[
|K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-j'n}\, 2^{-jn}\, (1 + v_i^2)^{n+1/2}\, (1 + |b'|)^{-2}\, \bigl(1 + |v_i b' - (1 + v_i^2)^{1/2} k 2^{-j}|\bigr)^{-2}
\]
and, similarly, for the interval $0 \le |v| \le 1$. Repeating the argument of the previous lemma
gives
\[
\sum_{i : |v_i| \le 2^{j-j_0}} \sum_k |K_w(\lambda_{ik}, \lambda')|
\le C\, 2^{-j'n}\, 2^{-j(n-1)}\, B\, 2^{(j-j_0)(d-1)} \sum_{\ell=0}^{j-j_0} 2^{\ell(2n-(d-1))}\, (1 + |b'|)^{-2}
\le C\, 2^{-j_0 n}\, 2^{jn}\, 2^{j}\, (1 + |b'|)^{-2}.
\]
Now for $v_i \ge 2^{j-j_0}$, since our estimate for $v_i \ge 2^{j}$ dominates the one for $v_i \le 2^{j}$, one can
check that
\[
|K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-j'n}\, 2^{j(n+1)}\, (1 + |b'|)^{-2}\, \Bigl( 1 + 2^{j} \Bigl| b' - \frac{(1 + v_i^2)^{1/2}}{v_i}\, k 2^{-j} \Bigr| \Bigr)^{-2}.
\]
Now, we know that $|v_i| \ge 2^{j-j_0}$ implies $\|u' - u_i\| \le 2^{-(j-j_0)}$ and, similarly, when the index
$i$ is replaced by $i'$. Suppose $S = \{ i : |v_i| \ge 2^{j-j_0} \} \ne \emptyset$ and let $i_0$ be one of its elements.
Then, since by the triangle inequality $|v_i| \ge 2^{j-j_0} \Longrightarrow \|u_{i_0} - u_i\| \le 2 \cdot 2^{-(j-j_0)}$, it follows that
the cardinality of $S$ can be bounded by a constant depending at most on the dimension $d$.
Therefore, we have
\[
\sum_{i : |v_i| \ge 2^{j-j_0}} \sum_k |K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-j'n}\, 2^{j(n+1)}\, (1 + |b'|)^{-2},
\]
giving
\[
\sum_k |K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-jn}\, 2^{j'(n+1)}\, (1 + |b'|)^{-2}.
\]
Again, summing over the $i$'s yields
\[
\sum_{i : |v_i| \le 2^{j_0}} \sum_k |K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-(j-j')(n-d)}\, 2^{j}\, (1 + |b'|)^{-2}.
\]
In the case $p < 1$, we just need to replace $(1 + |b'|)^{-2}$ with $(1 + |b'|)^{-2/p}$ at each of its
occurrences in (5.13), as it follows from the same argument using, however, Lemma 10 with
sharper bounds. A direct consequence of the above inequalities together with the preceding
by $\delta^{-1}$. Recall the identity $G^{-1/2} = \pi^{-1} \int_0^\infty (G + \eta I)^{-1}\, \eta^{-1/2}\, d\eta$,
which gives
\[
\|(G + \eta I)^{-1}\|_{(\ell_1, \ell_1)} \le (1 + \eta)^{-1} \Bigl( \|I\|_{(\ell_1, \ell_1)} + \sum_{k \ge 1} (1 + \eta)^{-k}\, \|F^k\|_{(\ell_1, \ell_1)} \Bigr)
\le (1 + \eta)^{-1} \Bigl( 1 + \sum_{k \ge 1} (1 + \eta)^{-k}\, \|F\|^{k}_{(\ell_1, \ell_1)} \Bigr)
\le (1 + \eta)^{-1}\, \frac{1}{1 - \|F\|_{(\ell_1, \ell_1)}}.
\]
Finally,
\[
\|H\|_{(\ell_1, \ell_1)} \le \frac{1}{\pi} \int_0^\infty \bigl(1 - \|F\|_{(\ell_1, \ell_1)}\bigr)^{-1}\, (1 + \eta)^{-1}\, \eta^{-1/2}\, d\eta = \bigl(1 - \|F\|_{(\ell_1, \ell_1)}\bigr)^{-1}.
\]
By assumption we have $\|F\|_{(\ell_1, \ell_1)} \le 1 - \delta$, implying $\|H\|_{(\ell_1, \ell_1)} \le \delta^{-1}$, which is precisely
what needed to be proved.
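The bound just proved has a simple finite-dimensional analogue that can be checked numerically, here in spectral norm rather than the $(\ell_1, \ell_1)$ operator norm of the lemma; the dimension and $\delta$ below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta = 60, 0.2
# Symmetric perturbation F with ||F|| = 1 - delta < 1, so that G = I + F
# is positive definite with eigenvalues in [delta, 2 - delta].
A = rng.standard_normal((n, n))
F = (A + A.T) / 2
F *= (1 - delta) / np.linalg.norm(F, 2)
G = np.eye(n) + F
w, V = np.linalg.eigh(G)
H = V @ np.diag(w ** -0.5) @ V.T      # H = G^{-1/2}
bound = 1.0 / delta                   # the lemma's bound (1 - ||F||)^{-1}
print(np.linalg.norm(H, 2), bound)
```

In spectral norm the bound is even generous: the norm of $H$ is at most $\delta^{-1/2}$.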
We are now in a position to state the main theorem of this section.

Theorem 11 Suppose $1 \le p \le q \le \infty$. Let $R^s_{pq}(1)$ be the unit ball of $R^s_{pq}$. Then there
exists a hypercube of sidelength $\delta$ and dimension $m(\delta)$ embedded in $R^s_{pq}(1)$. That is, we
can find $m(\delta)$ orthogonal functions $g_i^m$, $i = 1, \ldots, m(\delta)$, with $\|g_i^m\|_2 = \delta$, such that
\[
\mathcal{H}(\delta, \{g_i\}) = \Bigl\{ f : f = \sum_i \theta_i g_i^m, \quad |\theta_i| \le 1 \Bigr\} \subset R^s_{pq}(1)
\]
and
\[
m(\delta) \ge K\, \delta^{-\frac{1}{s/d + 1/2}}.
\]
Proof of Theorem. The proof is a mere consequence of the three preceding preparatory
lemmas. In view of Theorem 10, this is exactly what needed to be shown to prove the first
part of Theorem 9.
The norm (5.14) is the discretized version of the continuous norm (4.3).
For compactly supported functions, we have the following lemma.

Lemma 14 There is a constant $C$, possibly depending on $s, p, q$, such that
\[
\|\gamma\|_{r^s_{pq}} \le C\, \|f\|_{R^s_{pq}}. \tag{5.15}
\]
We prove the lemma for the subset of triplets $(s, p, q)$ defined in the first paragraph of this
section.
Proof of Lemma. Without loss of generality, we may assume that $f$ is compactly supported. For a
frame $(\psi_\lambda)$ of $L^2$, let $A$ be the analysis operator defined by $Af = (\langle f, \psi_\lambda \rangle)_{\lambda \in \Lambda_d}$. To prove the
lemma, one writes
\[
\gamma_\lambda = \langle f w, \psi_\lambda \rangle = \int R_{-1}(u', b')(f)\, \langle \varphi(u' \cdot x - b'),\, w \psi_\lambda \rangle\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd} \int R_{j'}(u', b')(f)\, \bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w \psi_\lambda \bigr\rangle\, du'\, db'.
\]
In view of the Parseval relationship (Proposition 2), the above relationship is fully justified,
as membership to $R^s_{pq}$ for $s > d(1/p - 1/2)_+$ implies (for compactly supported functions)
membership to $L^2$ (see the last paragraph of section 4.2.1). Then, observe that
\[
\sum_{\lambda \in \Lambda_j} |\gamma_\lambda| \le \int |R_{-1}(u', b')(f)| \sum_{\lambda \in \Lambda_j} |\langle \varphi(u' \cdot x - b'),\, w \psi_\lambda \rangle|\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd} \int |R_{j'}(u', b')(f)| \sum_{\lambda \in \Lambda_j} \bigl|\bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w \psi_\lambda \bigr\rangle\bigr|\, du'\, db',
\]
where the terms corresponding to $-1 \le j' \le j$ and to $j' > j$ are handled separately; for the latter, one gets
\[
\sum_{j' > j} 2^{j'd}\, 2^{-(j'-j)(n+1/2)} \int |R_{j'}(u', b')(f)|\, du'\, db'.
\]
Then,
\[
2^{js}\, 2^{-jd/2} \sum_{\lambda \in \Lambda_j} |\gamma_\lambda|
\le \sum_{-1 \le j' \le j} 2^{-(j-j')(n - d/2 - s - 1/2)}\, 2^{j's}\, 2^{j'd/2} \int |R_{j'}(u', b')(f)|\, du'\, db'
\]
\[
+ \sum_{j' > j} 2^{-(j'-j)(n + d/2 - s - 1/2)}\, 2^{j's}\, 2^{j'd/2} \int |R_{j'}(u', b')(f)|\, du'\, db',
\]
and from there, it is then easy to see that for any $q > 0$
\[
\|\gamma\|_{r^s_{1q}} = \Bigl\| \Bigl( 2^{js}\, 2^{-jd/2} \sum_{\lambda \in \Lambda_j} |\gamma_\lambda| \Bigr)_j \Bigr\|_{\ell^q}
\le C\, \Bigl\| \Bigl( 2^{j's}\, 2^{j'd/2} \int |R_{j'}(u', b')(f)|\, du'\, db' \Bigr)_{j'} \Bigr\|_{\ell^q} = \|f\|_{R^s_{1q}}.
\]
(The cases where $q \le 1$ and $q = \infty$ are clear, and interpolation does the rest.)
Case $p = \infty$. In this case, we have
\[
|\gamma_\lambda| \le \int |R_{-1}(u', b')(f)|\, |\langle \varphi(u' \cdot x - b'),\, w(x) \psi_\lambda(x) \rangle|\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd} \int |R_{j'}(u', b')(f)|\, \bigl|\bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w(x) \psi_\lambda(x) \bigr\rangle\bigr|\, du'\, db'
\]
\[
\le \|R_{-1}(f)\|_{\infty} \int |\langle \varphi(u' \cdot x - b'),\, w(x) \psi_\lambda(x) \rangle|\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd}\, \|R_{j'}(f)\|_{\infty} \int \bigl|\bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w(x) \psi_\lambda(x) \bigr\rangle\bigr|\, du'\, db'.
\]
Now again, it is clear that the above inequality gives for any $q > 0$
\[
\|\gamma\|_{r^s_{\infty q}} = \Bigl\| \Bigl( 2^{js}\, 2^{jd/2} \sup_{\lambda \in \Lambda_j} |\gamma_\lambda| \Bigr)_j \Bigr\|_{\ell^q}
\le C\, \Bigl\| \Bigl( 2^{j's}\, 2^{j'd/2}\, \|R_{j'}(f)\|_{\infty} \Bigr)_{j'} \Bigr\|_{\ell^q} = \|f\|_{R^s_{\infty q}}. \tag{5.16}
\]
Case p = q = 2. This case is a direct consequence of Theorem 6.
Extension. It seems reasonable to suspect that one can extend Lemma 14 by inter-
polation. By definition, the space $R^s_{pq}$ is a weighted space of the type $\ell^q(L^p)$: using the
notations of the lemma, one can rewrite the $R^s_{pq}$ norm of an object as
\[
\|f\|_{R^s_{pq}} = \bigl\| 2^{js}\, 2^{jd/2}\, \|R_j(f)\|_{L^p(S^{d-1} \times \mathbf{R})} \bigr\|_{\ell^q}
= \Biggl( \sum_{j \ge -1} \Bigl( 2^{j(s + d/2)} \Bigl( \int_{S^{d-1} \times \mathbf{R}} |R_j(u, b)|^p\, du\, db \Bigr)^{1/p} \Bigr)^{q} \Biggr)^{1/q}
\]
with the obvious modification in the case $p = \infty$ or/and $q = \infty$. (Again, there is a full analogy
with the Besov scale: replace $R_j(f)$ with $f * \varphi_j$ and the weights $2^{js}\, 2^{jd/2}$ with $2^{js}$.) Similarly,
the sequence space $r^s_{pq}$ is a weighted space of the type $\ell^q(\ell^p)$. From Lemma 14, we know
that for any $s, q > 0$, the operator $A$ is bounded from $R^s_{1q}$ (respectively, $R^s_{\infty q}$) to $r^s_{1q}$
(respectively, $r^s_{\infty q}$). Now, the sequence spaces $r^s_{pq}$ are clearly interpolation spaces with
well-known properties, and one expects the same interpolation properties to be true for the
spaces $R^s_{pq}$. If that were true, one would be able to prove Lemma 14 for the full range of
parameters $s > 0$, $1 \le p \le \infty$, $q > 0$ (at least when $R^s_{pq} \subset L^2$), since bounds at
the "corners" are already established. Work in this direction is in progress and we hope to
report on it shortly.

Further, we have learned that in the case where $p = q = 2$, the norms $\|\cdot\|_{r^s_{22}}$ and $\|\cdot\|_{R^s_{22}}$
are actually equivalent; therefore, this favors the conjecture we have just brought up and
leads to a more ambitious one: namely, the norm equivalence between $\|\cdot\|_{r^s_{pq}}$ and $\|\cdot\|_{R^s_{pq}}$.
Lemma 14 has a very interesting corollary. Let us consider a frame like in section 3.6.1
such that both $\varphi$ and $\psi$ are compactly supported and that $\psi$ has enough vanishing moments
so that Lemma 14 holds. The assumption of compact support will simplify the proof of the
corollary but is not a necessary hypothesis.
Corollary 6 Assume $s > d(1/p - 1/2)_+$ and let $p^*$ be defined by $1/p^* = s/d + 1/2$; then,
for $f$ in $R^s_{pq}$, we have
\[
\|\gamma\|_{w\ell^{p^*}} \le C\, \|f\|_{R^s_{pq}}.
\]
To simplify, $\gamma_\lambda$ will denote the ridgelet coefficient $\langle f, \psi_\lambda \rangle$. Finally, $\tilde{f}_N$ will denote the $N$-
term ridgelet expansion where one only keeps the terms corresponding to the $N$ largest
coefficients. That is,
\[
\tilde{f}_N(x) = \sum_{\lambda \in \Lambda_d} \gamma_\lambda\, 1_{\{|\gamma_\lambda| \ge |\gamma|_{(N)}\}}\, \tilde{\psi}_\lambda. \tag{5.19}
\]
We recall that for a sequence $(\theta_n)$, the weak-$\ell^p$ or Marcinkiewicz quasi-norm is defined
as follows. Let $|\theta|_{(n)}$ be the $n$th largest entry in the sequence $(|\theta_n|)$; we set
\[
|\theta|_{w\ell^p} = \sup_{n > 0}\, n^{1/p}\, |\theta|_{(n)}. \tag{5.20}
\]
Lemma 15 Let $0 < p < 2$ and suppose that the sequence of coefficients $(\gamma_\lambda)_{\lambda \in \Lambda_d}$ has a
bounded weak-$\ell^p$ norm. Then there is a constant $C$ (depending at most on $p$) such that,
with $r = 1/p - 1/2$,
\[
\|f - \tilde{f}_N\|_2 \le C\, N^{-r}\, |\gamma|_{w\ell^p}.
\]
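Lemma 15 is easy to illustrate numerically in the orthonormal case, where the $N$-term error is exactly the $\ell^2$ norm of the discarded coefficients. The sketch below uses a synthetic coefficient sequence of our choosing, with $p = 1$ and hence $r = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 1.0, 0.5                      # r = 1/p - 1/2
M = 20000
# Sequence with n-th largest entry exactly n^{-1/p}, in random order.
theta = rng.permutation(np.arange(1, M + 1) ** (-1.0 / p))

def weak_lp(seq, p):
    s = np.sort(np.abs(seq))[::-1]
    return np.max(np.arange(1, len(s) + 1) ** (1.0 / p) * s)

wl = weak_lp(theta, p)
s = np.sort(np.abs(theta))[::-1]
# Error of keeping the N largest coefficients decays like N^{-r}.
errs = {N: np.sqrt(np.sum(s[N:] ** 2)) for N in (10, 100, 1000)}
for N, err in errs.items():
    print(N, err, wl * N ** (-r))
```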
Then, the frame decomposition (6.7) tells us that, first, $SA$ is the identity of $L^2$, and,
second, $\tilde{K} = AS$ is the orthogonal projector onto the range of $A$. Again, let $\gamma = (\langle f, \psi_\lambda \rangle)$
be the ridgelet coefficients of $f$ and $\gamma^{(N)}$ be the truncated sequence of the $N$ largest
coefficients, i.e. $\gamma^{(N)}_\lambda = \gamma_\lambda\, 1_{\{|\gamma_\lambda| \ge |\gamma|_{(N)}\}}$. Then, of course, $\tilde{f}_N = S \gamma^{(N)}$. Now the frame
property gives
\[
\|f - \tilde{f}_N\|_2^2 \le B\, \|\gamma - \tilde{K} \gamma^{(N)}\|^2_{\ell_2(\Lambda_d)} \le B\, \|\gamma - \gamma^{(N)}\|^2_{\ell_2(\Lambda_d)}
\]
since the norm of $\tilde{K}$ is $1$. Now, Lemma 1 in Donoho (1993) allows us to conclude that
\[
|R_\lambda(f)| \le C\, 2^{-j(n+1/2)}\, (1 + v^2)^{(n-\alpha)/2}\, \bigl(1 + |(1 + v^2)^{1/2} k 2^{-j}|\bigr)^{-n},
\]
\[
|R_\lambda(f)| \le C\, 2^{-j(\alpha+1/2)}\, (1 + |k|)^{-n}.
\]
We are going to show that, for a nice frame, the ridgelet coefficients of $f$ are in $\ell^p$ for
any $p > 0$. So, let $p > 0$ be fixed. Our estimates imply that for $(1 + v^2)^{1/2} \ge 2^{j}$,
\[
\sum_k |R_\lambda(f)|^p \le C^p\, 2^{-jp(n+1/2)}\, (1 + v^2)^{p(n-\alpha)/2}\, 2^{j} (1 + v^2)^{-1/2}
\]
if $n$ is chosen large enough so that $(\alpha - n) p < -1$. Next we recall that for $s \ge 0$
\[
\sum_{i \in I_j : (1 + v_i^2)^{1/2} \le 2^{j}} (1 + v_i^2)^{s/2} \le C\, \max\bigl(2^{js}, 2^{j(d-1)}\bigr).
\]
(Note that for $\alpha \le -1/2$, $f$ is not square integrable.) Various extensions of results of this
kind are, of course, possible.
Chapter 6
The Case of Radial Functions
In Chapter 5 we have seen that ridgelets are optimal to represent functions that may
be singular across hyperplanes when there may be an arbitrary number of hyperplanes
with any orientations and locations. A natural question one may ask is: can we curve these
hyperplanes (singular sets)? In other words, how good are ridgelets at representing objects
that are singular across curved manifolds? To answer this question, the study of radial objects
is particularly attractive, for it is both enlightening and simple. We will give accurate degrees
of approximation of radial functions by ridgelets. Finally, we will establish a remarkable
result: the rates of approximation by ridgelets to radial functions which are smooth away
from spheres are identical to the rates achieved by wavelets.

The analysis will illustrate why ridge functions (neural networks) are not free from the
curse of dimensionality.

Again, we shall restrict our attention to the case of compactly supported objects.
so that $R_u f(t) = |S^{d-2}|\, (T_d \varphi)(t)$. The operator $T_d$ is an integral operator whose kernel is
$r (r^2 - t^2)^{(d-3)/2}$; from (6.2) one sees easily that for $t \ge 0$, we have
\[
(T_d \varphi)(\sqrt{t}) = \frac{1}{2} \int_t^1 (r - t)^{(d-3)/2}\, \varphi(\sqrt{r})\, dr.
\]
Therefore, if we let $U_d$ be the convolution operator defined by
\[
(U_d g)(t) = \int_t^1 (r - t)^{(d-3)/2}\, g(r)\, dr \tag{6.3}
\]
and $C$ be the change of coordinates $g(t) \mapsto g(\sqrt{t})$ ($C^{-1}$ corresponding of course to $g(t) \mapsto g(t^2)$), we have
\[
T_d \varphi = \tfrac{1}{2}\, C^{-1} U_d\, C \varphi.
\]
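The factorization $T_d = \tfrac{1}{2} C^{-1} U_d C$ can be sanity-checked numerically. For $d = 3$ and $\varphi \equiv 1$ (so that $f$ is the indicator of the unit ball in $\mathbf{R}^3$), the Radon transform at offset $t$ must be the area of the slicing disk, $\pi(1 - t^2)$; the quadrature scheme below is our own sketch, not a construction from the text:

```python
import numpy as np

def U(d, g, t, m=4000):
    """(U_d g)(t) = integral from t to 1 of (r - t)^((d-3)/2) g(r) dr (midpoint rule)."""
    r = np.linspace(t, 1.0, m + 1)
    mid = (r[:-1] + r[1:]) / 2
    return np.sum((mid - t) ** ((d - 3) / 2) * g(mid)) * (1.0 - t) / m

def T(d, phi, t):
    # T_d phi = (1/2) C^{-1} U_d C phi, with (C g)(t) = g(sqrt(t))
    return 0.5 * U(d, lambda r: phi(np.sqrt(r)), t * t)

# For d = 3, |S^{d-2}| = |S^1| = 2*pi.
radon = {t: 2 * np.pi * T(3, lambda r: np.ones_like(r), t) for t in (0.0, 0.3, 0.7)}
print(radon)
```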
Now, suppose that we first look at functions vanishing in a neighborhood of the origin.
For instance, let $\eta > 0$ be such that $\varphi|_{[0, \eta]} = 0$. We have for $1 \le p, q \le \infty$
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}} \le C\, \|\varphi\|_{B^s_{pq}},
\]
where, in addition to $\eta$, the constant $C$ depends on $s, p, q$. To see this, observe that for
$\operatorname{supp} g \subset [\eta, 1]$, the operator $C : g(t) \mapsto g(\sqrt{t})$ is bounded from $B^s_{pq}(0, 1)$ to itself for any
choice of $s, p, q$. In addition, $C^{-1} : g(t) \mapsto g(t^2)$ is bounded from $B^s_{pq}(0, 1)$ to $B^s_{pq}(-1, 1)$,
again for any possible triplet $s, p, q$. The only thing remaining to prove is that $U_d$ is bounded
from $B^s_{pq}(0, 1)$ to $B^{s + (d-1)/2}_{pq}(0, 1)$. This is not very difficult and is shown in the Appendix.

As a matter of fact, we probably do not need the additional assumption that $\varphi$ vanish in
a neighborhood of the origin.
Proposition 11 Suppose $\varphi$ is an even function such that $\varphi \in B^s_{pq}(-1, 1)$. Then $T_d \varphi \in B^{s + (d-1)/2}_{pq}(-1, 1)$ and
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}(-1, 1)} \le C\, \|\varphi\|_{B^s_{pq}(-1, 1)} \tag{6.4}
\]
for some constant $C$ depending at most on $s, p, q$ and the dimension $d$. Moreover, suppose
in addition that $\varphi$ vanishes in $[-\eta, \eta]$; then we have the reverse inequality
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}(-1, 1)} \ge C\, \|\varphi\|_{B^s_{pq}(-1, 1)} \tag{6.5}
\]
and, therefore,
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}(-1, 1)} \asymp \|\varphi\|_{B^s_{pq}(-1, 1)}
\]
in the sense of equivalent norms. Now suppose further that $f$ is compactly supported, say,
in the interval $[-1, 1]$. For $p \ge 1$, we have the Poincaré inequality, which states that
\[
\|f\|_p \le C(p)\, \|f'\|_p,
\]
\[
\|T_3 \varphi\|_{B^{s+1}_{pq}} \le C\, \|\varphi\|_{B^s_{pq}},
\]
\[
\frac{d}{dt}\, (T_{d+2} \varphi)(t) = -t\, (T_d \varphi)(t) \quad \text{for } t \ne 0.
\]
Therefore, the same argument as the one for $d = 3$ yields (6.4) for $d + 2$.
We turn to the proof of the lower bound (6.5). Let $\chi$ be a $C^\infty$ function such that
$0 \le \chi \le 1$, $\chi = 1$ whenever $|t| \ge 1$, and $\chi = 0$ for $|t| \le 1/2$ (the notation $\chi_\eta$ will
stand for $\chi(\cdot / \eta)$). Now, there are well-known inversion formulas for the Radon transform.
In particular, when the function $f$ is radial ($f(x) = \varphi(\|x\|)$), it is easily established that
(Natterer, 1986) the inversion involves iterates of an operator $L$ of the form $Lg = -t^{-1}\, \frac{d}{dt} g$, for which
\[
\|L g\|_{B^{\sigma - 1}_{pq}} \le C\, \|g\|_{B^{\sigma}_{pq}}.
\]
To see this, recall that the differentiation operator $\frac{d}{dt}$ is a bounded map from $B^{\sigma}_{pq}$ to
$B^{\sigma - 1}_{pq}$ for any $-\infty < \sigma < \infty$, as can be obtained as a corollary of two theorems from
Triebel (1983, pages 57-61). Further, the multiplication by $1/t$ is of course a bounded
operation in $B^{\sigma - 1}_{pq}$ restricted to its elements supported in, say, $\{\eta/4 \le |t| \le 1\} \supset \operatorname{supp} g$.
By induction, we get that
But of course, it follows from (6.6) that for $|t| \ge \eta/2$, we have $L^{(d-1)/2} g(t) = \varphi(t)$, implying
$\varphi = L^{(d-1)/2} g$. Therefore, to summarize, we have
where the last inequality comes from the first part of the proposition. This completes the
proof of the proposition.
supported. Following Chapter 2, we will denote by $\psi_\lambda$ the elements of the frame, for $\lambda \in \Lambda_d$.
We know that any element of, say, $L^2$ might be represented as
\[
f = \sum_{\lambda \in \Lambda_d} \langle f, \psi_\lambda \rangle\, \tilde{\psi}_\lambda = \sum_{\lambda \in \Lambda_d} \langle f, \tilde{\psi}_\lambda \rangle\, \psi_\lambda. \tag{6.7}
\]
To simplify, $\gamma_\lambda$ will denote the ridgelet coefficient $\langle f, \psi_\lambda \rangle$. Finally, let $\tilde{f}_N$ be the $N$-term
ridgelet expansion where one only keeps the terms corresponding to the $N$ largest coefficients
(5.19). We have the following theorem:
Theorem 12 Let $f$ be a radial function supported in the unit ball, $f(x) = \varphi(\|x\|)$, such that
$\varphi \in B^s_{pq}$. Suppose that $s > d(1/p - 1/2)_+$. Then we have the $L^2$ error of approximation
\[
\|f - \tilde{f}_N\|_2 \le C\, N^{-s/d}\, \|\varphi\|_{B^s_{pq}}. \tag{6.8}
\]
Remark 1. In fact, the result (6.8) is probably sharp: that is, for any $s' > s$, there must
be a function $\varphi \in B^s_{pq}$ with, say, $\|\varphi\|_{B^s_{pq}} = 1$ such that
\[
\|f - \tilde{f}_N\|_2 \ge c\, N^{-s'/d} \quad \text{for infinitely many } N. \tag{6.9}
\]
At the end of this section, we will argue that there is some strong evidence supporting this
conjecture. This suggests that neural networks are also subject to the curse of dimension-
ality: to obtain approximation to $f$ at rates $N^{-r}$, we must assume that $f$ is $r \cdot d$ times
differentiable.

Remark 2. We remark that the rate of convergence is identical to the one obtained
using truncated wavelet expansions. This remarkable fact will be examined further in the
discussion section.
Proof of Theorem. We start by proving the upper bound. Notice that by construction
the collection is a frame of $L^2[-1, 1]$. Now, for a univariate function $g : \mathbf{R} \to \mathbf{R}$, let $\beta_{jk} = \langle g, \psi_{jk} \rangle$ and
$\alpha_k = \langle g, \varphi_k \rangle$. Although we do not give a proof here, the quantity
\[
\|g\|_{b^s_{pq}} = \|(\alpha_k)_k\|_{\ell^p} + \Bigl\| \Bigl( 2^{js}\, 2^{j(1/2 - 1/p)} \Bigl( \sum_k |\beta_{jk}|^p \Bigr)^{1/p} \Bigr)_{j \ge 0} \Bigr\|_{\ell^q} \tag{6.10}
\]
is an equivalent norm to the norm of $g$ in $B^s_{pq}$, for any $g$ supported in $[-1, 1]$.

As we suppose that $f(x) = g(\|x\|)$ is supported in the unit ball, we recall that,
for a $g$ in $B^s_{pq}$, its image $T_d g$ is in $B^{s + (d-1)/2}_{pq}$; in addition, its norm
$\|T_d g\|_{b^{s + (d-1)/2}_{pq}}$ is bounded by $\|g\|_{B^s_{pq}}$. Further, it is fairly clear that for any $q > 0$,
\[
\|T_d g\|_{b^{s + (d-1)/2}_{p\infty}} \le \|T_d g\|_{b^{s + (d-1)/2}_{pq}} \le C\, \|g\|_{b^s_{pq}}.
\]
Following Donoho (1993), one may bound the number of coefficients, at a given scale $j$,
whose absolute values exceed a certain threshold $\epsilon > 0$ as displayed below:
Thus, the number of ridgelet coefficients at scale $j$ that exceed $\epsilon > 0$ in absolute value is
bounded as follows:
since the number of directions $u_{ji}$ is bounded by $C\, 2^{j(d-1)}$ for any scale $j$. Summing (6.11)
over scales $j$, a bit of algebra shows that (provided $s > d(1/p - 1/2)_+$)
where $C$ possibly depends upon $d, s, p$ and the sampling resolution $b_0$. Let $p^*$ be defined
by $1/p^* = s/d + 1/2$. The inequality (6.12) merely states that the weak-$\ell^{p^*}$ norm of $\gamma$
is bounded by $\|T_d g\|_{b^{s + (d-1)/2}_{p\infty}}$ up to a multiplicative constant. The supporting Lemma 15
leads to the desired conclusion.
Of course, one would like to have a converse statement or, in other words, give lower
bounds on the squared error of approximation $\|f - \tilde{f}_N\|^2$. In the remarks directly following
the statement of our theorem, we have articulated a reasonable conjecture (6.9) expressing
the sharpness of the bound (6.8). Unfortunately, we are not able to prove this conjecture
yet. However, we have available a recipe for constructing radial functions whose ridgelet
coefficient sequences are not sparse. For instance, let $p_0$ be a real number such that $1/p_0 > 1/p^* =
s/d + 1/2$. Making use of Proposition 11 and Lemma 2 in Donoho (1993), one can find a
$\varphi \in B^s_{pq}$ ($\|\varphi\|_{B^s_{pq}} = 1$) such that the sequence of ridgelet coefficients is not in $w\ell^{p_0}$. Now,
the problem is that we don't have a converse to Lemma 15: that is, if we know something
about the rate of approximation by finite linear combinations of ridgelets (or their duals),
does it imply that the ridgelet coefficients have a certain decay? Let us be even more precise.
Suppose $f$ is such that there is a sequence $f_N = \sum_{i=1}^{N} \alpha_{iN}\, \psi_{\lambda_{iN}}$ with the property
\[
\|f - f_N\|_2 \le C\, N^{-r}.
\]
Does it imply that $\{\langle f, \psi_\lambda \rangle\} \in w\ell^p(\Lambda_d)$ with $r = (1/p - 1/2)_+$? In Approximation Theory,
this is often referred to as a Bernstein type of inequality, whereas Lemma 15 is a kind of
abstract Jackson inequality. Suppose the ridgelets were orthogonal; then the Bernstein
inequality would be trivial. The delicate issue here is, of course, that the ridgelets are
possibly linearly dependent. Work in this direction is in progress.
6.3 Examples

In this section, we consider a class of radial functions $(f_\alpha)$ defined by $f_\alpha(x) = (1 - \|x\|^2)_+^{\alpha}$.
Away from the sphere of radius $r = 1$, these functions are smooth, but they are singular across
this same sphere. Here, the index parameter $\alpha$ is simply the degree of the singularity (see
Definition 6). In all that follows, we will suppose $\alpha > -1/2$, so that $f_\alpha$ is square integrable.
First, a simple calculation shows that
\[
R_u f_\alpha(t) = |S^{d-2}| \int_t^1 (r^2 - t^2)^{(d-3)/2}\, (1 - r^2)^{\alpha}\, r\, dr
= c_d\, (1 - t^2)_+^{\alpha + (d-1)/2}.
\]
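The closed form for $R_u f_\alpha$ is easy to verify numerically in dimension $d = 2$, where the line integral can be computed directly; the value $c_2 = \sqrt{\pi}\, \Gamma(\alpha+1)/\Gamma(\alpha+3/2)$ used below is our own evaluation of the constant, not quoted from the text:

```python
import numpy as np
from math import gamma, sqrt, pi

alpha, t = 1.5, 0.4
h = sqrt(1 - t * t)
# Direct line integral of f_alpha(x) = (1 - |x|^2)_+^alpha over x . u = t in R^2
y = np.linspace(-h, h, 200001)
vals = (1 - t * t - y * y) ** alpha
direct = np.sum((vals[1:] + vals[:-1]) / 2) * (2 * h / (len(y) - 1))
# Closed form c_2 (1 - t^2)^{alpha + (d-1)/2} with d = 2
c2 = sqrt(pi) * gamma(alpha + 1) / gamma(alpha + 1.5)
closed = c2 * (1 - t * t) ** (alpha + 0.5)
print(direct, closed)
```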
We now check the sparsity of the representation of $f_\alpha$ in a ridgelet frame. Again, suppose
that we are using ridgelets such that their profiles $\varphi$ and $\psi$ have compact supports. We
will assume that $\varphi$ and $\psi$ are $R$ times differentiable and that $\psi$ has vanishing moments
through order $D$. Finally, let $p^*$ be defined by $1/p^* = 1/2 + (\alpha + 1/2)/(d - 1)$. We
show that the coefficients are in $w\ell^{p^*}$. If $\min(R, D)$ is large enough ($\min(R, D) \ge
\max(2/p^*, 2 + \alpha + (d-1)/2)$ suffices), we have
\[
\gamma_\lambda = \langle f_\alpha, \psi_\lambda \rangle = c_d \int (1 - t^2)_+^{\alpha + (d-1)/2}\, 2^{j/2}\, \psi(2^{j} t - k b_0)\, dt,
\]
\[
|\gamma_\lambda| \le C_d\, 2^{-j(1/2 + \alpha + (d-1)/2)}\, \bigl(1 + \bigl| |k b_0| - 2^{j} \bigr| \bigr)^{-2/p^*}
= C_d\, 2^{-j(d-1)/p^*}\, \bigl(1 + \bigl| |k b_0| - 2^{j} \bigr| \bigr)^{-2/p^*}. \tag{6.13}
\]
The proof of the first inequality is an application of integration by parts, and we omit it.
So, let $\epsilon > 0$ and let us bound the number of coefficients that are greater than or equal to $\epsilon$
in absolute value. For a given scale $j$ and orientation $u_{ji}$, it follows from (6.13) that
where the constants may depend on the parameter $b_0$. Moreover, (6.13) implies that
$\#\{k : |\gamma_\lambda| \ge \epsilon\} = 0$ whenever $j$ verifies $C_d\, 2^{-j(d-1)/p^*} < \epsilon$. Summing over scales and directions,
\[
\#\{\lambda : |\gamma_\lambda| \ge \epsilon\} = O(\epsilon^{-p^*}).
\]
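This counting can be mimicked numerically with model coefficients obeying exactly the bound (6.13); the dimension, $\alpha$ and $b_0$ below are arbitrary choices of ours:

```python
import numpy as np

d, alpha, b0 = 2, 0.5, 1.0
pstar = 1.0 / (0.5 + (alpha + 0.5) / (d - 1))     # here p* = 2/3

def count_exceedances(eps, jmax=40):
    """#{lambda : |gamma_lambda| >= eps} for coefficients saturating (6.13)."""
    total = 0
    for j in range(jmax):
        A = 2.0 ** (-j * (d - 1) / pstar)          # scale-j amplitude
        if A < eps:
            continue
        ndir = 2 ** (j * (d - 1))                  # ~ C 2^{j(d-1)} directions
        reach = (A / eps) ** (pstar / 2) - 1       # | |k b0| - 2^j | <= reach
        total += ndir * (2 * int(reach / b0) + 1)
    return total

# count * eps^{p*} should stay roughly constant across decades of eps.
ratios = [count_exceedances(e) * e ** pstar for e in (1e-2, 1e-3, 1e-4)]
print(ratios)
```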
Suppose one were to use wavelets to approximate the singular function $f_\alpha$ from the above
proposition. So let $(\varphi_k, \psi_{jk})$ ($\lambda = (j, k)$, $j \ge 0$, $k \in \mathbf{Z}^d$) be a nice isotropic wavelet basis in
$\mathbf{R}^d$ with compact support, enough regularity and vanishing moments. Standard techniques
show that for any $\alpha > 0$ the wavelet
approximation converges at the rate $O(N^{-(\alpha + 1/2)/(d-1)})$. In this case, it can be shown that
this rate cannot be improved.
6.4 Discussion

In this chapter, we have demonstrated in several places that ridgelets and multi-dimensional
wavelets are equally as good, at least asymptotically, at approximating radial functions. For
instance, in the previous section, we have shown explicitly that the rate of approximation
of, say, $f_\alpha(x) = (1 - \|x\|^2)_+^{\alpha}$, for $\alpha > -1/2$, using ridgelets coincides with the one obtained
using multi-dimensional wavelets, regardless of the degree $\alpha$ of the singularity and/or of
the dimension $d$. In fact, this conclusion is not limited to the case of radial functions. For
instance, one could consider a simple extension of our radial example: let $F$ be the set of
all diffeomorphisms $h : \mathbf{R}^d \to \mathbf{R}^d$ with, say, $h \in C^r$ for some $r > 0$ and such that all
the partial derivatives of $h$ up to order $r$ are bounded by some constant $C$. Now, for any
$h \in F$, consider $g = f_\alpha \circ h$. So, $g$ is a smooth function away from a hypersurface $\Sigma$ with
a minimum of curvature, and singular across $\Sigma$ (roughly speaking, one sees a singularity of
degree $\alpha$ when one crosses $\Sigma$ along a normal vector to the surface). It then turns out that if
$r$ is large enough, the conclusion of Proposition 12 remains unchanged for this larger class
of functions. That is, the rate of approximation of $g$ by ridgelets is $O(N^{-(\alpha+1/2)/(d-1)})$ and
so is the rate of convergence of wavelet approximations.
In the author's opinion, this result is most remarkable and surprising. The two dis-
cussed approximation schemes are, of course, very different: one builds up an
approximation using oscillatory bumps at various scales, whereas the other uses ridge
functions localized around strips of various scales. And yet, both seem to represent func-
tions that are smooth away from curved hypersurfaces, but that may be singular across
these hypersurfaces, with the same degree of accuracy.
We close this chapter by pointing out that our findings seem to contradict, at least
at a superficial level, the results obtained by Donoho and Johnstone (1989). In fact, the
model they consider is somewhat different, as they study the squared error of approximation
with respect to the Gaussian measure, whereas we are concerned with compactly supported
functions. Hence, their results are not inconsistent with ours. First, the behavior at infinity
of the functions they are treating may account for the divergence of our conclusions. Second,
their study only concerns the case of $\mathbf{R}^2$ and, perhaps, the effects they describe are specific
to this case.
Chapter 7
Concluding Remarks
7.1 Ridgelets and Traditional Neural Networks

The purpose of this section is to show that ridgelet approximations offer decisive superiority
over traditional neural networks. We recall that in the latter case one considers finite
approximations from $D_{NN} = \{\sigma(k \cdot x - b),\ k \in \mathbf{R}^d,\ b \in \mathbf{R}\}$, where $\sigma$ is the univariate
sigmoid. As we have already shown, neural networks cannot outperform ridgelets over the
classes $R^s_{pq}(C)$. However, outside of this paradigm, it is of interest to compare, for any
function $f$, the rates of approximation using either $D_{NN}$ or $D_{Ridgelets}$.

In Chapter 5, we defined the approximation error of a function $f$ by $N$ elements of
the dictionary $D$:
\[
d_N(f, D) = \inf_{(\alpha_i)_{i=1}^N}\ \inf_{(g_i)_{i=1}^N \subset D} \Bigl\| f - \sum_{i=1}^{N} \alpha_i g_i \Bigr\|_H.
\]
This measure is, indeed, insensitive to log-like factors. Now, one would like to know if there
exists a function $f$ for which
The answer to this question is generally negative. In order to give a precise statement,
we only consider finite linear approximations with bounded coefficients: i.e., we restrict the
minimization (7.1) to $|\alpha_i| \le C$ for some positive $C$. In practice, this restriction makes perfect
sense, since one is not going to store unbounded coefficients. Also, note that the relaxed
greedy algorithm (1.2)-(1.3) produces coefficients bounded by $1$. With this restriction, one
can show

Theorem 13 For any given $f$,
\[
r^*(f, D_{NN}) \le r^*(f, D_{Ridgelets}). \tag{7.2}
\]
in the sense of $L^2$. From the exact series displayed above, extract, as usual, the
finite sum $\sigma^0_N$ corresponding to the $N$ largest coefficients. The first part of the proof
consists of proving that
for any $s > 0$. To prove the claim, we will try to use a maximum number of results proved
in the previous chapters.

Let us consider for a moment the Meyer wavelet basis (Daubechies, 1992) $(\tilde{\varphi}_{0k}, \tilde{\psi}_{jk})$
on the real line. For scalars $a_0 > 0$ and $b_0$, $\sigma_{a_0 b_0}$ will denote the univariate function
$\sigma(a_0 (t - b_0))$. As one can guess, the wavelet coefficient sequence of $\sigma_{a_0 b_0} w$ is extremely
sparse. We claim that
\[
|\langle \sigma_{a_0 b_0} w,\, \tilde{\psi}_{jk} \rangle| \le
\begin{cases}
C\, 2^{-j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell}, & a_0 \ge 2^{j}, \\
C\, a_0\, 2^{-j}\, 2^{-j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell}, & a_0 \le 2^{j}.
\end{cases}
\]
Here the decay exponent $\ell$ can be taken arbitrarily large.
First, notice that $\sigma(t) = (1 + e^{-t})^{-1} = 1/2 + \varsigma(t)$ where $\varsigma$, according to Definition 6, is a
singularity of degree $1$ (up to a renormalizing constant). To see this, remark that $|\varsigma(t)|$
is obviously bounded by $|t|$ and an induction argument shows that for $n \ge 1$,
\[
\Bigl| \frac{d^n \varsigma}{dt^n}(t) \Bigr| \le C(n)\, e^{-|t|},
\]
which suffices to establish the singularity property. Then, for any positive $a_0$, $a_0^{-1} \varsigma(a_0 (t - b_0))$
is also a singularity of degree $1$. For a singularity of degree $1$, we know that
\[
|\langle a_0^{-1} \varsigma(a_0 (t - b_0)),\, \tilde{\psi}_{jk} \rangle| \le C\, 2^{-3j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell},
\]
where the constant $C$ may depend on $\ell$. Then, of course, $\tilde{\psi}_{jk}$ is orthogonal to the constant
function; hence,
which is the part of the result corresponding to the case $a_0 \le 2^{j}$. As far as the case $a_0 \ge 2^{j}$
is concerned, we observe that
\[
\frac{d}{dt}\, \sigma_{a_0 b_0}(t) = a_0\, \sigma'(a_0 (t - b_0))
\]
and, therefore, the above derivative is bounded in absolute value by $a_0\, e^{-a_0 |t - b_0|}$. Now, since
$\tilde{\psi}$ has one vanishing moment, a simple integration by parts gives
\[
|\langle \sigma_{a_0 b_0} w,\, \tilde{\psi}_{jk} \rangle| \le C\, 2^{-j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell},
\]
which proves the remaining part of the claim. It follows from the previous inequalities that, for
every choice of $p$, $\|\beta\|_{\ell^p}$ is finite and uniformly bounded (over all the possible choices of $a_0$
and $b_0$). Furthermore, it is trivial that $\alpha_k = \langle \sigma_{a_0 b_0}, \tilde{\varphi}_{0k} \rangle$ satisfies $|\alpha_k| \le \|\tilde{\varphi}\|_1$.
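The two ingredients used above, $|\varsigma(t)| \le |t|$ and the exponential decay of the derivatives of $\sigma$, can be verified numerically from the closed forms of the first three derivatives; the constants in the assertions are our own choices:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-30, 30, 6001)
s = sigma(t)
d1 = s * (1 - s)                        # sigma'
d2 = d1 * (1 - 2 * s)                   # sigma''
d3 = d2 * (1 - 2 * s) - 2 * d1 * d1     # sigma'''
decay = np.exp(-np.abs(t))
print(np.max(np.abs(d2) / decay), np.max(np.abs(d3) / decay))
```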
Now, back in terms of our frame decomposition, the property we have just shown is very
interesting, since it implies that the sequence $(\beta^0_\lambda)_{\lambda \in \Lambda_d}$ in (7.3) is also uniformly bounded
(over all choices of $\sigma^0$) in $\ell^p$. This merely follows from Chapter 5 (section 5.2.1), as it is a
consequence of Lemma 10. Then, Lemma 15 allows us to conclude and establish the validity
of (7.4).
Step 2. We are now in a position to finish up the proof of the theorem. We show that, for
any $r < r^*(f, D_{NN})$, there is a sequence $f_N$ of $N$-term ridgelet expansions with the property
\[
\|f - f_N\| = O(N^{-r}).
\]
It is not difficult to construct such a sequence, since by assumption we know the existence
of a sequence $g_m(x) = \sum_{i=1}^{m} \alpha_{im}\, \sigma(k_{im} \cdot x - b_{im})$ (that we shall write $g_m = \sum_i \alpha_{im} \sigma_{im}$),
such that
\[
\|f - g_m\|_2 \le C\, m^{-r'}
\]
with, say, $r' = (r + r^*(f, D_{NN}))/2$. For a fixed $N$, let $m$ be defined by $m = N^{1-\epsilon}$, where
$(1 - \epsilon) r' = r$. Next, define $f_N$ as follows:
\[
f_N = \sum_{i=1}^{m} \alpha_{im}\, \tilde{\sigma}_{im},
\]
where $\tilde{\sigma}_{im}$ is an $m$-term ridgelet expansion of $\sigma_{im}$ as in the first step. Then
\[
\|f - f_N\| = O(m^{-r'}) + m\, O(m^{-s}) = O(N^{-(1-\epsilon) r'}) + O(N^{-(1-\epsilon)(s-1)}) = O(N^{-r}),
\]
since $s$ may be chosen arbitrarily large (in particular $s - 1 > r'$). This finishes the proof of
the theorem.
Remark. In fact, the proof of the theorem shows that the assumption requiring the
coefficients $\alpha_i$ to be bounded may be dropped. The same result is true if we restrict the
coefficients to grow polynomially: that is, in the best approximation (7.1) we only require
$|\alpha_i| \le B N^{\beta}$ for some constants $B$ and $\beta$.
In this section, we will refer to this class as the Barron class $B(C)$ and to $\|\cdot\|_B$ as the
Barron norm. His work shows that
where the Fourier dictionary is, of course, $D_{Fourier} = \{e^{i 2\pi n \cdot x},\ n \in \mathbf{Z}^d\}$. Moreover, thresh-
olding the Fourier expansion (keeping the $N$ terms corresponding to the $N$ largest Fourier
coefficients) gives the optimal rate of approximation. In the next paragraph, we will show
how to use the tools developed in this thesis to prove that (7.7) is indeed a lower bound. As
far as the upper bound is concerned, observe that the Barron norm is actually equivalent
to
\[
\|f\|_B \asymp \sum_{n \in \mathbf{Z}^d} \|n\|\, |\hat{f}(2\pi n)|,
\]
as a simple consequence of a famous theorem due to Plancherel and Pólya (1938). Next,
a bound on the right-hand side of the above display implies a bound on the weak-$\ell^p$
quasi-norm (5.20), with $1/p = 1 + 1/d$, of the sequence $(|\hat{f}(2\pi n)|)_{n \in \mathbf{Z}^d}$. Thus, Fourier series outperform neural
networks over the $B(C)$ model of functions.
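The step from the weighted-$\ell^1$ bound to the weak quasi-norm is a counting argument that can be sketched numerically in $d = 2$: if $\sum \|n\|\, |c_n| = B$, then among the $k$ largest coefficients at least half the indices satisfy $\|n\| \gtrsim k^{1/2}$, forcing $|c|_{(k)} \le C\, B\, k^{-3/2} = C\, B\, k^{-1-1/d}$. The lattice truncation, the test coefficients and the constant below are ours:

```python
import numpy as np

# Truncated lattice Z^2 \ {0}; coefficients c_n = |n|^{-3} give a finite
# Barron-type sum B = sum over n of |n| * |c_n|.
R = 60
ii, jj = np.meshgrid(np.arange(-R, R + 1), np.arange(-R, R + 1))
norms = np.hypot(ii, jj).ravel()
norms = norms[norms > 0]
c = norms ** -3.0
B = np.sum(norms * c)
# Counting argument: |c|_(k) <= C * B * k^{-3/2}; C = 10 is safely enough here.
s = np.sort(c)[::-1]
k = np.arange(1, len(s) + 1)
print(np.max(s * k ** 1.5), 10 * B)
```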
In his analysis, Barron shows that the rate (7.6) may be obtained with expansions
of the form $f_N(x) = \sum_{i=1}^{N} \alpha_i\, \sigma(k_i \cdot x - b_i)$, where the $\alpha_i$'s may be restricted to satisfy
$\sum_{i=1}^{N} |\alpha_i| \le C'$ for some constant $C'$. Hence, a straightforward application of Theorem
13 gives at the minimum $d_N(B(C), D_{Dual\text{-}Ridge}) = O(N^{-s})$ for any $s < 1/2$. Moreover,
a bit of ridgelet analysis shows that the rate of approximation of the Barron class by any
reasonable dictionary (see Chapter 5, section 5.1) is bounded below by $O(N^{-1/2 - 1/d})$. The
reason is the following: let us consider a pair $(\varphi, \psi)$ satisfying the conditions (i)-(iv) listed
at the very beginning of Chapter 4 together with (2.7), namely,
\[
\frac{|\hat{\varphi}(\lambda)|^2}{\lambda^{d-1}} + \sum_{j \ge 0} \frac{|\hat{\psi}(2^{-j} \lambda)|^2}{|2^{-j} \lambda|^{d-1}} = 1.
\]
Then we have
\[
\|f\|_B = \int_{\mathbf{R}^d} \|\xi\|\, |\hat{f}(\xi)|\, d\xi
= \int_{\lambda > 0} \int_{S^{d-1}} |\hat{f}(\lambda u)|\, \lambda^{d}\, d\lambda\, du
\]
\[
= \int_{\lambda > 0} \int_{S^{d-1}} |\hat{f}(\lambda u)|\, |\hat{\varphi}(\lambda)|^2\, \lambda\, d\lambda\, du
+ \sum_{j \ge 0} 2^{j(d-1)} \int_{\lambda > 0} \int_{S^{d-1}} |\hat{f}(\lambda u)|\, |\hat{\psi}(2^{-j} \lambda)|^2\, \lambda\, d\lambda\, du
\]
\[
\le \Bigl( \int\!\!\int |\hat{\varphi}(\lambda)|^2 \lambda^2\, d\lambda\, du \Bigr)^{1/2} \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\varphi}(\lambda)|^2\, d\lambda\, du \Bigr)^{1/2}
+ \sum_{j \ge 0} 2^{j(d-1)} \Bigl( \int\!\!\int |\hat{\psi}(2^{-j} \lambda)|^2 \lambda^2\, d\lambda\, du \Bigr)^{1/2} \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\psi}(2^{-j} \lambda)|^2\, d\lambda\, du \Bigr)^{1/2}
\]
\[
\le C \Biggl( \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\varphi}(\lambda)|^2\, d\lambda\, du \Bigr)^{1/2}
+ \sum_{j \ge 0} 2^{j(d + 1/2)} \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\psi}(2^{-j} \lambda)|^2\, d\lambda\, du \Bigr)^{1/2} \Biggr)
= C\, \|f\|_{R^{d/2+1}_{2,1}}. \tag{7.8}
\]
Then, it follows from (7.8) that the Barron class $B(C)$ contains $R^{d/2+1}_{2,1}(C_1)$ for some con-
stant $C_1$, and since the rate of approximation of the latter class is bounded below by
$O(N^{-1/2 - 1/d})$ (Theorem 9), so is that of $B(C)$. We hope that this example emphasizes further the
use and the potential power of the ridgelet analysis that we have developed in this thesis.
where (respectively ~ ) stand for the ridgelet coe cients hf i (respectively the dual
ridgelet coe cientshf ~ i). A natural question is whether or not these two approaches
generally give the same results. That is, for a given f are the approximation errors kf ; f~N k
and kf ; fN k of about the same order? We know that
and, thus, the problem unquestionably takes place at the coefficient level. In this thesis, we have mainly studied dual-ridgelet expansions because one has direct access to the ridgelet coefficients, since the ridgelets are known explicitly. It is much more delicate to work out similar estimates for the dual coefficient sequence.
In this direction, it would be interesting to know if both sequences have the same structure; that is, let $\|\cdot\|$ be a norm on the sequence space, like (5.14) for example. Is the norm based on the ridgelet coefficient sequence equivalent to the one based on the dual sequence? In particular, in view of the previous display, one would like to know, for example, if, for a fixed $p > 0$,
$$\|\alpha\|_{\ell^p} \;\asymp\; \|\tilde\alpha\|_{\ell^p}, \qquad\mathrm{(7.9)}$$
the aim being to transfer knowledge about the size and/or organization of $\alpha$ to $\tilde\alpha$. In fact, one direction of (7.9) turns out to be easy. To fix ideas, suppose we are given a frame $(\psi_\gamma)_{\gamma\in\Gamma_d}$ of $L^2([0,2]^d)$ and a function $f$ in $L^2([0,1]^d)$ such that $\tilde\alpha \in \ell^p$. Let $0 \le w \le 1$ be a $C^\infty$ window, supported in $[0,2]^d$, whose restriction to $[0,1]^d$ is 1. From the frame decomposition
$$f \;=\; fw \;=\; \sum_{\gamma\in\Gamma_d} \alpha_\gamma\, \tilde\psi_\gamma \;=\; \sum_{\gamma\in\Gamma_d} \tilde\alpha_\gamma\, \psi_\gamma,$$
it follows that
$$\alpha_\gamma = \sum_{\gamma'\in\Gamma_d} \langle w\,\psi_{\gamma'}, \psi_\gamma\rangle\, \tilde\alpha_{\gamma'}
\qquad\text{and}\qquad
\tilde\alpha_\gamma = \sum_{\gamma'\in\Gamma_d} \langle w\,\tilde\psi_{\gamma'}, \tilde\psi_\gamma\rangle\, \alpha_{\gamma'}.$$
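The interplay between a frame and its dual in the display above can be checked numerically in finite dimensions. The following sketch is generic linear algebra, with the frame stored as rows of a matrix (all names are ours): it builds the canonical dual $\tilde\psi_\gamma = S^{-1}\psi_\gamma$ and verifies that both expansions reconstruct $f$.

```python
import numpy as np

rng = np.random.default_rng(1)
Psi = rng.standard_normal((10, 3))      # 10 frame vectors in R^3, as rows
S = Psi.T @ Psi                         # frame operator S = sum_g psi_g psi_g^T
Psi_dual = Psi @ np.linalg.inv(S)       # canonical dual frame, row by row

f = rng.standard_normal(3)
alpha = Psi @ f                         # coefficients <f, psi_g>
alpha_dual = Psi_dual @ f               # dual coefficients <f, psi~_g>

# f = sum_g <f, psi_g> psi~_g = sum_g <f, psi~_g> psi_g
print(np.allclose(Psi_dual.T @ alpha, f), np.allclose(Psi.T @ alpha_dual, f))
```

Both reconstructions succeed because $\sum_\gamma \langle f,\psi_\gamma\rangle \tilde\psi_\gamma = S^{-1} S f = f$.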
Now, the estimates obtained in section 5.2.1 imply that for any $p > 0$,
$$\|\alpha\|_{\ell^p} \;\le\; C_p\, \|\tilde\alpha\|_{\ell^p},$$
provided that $\psi$ has a sufficiently large number of derivatives and vanishing moments.
establish the converse, it would be su cient to show that, say,
X
w ~ = 0 0
0
with
X X
sup
0
j 0 jp _ sup0
j 0 jp Cp
which expresses that the dual frame has a sparse representation in the original one. The
author has proven some results in this direction but they are too fragmentary to be reported.
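The sup-row/sup-column condition above is a Schur-type test. A small numerical sketch (our own illustration, exercised for $p \ge 1$ only) shows why such a bound controls the action of a matrix on every $\ell^p$: for a matrix with geometric off-diagonal decay, the larger of the two sup-sums dominates the operator norm on $\ell^1$, $\ell^2$, and $\ell^\infty$.

```python
import numpy as np

def schur_bound(M):
    """Max of sup-column and sup-row absolute sums of M; by the Schur test
    (and interpolation) this bounds ||M||_{l^p -> l^p} for every p in [1, inf]."""
    A = np.abs(M)
    return max(A.sum(axis=0).max(), A.sum(axis=1).max())

n = 50
i, j = np.indices((n, n))
M = 0.5 ** np.abs(i - j)                 # geometric off-diagonal decay
x = np.random.default_rng(2).standard_normal(n)
for p in (1, 2, np.inf):
    assert np.linalg.norm(M @ x, p) <= schur_bound(M) * np.linalg.norm(x, p)
```

The $p=1$ and $p=\infty$ cases are the two sup-sums themselves; intermediate $p$ follow by interpolation, which is the mechanism invoked throughout this chapter.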
Following ideas developed in Donoho, Johnstone, Kerkyacharian, and Picard (1995) and Donoho and Johnstone (1995), we expect to derive the asymptotics of $R^*(\epsilon)$ (as $\epsilon \to 0$) for the classes $\mathcal{R}^s_{pq}$. More importantly, consider estimators of the form
$$\hat f \;=\; \sum_{\gamma\in\Gamma_d} \delta_\gamma\big(\langle \psi_\gamma, Y\rangle\big)\, \tilde\psi_\gamma,$$
where $(\psi_\gamma)_{\gamma\in\Gamma_d}$ is a nice ridgelet frame and where the functions $\delta_\gamma$ are scalar nonlinearities (hard/soft-thresholding, etc.) depending upon $\gamma$ and the parameters $s, p, q$. We then expect such estimators to be adaptively nearly minimax, possibly within log-like factors, for estimating objects from these classes. In short, thresholding the noisy ridgelet coefficients should be nearly optimal for the estimation of objects exhibiting the spatial inhomogeneities described in Chapter 4.
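The thresholding rules mentioned above are elementary to state in code. The sketch below is a generic illustration in a sequence model, not the thesis's estimator: it applies soft thresholding at the universal level $t = \sigma\sqrt{2\log n}$ to noisy coefficients of a sparse vector and compares mean squared errors.

```python
import numpy as np

def soft(y, t):   # soft-thresholding: shrink toward 0 by t, kill |y| <= t
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def hard(y, t):   # hard-thresholding: keep y where |y| > t, else 0
    return np.where(np.abs(y) > t, y, 0.0)

rng = np.random.default_rng(3)
theta = np.zeros(1024); theta[:8] = 5.0              # sparse "coefficients"
sigma = 0.1
y = theta + sigma * rng.standard_normal(theta.size)  # noisy observations
t = sigma * np.sqrt(2 * np.log(theta.size))          # universal threshold
print(np.mean((soft(y, t) - theta) ** 2) < np.mean((y - theta) ** 2))  # True
```

Because most coordinates are pure noise, thresholding wipes them out while paying only a small shrinkage price on the few large coefficients, which is the heuristic behind near-minimaxity over sparsity classes.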
APPENDIX A. PROOFS AND RESULTS 108
same flavor.
$$\Big| \int_{S^{d-1}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, \epsilon\, du \;-\; \int_{S^{d-1}} \int_{[\sigma_0, 2\sigma_0]} \sum_j |\hat f(\sigma u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma\, du \Big| \qquad\mathrm{(A.2)}$$
$$\le\; \int_{S^{d-1}} \sum_j \Big| \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, \epsilon \;-\; \int_{[\sigma_0, 2\sigma_0]} |\hat f(\sigma u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big|\, du$$
$$\le\; \epsilon \int_{S^{d-1}} \sum_j \Big( \int_{[\sigma_0, 2\sigma_0]} |\sigma|\, |\hat f(\sigma u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{[\sigma_0, 2\sigma_0]} \|\nabla \hat f(\sigma u)\|^2\, |\sigma|\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} du$$
$$\le\; \epsilon \int_{S^{d-1}} \Big( \int_{[\sigma_0, 2\sigma_0]} |\sigma|\, |\hat f(\sigma u)|^2 \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{[\sigma_0, 2\sigma_0]} \|\nabla \hat f(\sigma u)\|^2\, |\sigma| \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} du$$
$$\le\; \epsilon\, \Big( \int_{S^{d-1}}\!\int_{[\sigma_0, 2\sigma_0]} |\sigma|\, |\hat f(\sigma u)|^2 \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma\, du \Big)^{1/2} \Big( \int_{S^{d-1}}\!\int_{[\sigma_0, 2\sigma_0]} \|\nabla \hat f(\sigma u)\|^2\, |\sigma| \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma\, du \Big)^{1/2}.$$
Adding the coarse scale term amounts to substituting $\sum_j |\hat\psi(2^{-j}\sigma)|^2$ by $|\hat\varphi(\sigma)|^2 + \sum_{j\ge 0} |\hat\psi(2^{-j}\sigma)|^2$ in the last inequality. Recall that by assumption the pair $(\varphi, \psi)$ obeys (2.7), and suppose
$$\sup_\sigma\; \Big( |\hat\varphi(\sigma)|^2 + \sum_{j\ge 0} |\hat\psi(2^{-j}\sigma)|^2 \Big) \;\le\; B.$$
The last inequality of (A.2), together with the conditions on $\varphi$ and $\psi$, implies
$$\Big| \int_{S^{d-1}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, \epsilon\, du \;-\; \|\hat f\|_2^2 \Big|
\;\le\; \epsilon\, B\, \Big( \int\!\!\int |\sigma|\, |\hat f(\sigma u)|^2\, d\sigma\, du \Big)^{1/2} \Big( \int\!\!\int \|\nabla \hat f(\sigma u)\|^2\, |\sigma|\, d\sigma\, du \Big)^{1/2}$$
$$\le\; \epsilon\, B\sqrt{2}\, \|\hat f\|_2 \cdot 2\, \|\nabla \hat f\|_2 \;=\; 2\sqrt{2}\,\epsilon\, B\, \|\hat f\|_2\, \|\nabla \hat f\|_2. \qquad\mathrm{(A.3)}$$
$$\sum_j \sum_i \tfrac{1}{2} \Big( \int_{\mathbb{R}} |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{\mathbb{R}} |\hat f(\sigma_{ij} u)|^2\, |2^{-j}\sigma|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2}$$
$$\le\; \tfrac{1}{2} \Big( \int_{\mathbb{R}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{\mathbb{R}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |2^{-j}\sigma|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2}$$
for some constant $B'$ depending only on $\psi$. Therefore, we can deduce from all this that
$$\Big| \sum_\gamma |\alpha_\gamma|^2 \;-\; \frac{1}{2b_0} \int_{\mathbb{R}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, d\sigma \Big| \;\le\; \frac{C'}{b_0}\, \|\hat f\|_2^2 \qquad\mathrm{(A.4)}$$
for some constant $C'$.
Finally, combining (A.3) and (A.4) gives
$$\Big| \sum_\gamma |\alpha_\gamma|^2 \;-\; \frac{(2\pi)^d}{b_0}\, \|\hat f\|_2^2 \Big| \;\le\; \frac{2\sqrt{2}\,\epsilon\, C_0}{b_0}\, \|\hat f\|_2^2 \;+\; \frac{C'}{b_0}\, \|\hat f\|_2^2.$$
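The conclusion above is a frame-bound statement: the coefficient energy $\sum_\gamma |\alpha_\gamma|^2$ is trapped between $A\|f\|^2$ and $B\|f\|^2$. In finite dimensions this is easy to verify directly, the sharp bounds being the extreme eigenvalues of the frame operator. The sketch below is a generic check of that fact, not tied to the ridgelet frame.

```python
import numpy as np

rng = np.random.default_rng(0)
Psi = rng.standard_normal((12, 4))        # 12 frame vectors in R^4, as rows
eig = np.linalg.eigvalsh(Psi.T @ Psi)     # eigenvalues of the frame operator
A, B = eig[0], eig[-1]                    # sharp frame bounds

f = rng.standard_normal(4)
energy = np.sum((Psi @ f) ** 2)           # sum_g |<f, psi_g>|^2
print(A * (f @ f) <= energy <= B * (f @ f))  # True
```

The proof above establishes exactly this kind of two-sided control, with bounds degrading gracefully in the sampling parameters $\epsilon$ and $b_0$.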
Chapter 4
Proof of Proposition 8. The fact that $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ is a normed space is trivial. We now prove the completeness. Let $\{f_n\}_{n\ge 0}$ be a Cauchy sequence in $\mathcal{R}^s_{pq}(\mathbb{R}^d)$. Then, of course, $\{f_n\}_{n\ge 0}$ is a Cauchy sequence in $L^1(\mathbb{R}^d)$ and, therefore, converges to a limit $f$ in $L^1$. Let $w_j(f)(u,b)$ (resp. $v(f)(u,b)$) be $(R_u f * \psi_j)(b)$ (resp. $(R_u f * \varphi)(b)$). It follows that $w_j(f_n)(u,b)$ converges to $w_j(f)(u,b)$ a.e. (and similarly for $v(f_n)(u,b)$). On the other hand, $\{w_j(f_n)(u,b)\}_{n\ge 0}$ is a Cauchy sequence in $L^p(S^{d-1} \times \mathbb{R})$ and therefore the limiting function, which coincides with $w_j(f)(u,b)$, is in $L^p(S^{d-1} \times \mathbb{R})$. Now, it follows by standard arguments that $f$ belongs to $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ and that $f_n$ converges in $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ to $f$. Hence, $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ is complete.
Chapter 6
Lemma 16 The operator $U_d$ defined by
$$(U_d\, g)(t) \;=\; \int_0^1 (r - t)_+^{(d-3)/2}\, g(r)\, dr$$
is bounded from $B^s_{pq}(0,1)$ to $B^{s+(d-1)/2}_{pq}(0,1)$.
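The fractional-integral operator $U_d$ is easy to explore numerically. The sketch below is our own quadrature illustration (names and grid size are ours); for $d = 3$ the kernel is the indicator of $r > t$, so $U_3 g(t)$ reduces to $\int_t^1 g$.

```python
import numpy as np

def U_d(g, t, d, n=4001):
    """Trapezoid-rule approximation of (U_d g)(t) = int_0^1 (r-t)_+^((d-3)/2) g(r) dr."""
    r = np.linspace(0.0, 1.0, n)
    expo = (d - 3) / 2.0
    kernel = np.where(r > t, np.maximum(r - t, 0.0) ** expo, 0.0)
    vals = kernel * g(r)
    h = r[1] - r[0]
    return float(np.sum((vals[:-1] + vals[1:]) * h / 2.0))

one = lambda r: np.ones_like(r)
print(round(U_d(one, 0.5, 3), 3))   # int_{1/2}^1 1 dr -> 0.5
```

Averaging against the smooth kernel $(r-t)_+^{(d-3)/2}$ is what produces the $(d-1)/2$ gain of smoothness asserted by the lemma.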
Proof of Lemma. Let $(\varphi_{j_0 k})_k, (\psi_{jk})_{jk}$, $0 \le k < 2^j$, $j \ge j_0$, be a wavelet basis on the interval $(0,1)$. We will assume that the basis functions are $R$ times differentiable and of compact support (support width $\le C\,2^{-j}$). In addition, we will suppose that the $\psi_{jk}$'s have vanishing moments through order $D$. We will denote by $\alpha_{j_0 k} = \langle f, \varphi_{j_0 k}\rangle$ and $\beta_{jk} = \langle f, \psi_{jk}\rangle$ the wavelet coefficients of $f$. We recall that each Besov space $B^s_{pq}(0,1)$ with $(1/p - 1)_+ < s < \min(R, D)$ and $0 < p, q \le \infty$ is characterized by its coefficients, in the sense that
$$\|f\|_{B^s_{pq}(0,1)} \;\asymp\; \big\|(\alpha_{j_0 k})_k\big\|_{\ell^p} \;+\; \Big\| \Big( 2^{js}\, 2^{j(1/2 - 1/p)} \big( \sum_k |\beta_{jk}|^p \big)^{1/p} \Big)_{j \ge j_0} \Big\|_{\ell^q}, \qquad\mathrm{(A.5)}$$
the right-hand side of (A.5) being an equivalent norm to the norm of $B^s_{pq}(0,1)$.
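The sequence norm (A.5) is straightforward to evaluate for finitely many levels. In the sketch below (our own helper; the coefficient layout and names are assumptions, and only $p, q \ge 1$ are exercised), level $j$ carries the weight $2^{j(s + 1/2 - 1/p)}$:

```python
import numpy as np

def besov_seq_norm(alpha0, betas, s, p, q, j0=0):
    """||(alpha_{j0 k})_k||_p + || ( 2^{js} 2^{j(1/2-1/p)} ||beta_j||_p )_{j>=j0} ||_q."""
    coarse = np.linalg.norm(np.asarray(alpha0, dtype=float), ord=p)
    weights = [2.0 ** (j * (s + 0.5 - 1.0 / p)) *
               np.linalg.norm(np.asarray(b, dtype=float), ord=p)
               for j, b in enumerate(betas, start=j0)]
    return float(coarse + np.linalg.norm(np.asarray(weights), ord=q))

print(besov_seq_norm([1.0], [[1.0], [1.0]], s=1, p=2, q=1))  # -> 4.0
```

Bounding an operator in this norm amounts to bounding a weighted matrix acting on the coefficient arrays, which is how the proof proceeds.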
Further, suppose one wants to bound the norm of an operator $T : B^s_{pq}(0,1) \to B^{s'}_{pq}(0,1)$ for all possible choices of $1 \le p, q \le \infty$. (The norm being defined as $\sup \|Tf\|_{B^{s'}_{pq}} / \|f\|_{B^s_{pq}}$.) The theory of interpolation tells us that it is sufficient to check the boundedness for the cases $(p,q) \in \{(1,1), (1,\infty), (\infty,1), (\infty,\infty)\}$. That is what we shall verify for $U_d$. Now let
$g_{j_0}(t) = \int (y - t)_+^{(d-3)/2}\, \varphi_{j_0 0}(y)\, dy$ and $g_j(t) = \int (y - t)_+^{(d-3)/2}\, \psi_{j 0}(y)\, dy$. We have, for $t \in \mathbb{R}$,
$$g_{j_0}(t) \;=\; \int (y - t)_+^{(d-3)/2}\, 2^{j_0/2}\, \varphi(2^{j_0} y)\, dy.$$
Moreover, for $\ell \le \min(R, D)$ we have
$$g_j(t) \;=\; \int (y - t)_+^{(d-3)/2}\, 2^{j/2}\, \psi(2^{j} y)\, dy.$$
Then, of course,
$$|\gamma(\lambda, \lambda')| \;\le\; C\, 2^{-\min(j,j')(d-1)/2}\; 2^{-|j-j'|(n+1/2)}\; \Big( 1 + 2^{\min(j,j')}\, \big| k 2^{-j} - k' 2^{-j'} \big| \Big)^{-2},$$
where $\lambda = (j,k)$ and $\lambda' = (j',k')$. From this last inequality, it follows that
$$\sup_{k' \le 2^{j'}} \sum_{k \le 2^{j}} |\gamma(\lambda, \lambda')| \;\le\;
\begin{cases} C\, 2^{-j'(d-1)/2}\, 2^{-(j-j')(n-1/2)}, & j \ge j', \\ C\, 2^{-j(d-1)/2}\, 2^{-(j'-j)(n+1/2)}, & j' \ge j, \end{cases} \qquad\mathrm{(A.6)}$$
and
$$\sup_{k \le 2^{j}} \sum_{k' \le 2^{j'}} |\gamma(\lambda, \lambda')| \;\le\;
\begin{cases} C\, 2^{-j'(d-1)/2}\, 2^{-(j-j')(n+1/2)}, & j \ge j', \\ C\, 2^{-j(d-1)/2}\, 2^{-(j'-j)(n-1/2)}, & j' \ge j. \end{cases} \qquad\mathrm{(A.7)}$$
But (A.7) yields
$$\sup_{\lambda'} \sum_{\lambda} 2^{j(s+(d-1)/2-1/2)}\, 2^{-j'(s-1/2)}\, |\gamma(\lambda, \lambda')|
\;\le\; \sup_{j'} \Big( \sum_{j \le j'} 2^{-(j'-j)(s-1/2)}\, 2^{-(j'-j)(n+1/2)} \;+\; \sum_{j > j'} 2^{(j-j')(s+(d-1)/2-1/2)}\, 2^{-(j-j')(n-1/2)} \Big) \;\le\; C,$$
where the last inequality holds as long as $n > s + (d-1)/2$. This fact implies the boundedness for $(1,1)$.
Case $(\infty, \infty)$. This time,
$$\sup_{\lambda} \sum_{\lambda'} 2^{j(s+(d-1)/2-1/2)}\, 2^{-j'(s-1/2)}\, |\gamma(\lambda, \lambda')|
\;\le\; \sup_{j} \Big( \sum_{j' > j} 2^{(j-j')(s-1/2)}\, 2^{-(j'-j)(n-1/2)} \;+\; \sum_{j' \le j} 2^{(j-j')(s+(d-1)/2-1/2)}\, 2^{-(j-j')(n+1/2)} \Big) \;\le\; C,$$
where again, the last inequality is verified if $n > s + (d-3)/2$. The cases $(1,\infty)$ and $(\infty,1)$ are analogous. The lemma is proved.
References
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39, 930–945.
Benveniste, A., and Zhang, Q. (1992). Wavelet networks. IEEE Transactions on Neural Networks, 3, 889–898.
Bernier, D., and Taylor, K. F. (1996). Wavelets from square-integrable representations. SIAM J. Math. Anal., 27, 594–608.
Boas, R. P., Jr. (1952). Entire functions. New York: Academic Press.
Candes, E. J. (1996). Harmonic analysis of neural networks. To appear in Applied and Computational Harmonic Analysis.
Cheng, B., and Titterington, D. M. (1994). Neural networks: a review from a statistical perspective. With comments and a rejoinder by the authors. Stat. Sci., 9, 2–54.
Conway, J. H., and Sloane, N. J. A. (1988). Sphere packings, lattices and groups. New York: Springer-Verlag.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2, 303–314.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Daubechies, I., Grossmann, A., and Meyer, Y. (1986). Painless nonorthogonal expansions. J. Math. Phys., 27, 1271–1283.
Deans, S. R. (1983). The Radon transform and some of its applications. John Wiley & Sons.
DeVore, R. A., Oskolkov, K. I., and Petrushev, P. P. (1997). Approximation by feed-forward neural networks. Ann. Numer. Math., 4, 261–287.
DeVore, R. A., and Temlyakov, V. N. (1996). Some remarks on greedy algorithms. Adv. Comput. Math., 5, 173–187.
Donoho, D. L. (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Applied and Computational Harmonic Analysis, 1, 100–115.
Donoho, D. L. (1996). Unconditional bases and bit-level compression. Applied and Computational Harmonic Analysis, 3, 388–392.
Donoho, D. L. (1997). Fast ridgelet transform in two dimensions (Tech. Rep.). Department of Statistics, Stanford CA 94305-4065: Stanford University.
Donoho, D. L., and Johnstone, I. M. (1989). Projection-based approximation and a duality with kernel methods. Ann. Statist., 17, 58–106.
Donoho, D. L., and Johnstone, I. M. (1995). Empirical atomic decomposition.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: asymptopia? J. Roy. Statist. Soc. Ser. B, 57, 301–369.
Duffin, R. J., and Schaeffer, A. C. (1952). A class of nonharmonic Fourier series. Trans. Amer. Math. Soc., 72, 341–366.
Duflo, M., and Moore, C. C. (1976). On the regular representation of a nonunimodular locally compact group. J. Functional Analysis, 21, 209–243.
Feichtinger, H. G., and Gröchenig, K. (1988). A unified approach to atomic decompositions via integrable group representations. In Function spaces and applications (Lund, 1986). Berlin-New York: Springer.
Frazier, M., Jawerth, B., and Weiss, G. (1991). Littlewood-Paley theory and the study of function spaces (Vol. 79). Providence, RI: American Math. Soc.
Friedman, J. H., and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc., 76, 817–823.
Härdle, W. (1990). Applied nonparametric regression. Cambridge, England: Cambridge University Press.
Hasminskii, R., and Ibragimov, I. (1990). On density estimation in the view of Kolmogorov's ideas in approximation theory. Ann. Statist., 18, 999–1010.
Holschneider, M. (1991). Inverse Radon transforms through inverse wavelet transforms. Inverse Problems, 7, 853–861.
Huber, P. J. (1985). Projection pursuit, with discussion. Ann. Statist., 13, 435–525.
Jones, L. K. (1992a). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20, 608–613.
Jones, L. K. (1992b). On a conjecture of Huber concerning the convergence of projection pursuit regression. Ann. Statist., 15, 880–882.
Katznelson, Y. (1968). An introduction to harmonic analysis. New York: Wiley.
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Mallat, S., and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397–3415.
Meyer, Y. (1992). Wavelets and operators. Cambridge University Press.
Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8, 164–177.
Mhaskar, H. N., and Micchelli, C. A. (1995). Degree of approximation by neural and translation networks with a single hidden layer. Adv. in Appl. Math., 16, 151–183.
Montgomery, H. L. (1978). The analytic principle of the large sieve. Bull. Amer. Math. Soc., 84, 547–567.