Ridgelets: Theory and Applications
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
I certify that I have read this dissertation and that in my opinion
it is fully adequate, in scope and in quality, as a dissertation for
the degree of Doctor of Philosophy.
David L. Donoho
(Principal Adviser)
Iain M. Johnstone
George C. Papanicolaou
Abstract
Single hidden-layer feedforward neural networks have been proposed as an approach to
bypass the curse of dimensionality and are now becoming widely applied to approximation
or prediction in the applied sciences. In that approach, one approximates a multivariate target
function by a sum of ridge functions; this is similar to projection pursuit in the statistics
literature. This approach poses new and challenging questions at both the practical and
theoretical level, ranging from the construction of neural networks to their efficiency and
capability. The topic of this thesis is to show that ridgelets, a new set of functions, provide
an elegant tool to answer some of these fundamental questions.
In the first part of the thesis, we introduce a special admissibility condition for neural
activation functions. Using an admissible neuron, we develop two linear transforms, namely
the continuous and discrete ridgelet transforms. Both transforms represent quite general
functions f as a superposition of ridge functions in a stable and concrete way. A frame of
"nearly orthogonal" ridgelets underlies the discrete transform.
In the second part, we show how to use the ridgelet transform to derive new approxi-
mation bounds. That is, we introduce a new family of smoothness classes and show how
they model "real-life" signals by exhibiting some specific sorts of high-dimensional spatial
inhomogeneities. Roughly speaking, finite linear combinations of ridgelets are optimal for
approximating functions from these new classes. In addition, we use the ridgelet transform
to study the limitations of neural networks. As a surprising and remarkable example, we
discuss the case of approximating radial functions.
Finally, it is explained in the conclusion why these new ridgelet expansions offer decisive
improvements over traditional neural networks.
Acknowledgements
First, I would like to thank my advisor David Donoho, whose most creative and original
thinking has been a great source of inspiration for me. I admire his deep and penetrating
views on so many areas of the mathematical sciences and feel particularly indebted to him
for sharing his thoughts with me. Beyond the unique scientist, there is the friend whose
kindness and generosity throughout my stay at Stanford have been invaluable. I also extend
my gratitude to his wife, Miki.
I feel privileged to have had so many fantastic teachers and professors who nurtured
my love and interest for science. I owe special thanks to Patrick David and to Professor
Yves Meyer, who shared their enthusiasm with me, a quality that I hope will be a lifetime
companion.
I would also like to thank Professors Jerome Friedman, Iain Johnstone and George
Papanicolaou for serving on my orals committee and for having, together with Professor
Darrell Duffie, written letters of recommendation on my behalf.
I wish to thank all the people of the Department of Statistics for creating such a world-
class scientific environment in which it is so easy to blossom; especially the faculty, who
greatly enriched my scientific experience by exposing me to new areas of research.
A short acknowledgement seems very little to thank my parents for their constant
love and support, and for the never-failing confidence they had in me.
My days at Stanford would not have been the same without Helen, for the countless
little things she did so that I would feel "at home." I praise the courage she found to read
and suggest improvements to this manuscript.
Finally, my deepest gratitude goes to my wife, Chiara, whose encouragement, humor
and love have made these last four years a pure enjoyment.
Contents
Abstract iv
Acknowledgements v
1 Introduction 1
1.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Approximation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Statistical Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Projection Pursuit Regression (PPR) . . . . . . . . . . . . . . . . . . 4
1.3.2 Neural Nets Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Statistical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Harmonic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 A Continuous Representation . . . . . . . . . . . . . . . . . . . . . . 7
1.5.2 Discrete Representation . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.4 Innovations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 The Continuous Ridgelet Transform 13
2.1 A Reproducing Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 A Parseval Relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 A Semi-Continuous Reproducing Formula . . . . . . . . . . . . . . . . . . . 20
3 Discrete Ridgelet Transforms: Frames 23
3.1 Generalities about Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Discretization of Γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Concluding Remarks 98
7.1 Ridgelets and Traditional Neural Networks . . . . . . . . . . . . . . . . . . 98
7.2 What About Barron's Class? . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3 Unsolved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.4.1 Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . 105
7.4.2 Curved Singularities . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A Proofs and Results 107
References 113
List of Figures
2.1 Ridgelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Ridgelet discretization of the frequency plane . . . . . . . . . . . . . . . . . 25
Chapter 1
Introduction
Let f(x): R^d → R be a function of d variables. In this thesis, we are interested in
constructing convenient approximations to f using a system called neural networks. This
problem is of wide interest throughout the mathematical sciences and many fundamental
questions remain open. Because of the extensive use of neural networks, we will address
questions from various perspectives and use these as guidelines for the present work.
From a mathematical point of view, such approximations amount to taking finite linear
combinations of atoms from the dictionary D_Ridge = {σ(k·x − b) : k ∈ R^d, b ∈ R} of
elementary ridge functions. As is known, any function of d variables can be approximated
arbitrarily well by such combinations (Cybenko, 1989; Leshno, Lin, Pinkus, and Schocken,
1993). As far as constructing these combinations, a frequently discussed approach is the
greedy algorithm that, starting from f_0(x) = 0, operates in a stepwise fashion; running
through steps i = 1, ..., m, we inductively define

    f_i = arg min_{0 ≤ α ≤ 1} arg min_{(k,b) ∈ R^d × R} ‖f − (α f_{i−1} + (1 − α) σ(k·x − b))‖_2.    (1.3)
Thus, at the i-th stage, the algorithm substitutes for f_{i−1} a convex combination involving
f_{i−1} and a term from the dictionary D_Ridge that results in the largest decrease in approx-
imation error (1.3). It is known that when f ∈ L²(D) with D a compact set, the greedy
algorithm converges (Jones, 1992b); it is also known that for a relaxed variant of the greedy
algorithm, the convergence rate can be controlled under certain assumptions (Jones, 1992a;
Barron, 1993). There are unfortunately two problems with the conceptual basis of such
results.
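To make the discretization question concrete, the following toy sketch runs one step of a scheme in the spirit of (1.3) over finite grids of directions k, offsets b and mixing weights α. It is an illustration under our own assumptions, not the thesis' construction: the function names and all grid choices are ours.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def greedy_step(f_vals, fit_vals, X, K, B, alphas):
    """One (fully discretized) greedy step: scan finite grids of directions k,
    offsets b and convex weights alpha, and return the combination
    alpha*fit + (1-alpha)*sigmoid(k.x - b) with smallest empirical L2 error."""
    best = None
    for k in K:                      # candidate weight vectors
        proj = X @ k                 # k . x for every sample point
        for b in B:                  # candidate offsets
            atom = sigmoid(proj - b)
            for a in alphas:         # candidate convex weights
                resid = f_vals - (a * fit_vals + (1 - a) * atom)
                err = np.sqrt(np.mean(resid ** 2))
                if best is None or err < best[0]:
                    best = (err, a * fit_vals + (1 - a) * atom)
    return best                      # (error, updated approximant values)

# toy target: a single ridge function in d = 2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
f_vals = sigmoid(X @ np.array([2.0, -1.0]) - 0.3)

K = [np.array([2.0, -1.0]), np.array([1.0, 1.0])]
B = np.linspace(-1, 1, 9)
alphas = np.linspace(0, 1, 11)

err0 = np.sqrt(np.mean(f_vals ** 2))                 # error of f_0 = 0
err1, fit1 = greedy_step(f_vals, np.zeros(len(X)), X, K, B, alphas)
```

Even this crude grid search illustrates the point raised next: how fine the grids K, B must be for the discrete search to track the continuum minimization is exactly the open question.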
First, they lack the constructive character which one ordinarily associates with the
word "algorithm." In any assumed implementation of minimizing (1.3) one would need to
search for a minimum within a discrete collection of k and b. What are the properties of
procedures restricted to such collections? Or, more directly, how finely discretized must the
collection be so that a search over that collection gives results similar to a minimization over
the continuum? In some sense, applying the word "algorithm" to abstract minimization
procedures in the absence of an understanding of this issue is a misnomer.
Second, even if one is willing to forgive the lack of constructivity in such results, one
must still face the lack of stability of the resulting decomposition. An approximant f_N(x) =
Σ_{i=1}^N α_i σ(k_i·x − b_i) has coefficients which are in no way continuous functionals of f and
do not necessarily reflect the size and organization of f (Meyer, 1992).
    ‖f − f_N‖_2 ≤ C N^{−1/2},    (1.4)
where fN is the output of the algorithm at stage N . Their result, however, also raises a set
of challenging questions which we will now discuss.
The greedy algorithm. The work of DeVore and Temlyakov (1996) shows that the greedy
algorithm unfortunately has very weak approximation properties. Even when good approxi-
mations exist, the greedy algorithm cannot be guaranteed to find them, even in the extreme
case where f is just a superposition of a few, say ten, elements of our dictionary D_Ridge.
Neural nets for which functions? It can be shown that for the class Barron considers, a
simple N-term trigonometric approximation would give better rates of convergence, namely
O(N^{−(1/2+1/d)}) (and, of course, there is a real and fast algorithm). So, it would be of interest
to be able to identify functional classes for which neural networks are more efficient than
other methods of approximation, or more ambitiously a class F for which it could be proved
that linear combinations of elements of D_Ridge give the best rate of approximation over F.
    Y_i = f(X_i) + ε_i,    (1.5)

where ε_i is the noise contribution, one wishes to estimate the unknown smooth function f.
It is observed that well-known regression methods such as kernel smoothing, nearest-
neighbor, and spline smoothing (see Härdle, 1990 for details) may perform very badly in high
dimensions because of the so-called curse of dimensionality. The curse comes from the fact
that when dealing with a finite amount of data, the high-dimensional unit ball is mostly
empty, as discussed in the excellent paper of Friedman and Stuetzle (1981). In terms of
estimation bounds, roughly speaking, the curse says that unless you have an enormous
sample size N, you will get a poor mean-squared error, say.
where the u_j's are vectors of unit length, i.e., ‖u_j‖ = 1. The algorithm, the statistical
analogue of (1.2)-(1.3), also operates in a stepwise fashion. At stage m, it augments the fit
f_{m−1} by adding a ridge function g_m(u_m·x) obtained as follows: calculate the residuals of
the (m−1)-th fit, r_i = Y_i − Σ_{j=1}^{m−1} g_j(u_j·X_i); for a fixed direction u, plot the residuals r_i
against u·X_i and fit a smooth curve g; then choose the best direction u, so as to minimize the
residual sum of squares Σ_i (r_i − g(u·X_i))². The algorithm stops when the improvement
is small.
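The PPR step just described can be sketched as follows. This is a deliberately crude stand-in: a bin-average smoother replaces the scatterplot smoother actually used in PPR, and all names, data and grid choices are ours.

```python
import numpy as np

def ppr_step(X, resid, directions, h=0.2):
    """One projection pursuit step (sketch): for each trial direction u, fit a
    crude smoother g to the residuals against the projections u.X (here a
    moving bin average over windows of halfwidth h), and keep the direction
    whose fit leaves the smallest residual sum of squares."""
    best = None
    for u in directions:
        t = X @ u
        # bin-average smoother: g(t_i) = mean of residuals with |t_j - t_i| <= h
        g = np.array([resid[np.abs(t - ti) <= h].mean() for ti in t])
        rss = np.sum((resid - g) ** 2)
        if best is None or rss < best[0]:
            best = (rss, u, g)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
u0 = np.array([0.6, 0.8])               # true direction, unit length
y = np.sin(3 * (X @ u0))                # a single ridge surface
directions = [np.array([0.6, 0.8]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rss, u_hat, g_hat = ppr_step(X, y, directions)
```

For a target that is itself a ridge function, smoothing along the true direction explains essentially all the variation, while any other projection leaves the variance coming from the orthogonal coordinate, so the correct direction wins the residual-sum-of-squares comparison.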
The approach was revolutionary because instead of averaging the data over balls, PPR
performs a local averaging over narrow strips |u·x − t| ≤ h, thus avoiding the problems
related to the sparsity of the high-dimensional unit ball.
where k_j ∈ R^d and b_j ∈ R, so that the fit is exactly like (1.1). Again, the sigmoid is most
commonly used for σ.
Of course, PPR and neural net regression are of the same flavor, as both attempt to
approximate the regression surface by a superposition of ridge functions. One of the main
differences is perhaps that neural networks allow for a non-smooth fit, since σ(k·x − b)
resembles a step function when the norm ‖k‖ of the weights is large. On the other hand,
PPR can make better use of projections since it bears the freedom to choose a different
profile g at each step.
1.5 Achievements
This thesis is about the important issues that have just been raised. Our goal here is
to apply the concepts and methods of modern harmonic analysis to tackle these problems,
starting with the primary one: the problem of constructing neural networks.
Using techniques developed in group representation theory and wavelet analysis, we
develop two concrete and stable representations of functions f as superpositions of ridge
functions. We then use these new expansions to study finite approximations.
increasing, such an admissible activation function is oscillating, taking both positive and
negative values. In fact, our condition requires a number of vanishing moments
proportional to the dimension d, so that an admissible ψ has zero integral, zero "average
slope," zero "average curvature," etc., in high dimensions.
We show that if one is willing to abandon the traditional sigmoidal neural activation
function σ, which typically has no vanishing moments and is not in L², and replace it by an
admissible neural activation function ψ, then any reasonable function f may be represented
exactly as a continuous superposition from the dictionary D_Ridgelet = {ψ_γ : γ ∈ Γ} of
ridgelets ψ_γ(x) = a^{−1/2} ψ((u·x − b)/a), where the ridgelet parameter γ = (a, u, b) runs through the
set Γ = {(a, u, b) : a, b ∈ R, a > 0, u ∈ S^{d−1}}, with S^{d−1} denoting the unit sphere of R^d.
In short, we establish a continuous reproducing formula

    f = c_ψ ∫ ⟨f, ψ_γ⟩ ψ_γ μ(dγ).    (1.6)

These two formulas mean that we have a well-defined continuous ridgelet transform R(f)(γ) =
⟨f, ψ_γ⟩ taking functions on R^d isometrically into functions of the ridgelet parameter γ =
(a, u, b).
with equality in the L²(D) sense. This representation is stable in the sense that the co-
efficients change continuously under perturbations of f which are small in L²(D) norm.
Underlying the construction of such a discrete transform is, of course, a quasi-Parseval
relation, which in this case takes the form

    A ‖f‖²_{L²(D)} ≤ Σ_{γ ∈ Γ_d} |⟨f, ψ_γ⟩_{L²(D)}|² ≤ B ‖f‖²_{L²(D)}.    (1.9)

Equation (1.8) follows by use of the standard machinery of frames (Duffin and Schaeffer,
1952; Daubechies, 1992). Frame machinery also shows that the coefficients are realiz-
able as bounded linear functionals α_γ(f) having Riesz representers ψ̃_γ(x) ∈ L²(D). These
representers are not ridge functions themselves, but by the convergence of the Neumann series
underlying the frame operator, we are entitled to think of them as molecules made up of
linear combinations of ridge atoms, where the linear combinations concentrate on atoms
with parameters γ' "near" γ.
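A minimal finite-dimensional sketch of this frame machinery, under our own toy choices (a three-vector frame for R², not a ridgelet frame): the dual representers are obtained by inverting the frame operator S through the Neumann series alluded to in the text.

```python
import numpy as np

# A toy frame for R^2: three unit vectors (the "Mercedes-Benz" frame), for
# which the frame bounds coincide, A = B = 3/2. In general S is inverted by
# the Neumann series S^{-1} = (2/(A+B)) * sum_k (I - 2S/(A+B))^k.
atoms = np.array([[1.0, 0.0],
                  [-0.5, np.sqrt(3) / 2],
                  [-0.5, -np.sqrt(3) / 2]])

S = atoms.T @ atoms                      # frame operator as a 2x2 matrix
A, B = np.linalg.eigvalsh(S)             # frame bounds = extreme eigenvalues

def inv_neumann(S, A, B, n_terms=30):
    d = S.shape[0]
    T = np.eye(d) - (2.0 / (A + B)) * S  # contraction: ||T|| <= (B-A)/(B+A) < 1
    out, P = np.zeros_like(S), np.eye(d)
    for _ in range(n_terms):             # partial sums of the Neumann series
        out += P
        P = P @ T
    return (2.0 / (A + B)) * out

S_inv = inv_neumann(S, A, B)
duals = atoms @ S_inv                    # dual frame (the "molecules" above)

f = np.array([0.3, -1.2])
coeffs = atoms @ f                       # analysis: <f, psi_gamma>
f_rec = duals.T @ coeffs                 # synthesis with the dual frame
```

The same iteration converges for any frame with 0 < A ≤ B < ∞; the ratio (B − A)/(B + A) controls the convergence rate, which is why sharp frame bounds matter in practice.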
1.5.3 Applications
As a result of Chapters 2 and 3, we are, roughly speaking, in a position to efficiently
construct finite approximations by ridgelets which give good approximations to a given
function f ∈ L²(D). One can see where the tools we have constructed are heading: from
the exact series representation (1.8), one aims to extract a finite linear combination which
is a good approximation to the infinite series; once such a representation is available, one
has a stable, mathematically tractable method of constructing approximate representations
of functions f based on systems of neuron-like elements.
New functional classes. Rephrasing a comment made in Section 1.2, it is natural to ask
for which functional classes ridgelets make sense. That is, what are the classes they
approximate best? To explain further what we mean, suppose we are given a dictionary
D = {g_λ : λ ∈ Λ}. For a function f, we define its approximation error by N elements of the
dictionary D by

    inf_{(λ_i)_{i=1}^N} inf_{(α_i)_{i=1}^N} ‖f − Σ_{i=1}^N α_i g_{λ_i}‖_H ≡ d_N(f, D).    (1.10)
Suppose now that we are interested in the approximation of classes of functions, and
characterize the rate of approximation of a class F by N elements from D accordingly:
extract the N-term approximation f̃_N, where one keeps only the dual-ridgelet terms
corresponding to the N largest ridgelet coefficients ⟨f, ψ_γ⟩; then, the approximant f̃_N
achieves the optimal rate of approximation over our new classes.
In Chapter 4, we give a description of these new spaces in terms of the smoothness of the
Radon transform of f. Furthermore, we explain how these spaces model functions that are
singular across hyperplanes, where there may be an arbitrary number of hyperplanes which
may be located at any spatial positions and may have any orientations.
Specific examples. We study degrees of approximation over some specific examples. For
example, we will show in Chapter 5 that the goals set in Section 1.4 are fulfilled. Although
ridgelets are optimal for representing objects with singularities across hyperplanes, they
fail to represent efficiently singular radial objects (Chapter 6), i.e., when singularities are
associated with spheres and, more generally, with curved hypersurfaces. In some sense, we
cannot curve the singular sets.
Superiority over traditional neural nets. In neural networks, one considers approxima-
tions by finite linear combinations taken from the dictionary D_NN = {σ(k·x − b) : k ∈
R^d, b ∈ R}, where σ is the univariate sigmoid; see Barron (1993) for example. It is shown
that for any function f ∈ L²(D), there is a ridgelet approximation which is at least as good
as, and perhaps much better than, the best ideal approximation using neural networks.
1.5.4 Innovations
Underlying our methods is the inspiration of modern harmonic analysis: ideas like the
Calderón reproducing formula and the theory of frames. We shall briefly describe what is
new here, that which is not merely an "automatic" consequence of existing ideas.
First, there is, of course, a general machinery for getting continuous reproducing formu-
las like (1.6), via the theory of square-integrable group representations (Duflo and Moore,
1976; Daubechies, Grossmann, and Meyer, 1986). Such a theory has been applied to de-
velop wavelet-like representations over groups other than the usual ax + b group on R^d; see
Bernier and Taylor (1996). However, the particular geometry of ridge functions does not
allow the identification of the action of Γ on ψ with a linear group representation (notice
that the argument of ψ is real, while the argument of ψ_γ is a vector in R^d). As a conse-
quence, the possibility of a straightforward application of well-known results is ruled out.
As an example of the difference, our condition for admissibility of a neural activation func-
tion for the continuous ridgelet transform is much stronger, requiring about d/2 vanishing
moments in dimension d, than the usual condition for admissibility of the mother wavelet
for the continuous wavelet transform, which requires only one vanishing moment in any
dimension.
Second, in constructing frames of ridgelets, we have been guided by the theory of
wavelets, which holds that one can turn continuous transforms into discrete expansions
by adopting a strategy of discretizing frequency space into dyadic coronae (Daubechies,
1992; Daubechies, Grossmann, and Meyer, 1986); this goes back to Littlewood-Paley theory (Fra-
zier, Jawerth, and Weiss, 1991). Our approach indeed uses such a strategy for dealing with
the location and scale variables in the Γ_d dictionary. However, in dealing with ridgelets
there is also an issue of discretizing the directional variable u that seems to be a new ele-
ment: u must be discretized more finely as the scale becomes finer. The existence of frame
bounds under our discretization shows that we have achieved, in some sense, the "right"
discretization, and we believe this to be new and of independent interest.
Third, as emphasized in the previous two paragraphs, one has available a new tool
to analyze and synthesize multivariate functions. While wavelets and related methods
work well in the analysis and synthesis of objects with local singularities, ridgelets are
designed to work well with conormal objects: objects that are singular across some family
of hypersurfaces, but smooth along them. This leads to a more general, if superficial,
observation: the association between neural net representations and certain types of spatial
inhomogeneities seems, here, to be a new element.
Next, there is a serious attempt in this thesis to characterize and identify functional
classes that can be approximated by neural nets at a certain rate. Unlike well-grounded areas
of approximation theory, neural network theory does not solve the delicate characterization
issue. In wavelet or spline theory, it is well known that the efficiency of the approximation
is characterized by classical smoothness (Besov spaces). In contrast, in addressing the
characterization issues of neural net approximation it is necessary to abandon the classical
measure of smoothness. Instead, we propose a new one and define a new scale of spaces
based on our new definition. In addition to providing a characterization framework, these
spaces to our knowledge have not been studied in classical analysis, and their study may be of
independent interest.
We conclude this introduction by underlining perhaps the most important aspect of the
present thesis: ridgelet expansion and approximation are both constructive and effective
procedures, as opposed to the existential approximations commonly discussed in the neural
networks literature (see Section 1.1).
Chapter 2
The Continuous Ridgelet
Transform
In this chapter we present results regarding the existence and the properties of the contin-
uous representation (1.6). Recall that we have introduced the parameter space
Γ = {(a, u, b) : a, b ∈ R, a > 0, u ∈ S^{d−1}}
and the notation ψ_γ(x) = a^{−1/2} ψ((u·x − b)/a). Of course, the parameter γ = (a, u, b) has a nat-
ural interpretation: a indexes the scale of the ridgelet, u its orientation and b its location.
The measure μ(dγ) on the neuron parameter space Γ is defined by μ(dγ) = σ_d (da/a^{d+1}) du db, where
σ_d is the surface area of the unit sphere S^{d−1} in dimension d and du the uniform probability
measure on S^{d−1}. As usual, f̂(ξ) = ∫ e^{−iξ·x} f(x) dx denotes the Fourier transform of f,
which we also write F(f). To simplify notation, we will consider only the case of multivariate x ∈ R^d
with d ≥ 2. Finally, we will always assume that ψ: R → R belongs to the Schwartz space
S(R). The results presented here hold under weaker conditions on ψ, but we avoid the study
of various technicalities in this chapter.
We now introduce the key de nition of this chapter.
Definition 1 Let ψ: R → R satisfy the condition

    K_ψ = ∫ |ψ̂(ξ)|² / |ξ|^d dξ < ∞.    (2.1)
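As a numerical illustration of condition (2.1), one can take the Mexican-hat-type profile ψ̂(ξ) = ξ² e^{−ξ²/2} (two vanishing moments; this choice and the numerical grids are ours) and observe that K_ψ stays finite as the inner cutoff shrinks for small d, but blows up once d is too large relative to the number of vanishing moments, in line with the roughly d/2 moments required later.

```python
import numpy as np

def K_psi(d, eps, cutoff=10.0, n=200000):
    """Numerically integrate |psi_hat(xi)|^2 / |xi|^d over eps <= xi <= cutoff
    for psi_hat(xi) = xi^2 * exp(-xi^2 / 2) (two vanishing moments).
    Shrinking eps reveals whether the integral over R converges."""
    xi = np.linspace(eps, cutoff, n)
    integrand = xi ** 4 * np.exp(-xi ** 2) / xi ** d
    # trapezoidal rule, written out by hand; the integrand is even in xi,
    # so integrate over xi > 0 and double
    return 2.0 * float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(xi)) / 2)

# d = 2: the value stabilizes as eps -> 0 (psi is admissible in R^2) ...
vals_d2 = [K_psi(2, eps) for eps in (1e-2, 1e-4, 1e-6)]
# ... while for d = 6 it blows up near the origin (not admissible in R^6)
vals_d6 = [K_psi(6, eps) for eps in (1e-2, 1e-4, 1e-6)]
```

Near ξ = 0 the integrand behaves like |ξ|^{4−d}, so the integral converges exactly when d < 5: two vanishing moments buy admissibility only up to dimension four.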
Then, by Fubini,

    I = ∫∫∫ exp{iξ u·x} f̂(ξu) |ψ̂(aξ)|² (da/a^d) 1_{ξ>0} dξ σ_d du
      = ∫∫ exp{iξ u·x} f̂(ξu) K_ψ |ξ|^{d−1} 1_{ξ>0} dξ σ_d du
      = K_ψ ∫_{R^d} exp{ix·k} f̂(k) dk
      = K_ψ (2π)^d f(x).
Integral representations like (2.2) have been independently discovered by Murata (1996).
w_{a,u} is integrable, being the convolution of two integrable functions, and belongs to
L²(R) since ‖w_{a,u}‖₂ ≤ ‖f‖₁ ‖ψ_a‖₂; its Fourier transform is then well defined and ŵ_{a,u}(λ) =
\overline{ψ̂_a(λ)} f̂(λu). By the usual Plancherel theorem, ∫_R |w_{a,u}(b)|² db = (1/2π) ∫ |ŵ_{a,u}(λ)|² dλ and,
hence,

    I = (1/2π) ∫∫∫ |f̂(λu)|² |ψ̂_a(λ)|² σ_d (da/a^{d+1}) du dλ
      = (2/2π) ∫_{λ>0} ∫∫ |f̂(λu)|² |ψ̂(aλ)|² (da/a^d) σ_d du dλ,

using |ψ̂_a(λ)|² = a |ψ̂(aλ)|² and symmetry in λ. Since ∫ |ψ̂(aλ)|² da/a^d = K_ψ |λ|^{d−1} (admissibility), we have

    I = (σ_d K_ψ/π) ∫_{λ>0} ∫ |f̂(λu)|² λ^{d−1} dλ du = (K_ψ/π) (2π)^d ‖f‖₂².
The assumptions on f in the above two theorems are somewhat restrictive, and the
basic formulas can be extended to an even wider class of objects. It is classical to define
the Fourier transform first for f ∈ L¹(R^d) and only later to extend it to all of L² using the
fact that L¹ ∩ L² is dense in L². By a similar density argument, one obtains

Proposition 1 There is a linear transform R: L²(R^d) → L²(Γ, μ(dγ)) which is an L²-
isometry and whose restriction to L¹ ∩ L² satisfies

    R(f)(γ) = ⟨f, ψ_γ⟩.
For this extension, a generalization of the Parseval relation (1.7) holds.

Proposition 2 (Extended Parseval) For all f, g ∈ L²(R^d),

    ⟨f, g⟩ = c_ψ ∫ R(f)(γ) \overline{R(g)(γ)} μ(dγ).    (2.4)

Proof of Proposition 2. Notice that one needs only to prove the property for a dense
subspace of L²(R^d), i.e., L¹ ∩ L²(R^d). So let f, g ∈ L¹ ∩ L²; we can write

    ∫ R(f)(γ) \overline{R(g)(γ)} μ(dγ) = ∫∫ ⟨ψ_a ⋆ R_u f, ψ_a ⋆ R_u g⟩ σ_d (da/a^{d+1}) du = I.

Applying Plancherel,

    I = (1/2π) ∫∫ ⟨ψ̂_a R̂_u f, ψ̂_a R̂_u g⟩ σ_d (da/a^{d+1}) du
      = (1/2π) ∫∫∫ f̂(λu) \overline{ĝ(λu)} a |ψ̂(aλ)|² (da/a^{d+1}) σ_d du dλ.
is meaningful, since the coefficients ⟨f, ψ_γ⟩, γ ∈ Γ, are uniformly bounded over Γ_ε
(by a bound of the form ‖f‖₁ ε^{−1/2} ‖ψ‖_∞). The next theorem makes
more precise the meaning of the reproducing formula.
Theorem 3 Suppose f ∈ L¹ ∩ L²(R^d) and ψ admissible.
(1) f_ε ∈ L²(R^d), and
Step 1. We start by proving that f_ε ∈ L²(R^d). Notice that R_u(f ⋆ φ) = R_u f ⋆ R_u φ and
R_u φ(t) = (2π)^{−1/2} exp{−t²/2}. Now F(R_u f ⋆ R_u φ)(λ) = f̂(λu) exp{−λ²/2}. Repeating the
argument in the proof of Theorem 1, this allows the interpretation of f_ε as the "conjugate" Fourier
transform of an L² element and therefore the conclusion f_ε ∈ L²(R^d).
Step 2. We aim to prove that f_{ε,δ} → f_ε pointwise and in L²(R^d). The dominated conver-
gence theorem gives the pointwise limit. Then, by the Fourier transform isometry, we have
f_{ε,δ} → (2π)^{−d} \bar{F}(c_ε f̂) in L²(R^d). It remains to be proved that this limit, which we will
abbreviate g_ε, is indeed f_ε:

    |f_{ε,δ}(x) − f_ε(x)| = c_ψ |∫_{Γ_ε} (⟨f ⋆ φ_δ, ψ_γ⟩ − ⟨f, ψ_γ⟩) ψ_γ(x) μ(dγ)|
      ≤ c_ψ sup_x |ψ(x)| ∫_ε^{ε^{−1}} ∫_{S^{d−1}} ‖ψ_a ⋆ (R_u f ⋆ R_u φ_δ − R_u f)‖_∞ σ_d (da/a^{d+1}) du
      ≤ c_ψ ∫_ε^{ε^{−1}} ∫_{S^{d−1}} ‖ψ_a‖_∞ ‖R_u f ⋆ R_u φ_δ − R_u f‖₁ σ_d (da/a^{d+1}) du
      ≤ C(ε) ‖ψ‖₁ ‖ψ‖_∞ ∫_{S^{d−1}} ‖R_u f ⋆ R_u φ_δ − R_u f‖₁ σ_d du,

since ‖ψ_a‖_∞ = a^{−1/2} ‖ψ‖_∞ and ∫_ε^{ε^{−1}} a^{−1/2} da/a^{d+1} < ∞. From this bound, we obtain ‖f_{ε,δ} −
f_ε‖_∞ → 0 as δ → 0. Note that the convergence is in C(R^d) as the functions are continuous.
Finally, we get f_ε = g_ε and, therefore, f_ε is in L²(R^d) by completeness.
To show that kf" ;f k2 ! 0 as " ! 0, it is necessary and su cient to show that kfb" ;fbk2 ! 0,
Z
kf" ; f k2 = jfb(k)j2 (1 ; c"(kkk)2 dk:
b b 2
Recalling that 0 c" 1 and that c" " 1 as " ! 0, the convergence follows.
and the sense in which the above equation holds. Now, one can obtain a semi-continuous
version of (2.5) by replacing the continuous scale by a dyadic lattice. The motivation for
doing so will appear in the later chapters. Let us choose ψ such that

    Σ_{j∈Z} |ψ̂(2^{−j}ξ)|² / |2^{−j}ξ|^{d−1} = 1.    (2.6)
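Condition (2.6) is easy to check numerically for a concrete, if non-smooth, choice: take a triangular partition of unity w in the log₂-frequency scale and set |ψ̂(ξ)|² = |ξ|^{d−1} w(ξ). The window w and the grid below are our own stand-ins for a properly smooth profile and serve only as an illustration.

```python
import numpy as np

d = 3

def w(xi):
    """Triangular bump in the log2 scale: sum_j w(2**-j * xi) = 1 for xi != 0,
    because the hats centered at integer log2-frequencies partition unity."""
    return np.maximum(0.0, 1.0 - np.abs(np.log2(np.abs(xi))))

def psi_hat_sq(xi):
    # |psi_hat(xi)|^2 = |xi|**(d-1) * w(xi), so each term of (2.6)
    # reduces to w(2**-j * xi)
    return np.abs(xi) ** (d - 1) * w(xi)

xi = np.geomspace(0.1, 10.0, 50)            # test frequencies
total = sum(psi_hat_sq(2.0 ** -j * xi) / np.abs(2.0 ** -j * xi) ** (d - 1)
            for j in range(-10, 11))        # partial sum over dyadic scales
```

The |ξ|^{d−1} prefactor is exactly the device used in the text: it converts any square partition of unity on the dyadic scales into a profile satisfying (2.6).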
Of course, this condition greatly resembles the admissibility condition (2.1) introduced
earlier. If one is given a function ζ such that

    Σ_{j∈Z} |ζ̂(2^{−j}ξ)|² = 1,

it is immediate to see that ψ defined by ψ̂(ξ) = |ξ|^{(d−1)/2} ζ̂(ξ) will verify (2.6). Now, using
the same argument as for Theorems 1 and 3, the property (2.6) implies

    f ≐ Σ_{j∈Z} 2^{j(d−1)} ∫∫ ⟨f(x), 2^j ψ(2^j(u·x − b))⟩ 2^j ψ(2^j(u·x − b)) du db,

where again if f ∈ S(R^d), the equality holds in a pointwise way and, more generally, if
f ∈ L¹ ∩ L²(R^d), the partial sums of the right-hand side are square integrable and converge
to f in L². Finally, as in wavelet theory, it will be rather useful to introduce some special
coarse-scale ridgelets. We choose a profile φ so that
    |φ̂(ξ)|² / |ξ|^{d−1} = Σ_{j<0} |ψ̂(2^{−j}ξ)|² / |2^{−j}ξ|^{d−1}.    (2.7)

Notice, the above equality implies |φ̂(ξ)|² ≲ |ξ|^{d−1}, which is very much unlike Littlewood-
Paley or wavelet theory: our coarse-scale ridgelets are also oscillating, since φ̂ must have
some decay near the origin; that is, φ itself must have some vanishing moments. (In fact,
φ is "almost" an admissible neural activation function: compare with (2.1).)
For a pair (φ, ψ) satisfying (2.7), we have the following semi-continuous reproducing
formula:

    f ≐ ∫∫ ⟨f(x), φ(u·x − b)⟩ φ(u·x − b) du db
        + Σ_{j≥0} 2^{j(d−1)} ∫∫ ⟨f(x), ψ_j(u·x − b)⟩ ψ_j(u·x − b) du db,    (2.8)

where, as in Littlewood-Paley theory, ψ_j stands for 2^j ψ(2^j ·). At this point, the reader knows
in which sense (2.8) must be interpreted.
Chapter 3
Discrete Ridgelet Transforms:
Frames
The previous chapter described a class of neurons, the ridgelets {ψ_γ}_{γ∈Γ}, such that
(i) any function f can be reconstructed from the continuous collection of its coefficients
⟨f, ψ_γ⟩, and
(ii) any function can be decomposed into a continuous superposition of neurons ψ_γ.
The purpose of this chapter is to achieve similar properties using only a discrete set of
neurons Γ_d ⊂ Γ.
3.2 Discretization of Γ
The special geometry of ridgelets imposes differences between the organization of ridgelet
coefficients and the organization of traditional wavelet coefficients.
With a slight change of notation, we recall that ψ_γ = a^{1/2} ψ(a(u·x − b)). We are looking
for a countable set Γ_d and some conditions on ψ such that the quasi-Parseval relation (1.9)
holds. Let R(f)(γ) = ⟨f, ψ_γ⟩; then R(f)(γ) = ⟨R_u f, ψ_{a,b}⟩ with ψ_{a,b}(t) = a^{1/2} ψ(a(t − b)).
Thus, the information provided by a ridgelet coefficient R(f)(γ) is the one-dimensional
wavelet coefficient of R_u f, the Radon transform of f. Applying Plancherel, R(f)(γ) may
be expressed as

    R(f)(γ) = (1/2π) ⟨R̂_u f, ψ̂_{a,b}⟩ = (a^{−1/2}/2π) ∫ f̂(ξu) \overline{ψ̂(ξ/a)} exp{ibξ} dξ,    (3.3)

which corresponds to a one-dimensional integral in the frequency domain (see Figure 3.1).
In fact, it is the line integral of f̂ \overline{ψ̂_{a,0}}, modulated by exp{ibξ}, along the line {tu : t ∈
R}. If, as in the Littlewood-Paley theory (Frazier, Jawerth, and Weiss, 1991), a = 2^j and
supp(ψ̂) ⊂ [1/2, 2], it emphasizes a certain dyadic segment {ξ : 2^j ≤ ξ ≤ 2^{j+1}}. In contrast,
in the multidimensional wavelet case, where the wavelet ψ_{a,b} = a^{−d/2} ψ((x − b)/a) with a > 0 and
b ∈ R^d, the analogous inner product ⟨f, ψ_{a,b}⟩ corresponds to the average of f̂ \overline{ψ̂_a} over the
whole frequency domain, emphasizing the dyadic corona {ξ : 2^j ≤ |ξ| ≤ 2^{j+1}}.
[Figure 3.1: Ridgelet discretization of the frequency plane. Dyadic segments at radii 2^j,
2^{j+1}, 2^{j+2} correspond to different coefficient functionals; there are more segments at finer
scales.]
Now, the underlying object f̂ must certainly satisfy specific smoothness conditions in
order for its integrals on dyadic segments to make sense. Equivalently, in the original domain,
f must decay sufficiently rapidly at ∞. In this chapter, we take for our decay condition that
f be compactly supported, so that f̂ is band-limited. From now on, we will only consider
functions supported on the unit cube Q = {x ∈ R^d : ‖x‖_∞ ≤ 1} with ‖x‖_∞ = max_i |x_i|; thus
H = L²(Q).
Guided by the Littlewood-Paley theory, we choose to discretize the scale parameter a as
{a_0^j}_{j≥j_0} (a_0 > 1, j_0 being the coarsest scale) and the location parameter b as {k b_0 a_0^{−j}}_{k∈Z, j≥j_0}.
Our discretization of the sphere will also depend on the scale: the finer the scale, the finer
the sampling over S^{d−1}. At scale a_0^j, our discretization of the sphere, denoted Σ_j, is an
ε_j-net of S^{d−1} with ε_j = ε_0 a_0^{−(j−j_0)} for some ε_0 > 0. We assume that for any j ≥ j_0, the
sets Σ_j satisfy the following Equidistribution Property: two constants k_d, K_d > 0 must exist
s.t. for any u ∈ S^{d−1} and r such that ε_j ≤ r ≤ 2,

    k_d (r/ε_j)^{d−1} ≤ |{B_u(r) ∩ Σ_j}| ≤ K_d (r/ε_j)^{d−1}.    (3.4)

On the other hand, if r ≤ ε_j, then from B_u(r) ⊂ B_u(ε_j) and the above display, |{B_u(r) ∩
Σ_j}| ≤ K_d. Furthermore, the number of points N_j satisfies k_d ε_j^{−(d−1)} ≤ N_j ≤ K_d ε_j^{−(d−1)}.
Essentially, our condition guarantees that Σ_j is a collection of N_j almost equispaced points
on the sphere S^{d−1}, N_j being of order a_0^{(j−j_0)(d−1)}. The discrete collection of ridgelets is
then given by

    ψ_γ(x) = a_0^{j/2} ψ(a_0^j u·x − k b_0),  γ ∈ Γ_d = {(a_0^j, u, k b_0 a_0^{−j}) : j ≥ j_0, u ∈ Σ_j, k ∈ Z}.    (3.5)
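For d = 2 the discretization of the sphere is just a set of roughly equispaced points on the circle, and the cap-counting property (3.4) can be checked directly. The helper names and parameter values below are our own illustrative choices.

```python
import numpy as np

def sphere_net(j, j0=0, a0=2.0, eps0=0.5):
    """epsilon_j-net of the circle S^1 (d = 2): N_j roughly equispaced points
    with spacing ~ eps_j = eps0 * a0**-(j - j0); N_j grows like a0**(j - j0),
    i.e. the sampling of directions refines with the scale."""
    eps_j = eps0 * a0 ** -(j - j0)
    N_j = int(np.ceil(2 * np.pi / eps_j))
    theta = 2 * np.pi * np.arange(N_j) / N_j
    return np.column_stack([np.cos(theta), np.sin(theta)])

def cap_count(net, u, r):
    """Number of net points in the cap B_u(r) = {v : ||v - u|| <= r}."""
    return int(np.sum(np.linalg.norm(net - u, axis=1) <= r))

# the count in a cap of radius r scales like r / eps_j, as in (3.4) with d = 2
net = sphere_net(j=4)                    # eps_4 = 1/32, so N_4 = 202 points
u = net[0]
c1 = cap_count(net, u, 0.1)
c2 = cap_count(net, u, 0.4)
```

Doubling the scale index roughly doubles N_j here (d − 1 = 1); in dimension d the same construction multiplies the number of directions by a_0^{d−1} per scale, which is the directional refinement the text identifies as the new element.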
In our construction, the coarsest scale is determined by the dimension of the space R^d.
Defining ℓ as sup{2^{−k} : k ∈ N and 2^{−k} < (log 2)/(2d)}, we choose j_0 s.t. a_0^{j_0+1} ≤ ℓ < a_0^{j_0+2}. Finally,
we will set ε_0 = 1/2 so that ε_j = a_0^{−(j−j_0)}/2.
Remark. Here, we want to be as general as possible, and that is the reason why we do
not restrict the choice of a_0. However, in Littlewood-Paley or wavelet theory, a standard
choice corresponds to a_0 = 2 (dyadic frames). Likewise, and although we will prove that
there are frames for any choice of a_0, we will always take a_0 = 2 in the analysis we develop
in the forthcoming chapters.
Up to the normalizing factor (2π)^{-d} in each term, the key estimate is

    | Σ_{γ∈Γ_d} |⟨f, ψ_γ⟩|² - b_0^{-1} Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_R |f̂(ξu)|² |ψ̂(a_0^{-j}ξ)|² dξ |
      ≤ b_0 √( ∫_R Σ_{j≥j_0} Σ_{u∈Σ_j} |f̂(ξu)|² |ψ̂(a_0^{-j}ξ)|² dξ ) √( ∫_R Σ_{j≥j_0} Σ_{u∈Σ_j} |f̂(ξu)|² |a_0^{-j}ξ|² |ψ̂′(a_0^{-j}ξ)|² dξ ).    (3.7)
The argument is a simple application of the analytic principle of the large sieve (Mont-
gomery, 1978). Note that it presents an alternative to Daubechies' proof for one-dimensional
dyadic affine frames (Daubechies, 1992). We first recall an elementary lemma that we state
without proof.
Lemma 2 Let g be a real-valued function in C¹[0, δ] for some δ > 0; then

    |g(δ/2) - (1/δ) ∫_0^δ g(x) dx| ≤ (1/2) ∫_0^δ |g′(x)| dx.
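Lemma 2 is the familiar bound on the gap between a midpoint value and an average. A quick numerical sanity check, with g = sin chosen purely for illustration:

```python
import math

def midpoint_gap(g, dg, delta, n=20000):
    """Return (|g(delta/2) - average of g on [0, delta]|,
    0.5 * integral of |g'| on [0, delta]), the two sides of Lemma 2,
    approximated by midpoint Riemann sums."""
    xs = [(i + 0.5) * delta / n for i in range(n)]
    avg = sum(g(x) for x in xs) / n
    lhs = abs(g(delta / 2) - avg)
    rhs = 0.5 * sum(abs(dg(x)) for x in xs) * delta / n
    return lhs, rhs

lhs, rhs = midpoint_gap(math.sin, math.cos, delta=1.0)
# the midpoint value sin(1/2) differs from the mean 1 - cos(1) by far
# less than half the total variation of sin on [0, 1]
```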
Again, let ψ_j(x) be a_0^{j/2} ψ(a_0^j x). The ridgelet coefficient is then ⟨f, ψ_γ⟩ = (R_u f ⋆ ψ_j)(k b_0 a_0^{-j}).
Writing F_j = R_u f ⋆ ψ_j and applying Lemma 2 on each interval of length b_0 a_0^{-j},

    |F_j(k b_0 a_0^{-j}) - (a_0^j/b_0) ∫_{(k-1/2)b_0 a_0^{-j}}^{(k+1/2)b_0 a_0^{-j}} F_j(b) db| ≤ (1/2) ∫_{(k-1/2)b_0 a_0^{-j}}^{(k+1/2)b_0 a_0^{-j}} |F_j′(b)| db.

Hence, if we sum the above expression over u ∈ Σ_j and j and apply the Cauchy-Schwarz
inequality to the right-hand side, we get the desired result.
We then show that there exist A′, B′ > 0 s.t. for any f ∈ L²(Q) we have

    A′ ‖f̂‖₂² ≤ Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_{-∞}^{∞} |f̂(ξu)|² |ψ̂(a_0^{-j}ξ)|² dξ ≤ B′ ‖f̂‖₂²,    (3.8)

and that a similar upper bound, say B″‖f̂‖₂², holds for
Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_{-∞}^{∞} |f̂(ξu)|² |a_0^{-j}ξ|² |ψ̂′(a_0^{-j}ξ)|² dξ.
Lemma 3 Let F ∈ B²_π(R^d) (the space of L² functions of exponential type π) and let
{λ_m}_{m∈Z^d} be a sequence of points of R^d such that sup_{m∈Z^d} ‖λ_m - m‖ ≤ ℓ < (log 2)/(πd); then

    (1 - δ_ℓ)² π^{-d} ‖F‖₂² ≤ Σ_{m∈Z^d} |F(λ_m)|² ≤ (1 + δ_ℓ)² π^{-d} ‖F‖₂²,    (3.11)

where δ_ℓ = e^{πℓd} - 1.

Proof of Lemma 3. The Plancherel-Pólya theorem (see Plancherel and Pólya, 1938, page
116) gives

    Σ_{m∈Z^d} |F(m)|² = π^{-d} ‖F‖₂².
Let k denote the usual multi-index (k₁, …, k_d) and let |k| = k₁ + ⋯ + k_d, k! = k₁! ⋯ k_d!
and x^k = x₁^{k₁} ⋯ x_d^{k_d}. For any k, ∂^k F is an entire function of type π. Moreover, Bernstein's
inequality gives ‖∂^k F‖₂ ≤ π^{|k|} ‖F‖₂; see Boas (1952, page 211) for a proof. Since F is an entire
function of exponential type, F is equal to its absolutely convergent Taylor expansion.
Letting s be a constant to be specified below, we have

    F(λ_m) - F(m) = Σ_{|k|≥1} (∂^k F(m)/k!) (λ_m - m)^k.

By the Cauchy-Schwarz inequality, splitting each term with weights s^{|k|},

    |F(λ_m) - F(m)|² ≤ ( Σ_{|k|≥1} |∂^k F(m)|² s^{2|k|}/k! ) ( Σ_{|k|≥1} ℓ^{2|k|}/(k! s^{2|k|}) ).

Summing over m, and applying the Plancherel-Pólya identity to each ∂^k F together with
Bernstein's inequality, gives

    Σ_{m∈Z^d} |F(λ_m) - F(m)|² ≤ π^{-d} ‖F‖₂² (e^{dπ²s²} - 1)(e^{dℓ²/s²} - 1).

We choose s² = ℓ/π. If δ_ℓ = e^{πℓd} - 1 < 1 then

    Σ_{m∈Z^d} |F(λ_m) - F(m)|² ≤ δ_ℓ² π^{-d} ‖F‖₂²,

and, by the triangle inequality, the expected result follows.
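The stability asserted by Lemma 3 can be illustrated numerically in dimension d = 1 with the band-limited function F(x) = sin(πx)/(πx), for which the sum of |F(m)|² over the integers equals 1. The jitter below is an arbitrary choice; the sketch only checks the frame-style stability of the perturbed sampling sums, not the exact constants.

```python
import math, random

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

# F(x) = sinc(x) is of exponential type pi with sum_m |F(m)|^2 = 1.
# Perturb each integer sample point by at most ell, in the spirit of
# Lemma 3 (d = 1, delta_ell = e^{pi * ell} - 1).
random.seed(0)
ell = 0.1
delta_ell = math.exp(math.pi * ell) - 1
S0 = sum(sinc(m) ** 2 for m in range(-200, 201))
S1 = sum(sinc(m + random.uniform(-ell, ell)) ** 2 for m in range(-200, 201))
# the perturbed sum stays within the (1 +- delta_ell) band
stable = abs(math.sqrt(S1) - math.sqrt(S0)) <= delta_ell * math.sqrt(S0)
```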
A measure μ on R^d will be called ℓ-uniform if μ(Q_{z_m}(ℓ)) = (2ℓ)^d for each m, where
Q_{z_m}(ℓ) denotes the cube of sidelength 2ℓ centered at the grid point z_m. The following result
is completely equivalent to the previous theorem.

Corollary 1 Fix ℓ < (log 2)/(πd) with 1/(2ℓ) an integer. Let F ∈ B²_π(R^d) and let μ be an
ℓ-uniform measure. Then

    (1 - δ_ℓ)² π^{-d} ‖F‖₂² ≤ ∫ |F|² dμ ≤ (1 + δ_ℓ)² π^{-d} ‖F‖₂².
We have

    μ(Δ_x(ℓ, ℓ/2)) = Σ_{j≥j_0} Σ_{u∈Σ_j} ∫_R |ψ̂(a_0^{-j}ξ)|² 1_{Δ_x(ℓ,ℓ/2)}(ξu) dξ.

For the scales j ≥ j_x, the equidistribution property (3.4) bounds the number of directions
of Σ_j falling in the sector, and a change of variables reduces the resulting sums to dilates
of ψ̂. Now, since by assumption ℓ ≤ λ, we have for all |ξ| ∈ [λ, λ + 2ℓ] that
ℓ a_0^{-(j_0+1)} ≤ a_0^{-j_x}|ξ| ≤ 2ℓ a_0^{-j_0}. We recall that ℓ a_0^{-(j_0+1)} ≤ 1. Therefore,

    μ(Δ_x(ℓ, ℓ/2)) ≥ k_d (a_0^{-j_0}ℓ)^{d-1} 2ℓ  inf_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ 2ℓ a_0^{-j_0}} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}
                  ≥ k_d (a_0^{-j_0}ℓ)^{d-1} 2ℓ  inf_{1 ≤ |ξ| ≤ a_0} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}.
Similarly, we have

    Σ_{j≥j_x} Σ_{u∈Σ_j} ∫_R |ψ̂(a_0^{-j}ξ)|² 1_{Δ_x(ℓ,ℓ/2)}(ξu) dξ
      ≤ K_d (a_0^{-j_0}ℓ)^{d-1} 2^{d-1} 2ℓ  sup_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ 2ℓ a_0^{-j_0}} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}
      ≤ K_d (a_0^{-j_0}ℓ)^{d-1} 2^{d-1} 2ℓ  sup_{1 ≤ |ξ| ≤ a_0} Σ_{j′∈Z} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}.
We finally consider the case of the j's s.t. j_0 ≤ j < j_x. We recall that in this case, we have
|{B_u(ℓ/2) ∩ Σ_j}| ≤ K_d, and thus

    Σ_{j_0≤j<j_x} Σ_{u∈Σ_j} ∫_R |ψ̂(a_0^{-j}ξ)|² 1_{Δ_x(ℓ,ℓ/2)}(ξu) dξ ≤ K_d Σ_{j_0≤j<j_x} ∫_{λ}^{λ+2ℓ} |ψ̂(a_0^{j_x-j} a_0^{-j_x} ξ)|² dξ
      ≤ K_d 2ℓ  sup_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ 2ℓ a_0^{-j_0}} Σ_{j′>0} |ψ̂(a_0^{j′}ξ)|²
      ≤ K_d 2ℓ  sup_{1 ≤ |ξ| ≤ a_0} Σ_{j′>0} |ψ̂(a_0^{j′}ξ)|².
Finally, we need to prove the result for the cube Q_0(ℓ). In order to do so, we need to
establish two last estimates:

    μ(B_0(ℓ)) = Σ_{j≥j_0} |Σ_j| ∫_{|ξ|≤ℓ} |ψ̂(a_0^{-j}ξ)|² dξ
      ≥ k_d Σ_{j≥j_0} a_0^{(j-j_0)(d-1)} ∫_{|ξ|≤ℓ} |ψ̂(a_0^{-j}ξ)|² dξ
      = k_d ∫_{|ξ|≤ℓ} |ξ|^{d-1} Σ_{j′≥0} |ψ̂(a_0^{-j′} a_0^{-j_0} ξ)|² / |a_0^{-j′} a_0^{-j_0} ξ|^{d-1} dξ
      ≥ k_d ∫_{ℓ/a_0 ≤ |ξ| ≤ ℓ} |ξ|^{d-1} Σ_{j′≥0} |ψ̂(a_0^{-j′} a_0^{-j_0} ξ)|² / |a_0^{-j′} a_0^{-j_0} ξ|^{d-1} dξ
      ≥ k_d 2ℓ(1 - 1/a_0) (ℓ a_0^{-(j_0+1)})^{d-1}  inf_{ℓ a_0^{-(j_0+1)} ≤ |ξ| ≤ ℓ a_0^{-j_0}} Σ_{j′≥0} |ψ̂(a_0^{-j′}ξ)|² / |a_0^{-j′}ξ|^{d-1}.

Again, let {x_j}_{1≤j≤J} with ‖x_j‖ = ℓ be s.t. Q_0(ℓ) ⊂ ∪_{1≤j≤J} Δ_{x_j}(ℓ, ℓ/(2‖x_j‖)) ∪ B_0(ℓ), and let
T_d′ be the minimum number of x_j's needed. We then have

    0 < μ(B_0(ℓ)) ≤ μ(Q_0(ℓ)) ≤ μ(B_0(ℓ)) + T_d′ sup_{‖x‖≥ℓ} μ(Δ_x(ℓ, ℓ/(2‖x‖))) < ∞.
3.6 Discussion
3.6.1 Coarse Scale Refinements
In Neural Networks, the goal is to synthesize or represent a function as a superposition of
neurons from the dictionary D_Ridge = {σ(k·x - b), k ∈ R^d, b ∈ R}, the activation function
σ being fixed. That is, all the elements of D_Ridge have the same profile σ. Likewise, as we
wanted to keep this property, there is a unique profile ψ for all the elements of our ridgelet
frame. However, it will be rather useful to introduce a different profile φ for the coarse-
scale elements. For instance, following section 2.3, let us consider a function φ satisfying
the following assumptions:

    φ̂(ξ)/|ξ|^{(d-1)/2} = O(1) and |φ̂(ξ)|/|ξ|^{(d-1)/2} ≥ c if |ξ| ≤ 1,
    φ̂(ξ) = O((1 + |ξ|)^{-2}).
(where again Σ_j is a set of "quasi-equidistant" points on the sphere, the resolution being
ε_0 2^{-j}) is a frame for L²(Q). The advantage of this description over the other (3.5) is the
fact that the coarse scale corresponds to j = 0 (and not to some awkward index j_0 which
depends on the dimension). In our applications, we shall generally use (3.13) for its greater
convenience. As we will see, in addition to the frameability condition, we often require φ and ψ
to have some regularity and a few vanishing moments.
We close this section by introducing a few notations that we will use throughout the
rest of the text. It will be helpful to have a notation for φ(u_i·x - k b_0); we allow this
abuse by saying that φ(u_i·x - k b_0) corresponds to the scale j = -1. For j ≥ -1 then,
denote by Γ_j the index set for the j-th scale,

    Γ_j = {(j, u_i^j, k) : u_i^j ∈ Σ_j, k ∈ Z}.    (3.14)
    {φ(x₁ cos θ_i + x₂ sin θ_i - k b_0),  2^{j/2} ψ(2^j(x₁ cos θ_i^j + x₂ sin θ_i^j) - k b_0) :
        j ≥ 0, θ_i^j = 2π ε_0 i 2^{-j}, k ∈ Z}    (3.15)

where at scale j the θ_i^j's are equispaced points on the circle, with step 2π ε_0 2^{-j} (take ε_0^{-1}
to be an integer).
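For concreteness, the angular grid of (3.15) can be generated as follows; the function name is ours and ε_0 = 1/2 is an assumed (legitimate) choice.

```python
import math

def angles_at_scale(j, eps0=0.5):
    """Equispaced directions theta_i^j = 2*pi*eps0*i*2**-j of the
    two-dimensional frame (3.15); there are 2**j / eps0 of them."""
    n = int(round(2 ** j / eps0))  # eps0^{-1} assumed to be an integer
    return [2 * math.pi * eps0 * i * 2 ** (-j) for i in range(n)]

counts = [len(angles_at_scale(j)) for j in range(5)]
# counts == [2, 4, 8, 16, 32]: the angular resolution doubles with the scale
```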
Proposition 5 Let (ψ_γ) be the frame defined by (3.15) and suppose that the pair (φ, ψ)
satisfies (2.7), say. Then,

    | Σ_γ |⟨f, ψ_γ⟩|² - (2π b_0 ε_0)^{-1} ‖f̂‖₂² | ≤ C (ε_0^{-1} + b_0^{-1}) ‖f̂‖₂²    (3.16)

where the constant C depends at most upon φ and ψ. It follows from (3.16) that the frame
bounds ratio deviates from 1 by at most 2C(ε_0 + b_0) for the same constant C.
The result links the decay of the frame bounds ratio to the oversampling factor (ε_0 b_0)^{-1}.
The proof of the proposition may be found in the Appendix.
The oversampling. The redundancy of the frame that one can construct by this
strategy depends heavily on the quality of the underlying "quasi-uniform" sampling of the
sphere at each scale j. The construction of quasi-uniform discrete point sets on spheres
has received considerable attention in the literature; see Conway and Sloane (1988) and
additional references given in the bibliography. Quantitative improvements of our results
would follow from applying some of the known results obtained in that field.

Computations. Another area for investigation has to do with rapid calculation of
groups of coefficients. Note that if the sets Σ_j for j ≥ j_0 present some symmetries, it
may not be necessary to compute ψ̃_γ for all γ ∈ Γ_d; many dual elements would simply
be translations, rotations and rescalings of each other. This type of relationship would be
important to pursue for practical applications.
Taking the coarse scale into account amounts to adding (in the previous display) |f̂(ξu)|² |φ̂(ξ)|²
to the quantity Σ_{j≥0} Σ_{u∈Σ_j} |2^{-j}ξ|^{2s} |f̂(ξu)|² |ψ̂(2^{-j}ξ)|² at each of its occurrences. Now, the
rest of the proof is absolutely identical to the one of Theorem 4. Our irregular sampling result
applies and gives the desired conclusion provided that ψ̂(ξ)/|ξ|^s satisfies the conditions of Definition
3.

Remark. Anticipating the next chapter, suppose for instance that φ and ψ satisfy the
conditions listed at the very beginning of Chapter 4, section 4.1; then our constructed L²
frame (Theorem 4) is also a frame for every H_0^s with s ≥ 0.
    ‖f - f_N‖₂ ≤ C N^{-r}.    (3.18)
CHAPTER 4. RIDGELET SPACES 40
These conditions are standard in Littlewood-Paley theory except for the first part of (iv).
Instead of requiring |φ̂| to be 1 in a neighborhood of the origin, we want |φ̂(ξ)|/|ξ|^{(d-1)/2} to
be 1 near the origin. As we now recall, the reason for this modification originates in section
1.3. At times, we will choose pairs (φ, ψ) satisfying the additional condition (2.7), namely

    (v)  |φ̂(ξ)|²/|ξ|^{d-1} + Σ_{j≥0} |ψ̂(2^{-j}ξ)|²/|2^{-j}ξ|^{d-1} = 1.
Recall that (v) allows us to represent any function in L¹ ∩ L²(R^d) as a semi-continuous super-
position of ridgelets (2.8), i.e.

    f = ∫∫ ⟨f(x), φ(u·x - b)⟩ φ(u·x - b) du db + Σ_{j≥0} 2^{j(d-1)} ∫∫ ⟨f(x), ψ_j(u·x - b)⟩ ψ_j(u·x - b) du db    (4.1)

where again the notation ψ_j stands for 2^j ψ(2^j ·).

Keeping in mind that R_u f denotes the Radon transform of f defined by

    R_u f(t) = ∫_{u·x=t} f(x) dx,

we now turn to the main definition of this section.
Definition 4 For s ≥ 0 and p, q > 0, we say that f ∈ R^s_{pq} if f ∈ L¹ and

    Ave_u ‖R_u f ⋆ φ‖_p < ∞  and  ( 2^{js} 2^{j(d-1)/2} (Ave_u ‖R_u f ⋆ ψ_j‖_p^p)^{1/p} )_{j≥0} ∈ ℓ^q(N).    (4.2)
Note that we have defined our space as a subset of L¹. The major reason is that we truly
understand the ridgelet transform for elements of L¹; it is beyond the scope of this thesis
to extend the transform to distributions. In our definition, it would be possible to drop the
membership in L¹, but the discussion that follows would have been more complicated and
even more technical without shedding further light on our business.
Next, we define

    ‖f‖_{R^s_{pq}} = Ave_u ‖R_u f ⋆ φ‖_p + { Σ_{j≥0} [ 2^{js} 2^{j(d-1)/2} (Ave_u ‖R_u f ⋆ ψ_j‖_p^p)^{1/p} ]^q }^{1/q}    (4.3)

and we will prove later that this is a norm for a suitable range of parameters s, p and q.
There is a homogeneous version of our definition where the norm (4.3) is replaced by

    ‖f‖_{Ṙ^s_{pq}} = { Σ_{j∈Z} [ 2^{js} 2^{j(d-1)/2} (Ave_u ‖R_u f ⋆ ψ_j‖_p^p)^{1/p} ]^q }^{1/q}.    (4.4)

To be consistent with the literature of Functional Analysis, we will call these classes Ṙ^s_{pq}.
Remark 1. It is not hard to show that the definition is independent of the choice of the
pair (φ, ψ) provided it satisfies (i)-(iv). Furthermore, it is not necessary to assume (ii)-(iii),
i.e., to assume that φ̂ and ψ̂ are compactly supported. As we will show later, one can
get equivalent definitions to (4.2)-(4.3) under the relaxed assumption that ψ has a sufficient
number of vanishing moments.
Remark 2. Like the Sobolev scales (respectively Besov scales), where a norm is defined
based on the Fourier transform (respectively wavelet transform), what we have done here is
merely to define a norm based on the ridgelet transform. Let the ridgelet coefficients be
defined by w_j(u, b)(f) = ⟨f(x), ψ_j(u·x - b)⟩ = R_u f ⋆ ψ_j(b) for j ≥ 0 (and v(u, b)(f) = ⟨f(x), φ(u·x - b)⟩).
The quantity ‖f‖_{R^s_{pq}} may simply be rewritten as

    ‖f‖_{R^s_{pq}} = ( ∫∫ |v(u,b)(f)|^p du db )^{1/p} + { Σ_{j≥0} [ 2^{js} 2^{j(d-1)/2} ( ∫∫ |w_j(u,b)(f)|^p du db )^{1/p} ]^q }^{1/q},

and

    ‖f‖^p_{R^s_{pp}} ≍ Ave_u ‖R_u f‖^p_{B^{s+(d-1)/2}_{pp}}.    (4.5)
We explain further what we mean by (4.5). In fact, one can easily be convinced that the
R^s_{pq} norm (4.3) dominates its homogeneous version (4.4). Moreover, it is clear that

    Ave_u ‖R_u f‖^p_{Ḃ^{s+(d-1)/2}_{pp}} ≤ C ‖f‖^p_{R^s_{pp}}

(Ḃ^{s+(d-1)/2}_{pp} is the corresponding homogeneous Besov space). On the other hand, it is trivial to see that

    ‖f‖^p_{R^s_{pp}} ≤ C Ave_u ‖R_u f‖^p_{B^{s+(d-1)/2}_{pp}}.
That is the sense of our sloppy equivalence (4.5). In this thesis, we are mainly concerned
with the representation, the approximation or the estimation of functions of d variables that
are compactly supported, say, in the unit ball 𝔹_d. We should note that in this case the
distinction we have just made is irrelevant, as both norms (the homogeneous one and the
non-homogeneous one) are equivalent and, therefore, (4.5) holds in the usual sense of norm
equivalence.

Along the same lines, one could quickly remark that the space R^s_{22}(R^d)
is, indeed, the classical Sobolev space H^s(R^d) = B^s_{22}(R^d), since it easily follows from our
definition that ‖f‖_{R^s_{22}(R^d)} is an equivalent norm for H^s. One could go on at length about the
properties of the new spaces R^s_{pq}, investigating for instance embedding relationships or
interpolation properties, etc. However, the goal of this chapter is to show that the spaces
R^s_{pq} characterize objects with a special kind of inhomogeneities rather than to explore these
issues.
Our characterization highlights an interesting aspect: the condition does not require
any particular smoothness of the Radon transform along the directional variable u.
Example. We now consider a simple example to illustrate the difference between
traditional measures of smoothness and our new criteria. Let f(x) = 1_{x₁>0} (2π)^{-d/2} e^{-‖x‖²/2}.
From a classical point of view, this function has barely one derivative (the first derivative
is a singular measure). To quantify this idea, one can show that f ∈ B^s_{11} if, and only if,
s < 1. On the other hand, this object is quite smooth with regard to (4.3). In fact, f ∈ R^s_{11}
as long as s < 1 + (d - 1)/2. We now give a detailed proof of this claim, for we will make
extensive use of some details appearing in the argument.
Proposition 6

    f(x) = 1_{x₁>0} (2π)^{-d/2} e^{-‖x‖²/2} ∈ R^s_{11}  iff  s < 1 + (d - 1)/2.

Proof of Proposition. Using the external characterization that we developed (Remark
4) and writing σ = s + (d - 1)/2, it is sufficient to show that

    σ < d ⟹ Ave_u ‖R_u f‖_{B^σ_{11}} < ∞  and  σ ≥ d ⟹ Ave_u ‖R_u f‖_{Ḃ^σ_{11}} = ∞.
Let u be a unit vector. Next, let Q be an orthogonal matrix whose first two columns are u
and u′, where u′ is a unit vector, orthogonal to u and belonging to the span of {e₁, u} (e₁
being the first element of the canonical basis of R^d). We will choose u′ so that u′·e₁ ≥ 0.
Now, consider the change of coordinates defined by x′ = Q^T x. We then rewrite the Radon
transform as

    R_u f(t) = ∫_{u·x=t} f(x) dx = ∫_{x′₁=t} f(Qx′) dx′.

To simplify the notations, let γ denote the Gaussian density (γ(t) = (2π)^{-1/2} e^{-t²/2}) and Φ
be the cumulative distribution function, i.e. Φ(t) = ∫_{-∞}^t γ(t′) dt′. With these notations, we
have

    R_u f(t) = γ(t) Φ( (u·e₁) t / √(1 - (u·e₁)²) ).
(In case u = e₁, the RHS is replaced by γ(t) 1_{t>0}.) We now use the following lemma.
Lemma 5 Let u be uniformly distributed on the unit sphere. Then the density of u1 = u e1
is given by
Now, the following simple proposition will serve as a preliminary for a subsequent result.

Proposition 8 The space R^s_{pq}(Ω) equipped with the norm

(infimum taken over all g ∈ R^s_{pq}(R^d) in the sense of (4.7)) is a Banach space.

Proof of Proposition 8. It is clear that R^s_{pq}(Ω) is a linear space and that ‖f‖_{R^s_{pq}(Ω)} is a
norm. We need to check completeness. Let {f_n}_{n≥1} be a Cauchy sequence and assume
without loss of generality that

Then we have

    g = lim_{n→∞} g_n = Σ_{n=1}^∞ (g_{n+1} - g_n)

with
Remark. Let Ω₁ be a bounded domain such that Ω̄ ⊂ Ω₁. Suppose moreover that both
Ω and Ω₁ are nice in the sense that, say, their boundary is C^∞. In our definition (4.7)-(4.8),
one can restrict g ∈ R^s_{pq}(R^d) to be such that

Doing so, we obtain the same space R^s_{pq}(Ω) and an equivalent quasi-norm. The reason is
that the multiplication of an element g ∈ R^s_{pq}(R^d) by a fixed window w ∈ C_0^∞ is a bounded
operation from R^s_{pq}(R^d) to itself. We will prove this later. In the application we have in
mind, we shall consider the unit ball 𝔹_d as our domain Ω and 2𝔹_d as Ω₁.
In the previous section, we introduced a new scale of functional classes and presented some
basic properties. In this section, we will give a more intuitive characterization of these
spaces: we show how they model objects having a very special kind of singularities. To
describe these singularities, we introduce a definition due to Donoho (1993).

Definition 6 Let α > -1/2. We say that σ : R → R is a normalized singularity of degree
α if σ(t) is C^R on R∖{0}, where R = max(2, 2 + ⌈α⌉), and

    |σ(t)| ≤ |t|^α for all t, and |d^m σ/dt^m (t)| ≤ (m + ⌈α⌉)! |t|^{α-m},  t ≠ 0,  m = 1, 2, …, R.
A normalized singularity is a C^R function away from the origin which may or may not be singular at the
origin. Following Donoho (1993), we list a few examples of normalized singularities: |t|^α,
|t|^α 1_{t>0}, |t|^α w(t) where w(t) is a "nice" smooth function properly normalized, etc.
Additional examples are given in the reference cited above.

We make use of the definition to construct a class of functions whose typical elements
are of the form σ(u·x - t).

Definition 7

    SH^α = { Σ_i a_i σ_i(u_i·x - t_i) : Σ_i |a_i| ≤ 1, each σ_i a normalized singularity of degree α }.    (4.9)
The proof of the theorem is fairly involved; it requires several steps and composes the
remainder of this section. At the heart of the argument is the atomic decomposition
of the space R^s_{11}.

Moreover, the result is sharp in the sense that for s < d/2, there are elements in R^s_{11}(𝔹_d)
that are not square integrable.

Proof of Lemma. It is sufficient to prove the desired norm inequality (4.11) for g ∈ R^s_{11}(R^d)
such that supp g ⊂ 2𝔹_d. Indeed, let f be in R^s_{11}(𝔹_d); then for any g ∈ R^s_{11}(R^d) such that
g|_{𝔹_d} = f and supp g ⊂ 2𝔹_d, we have

    ‖f‖₂ ≤ ‖g‖₂ ≤ C ‖g‖_{R^s_{11}(R^d)},

and taking the infimum over the g's would give the desired result.
Step 1. We first prove the result in the case where g ∈ L². We know that

Consequently,

    ‖g‖₂² ≤ C ∫ { ‖R_u g ⋆ φ‖₁ ‖R_u g ⋆ φ‖₂ + Σ_{j≥0} 2^{j(d-1)} 2^{j/2} ‖R_u g ⋆ ψ_j‖₁ ‖R_u g ⋆ ψ_j‖₂ } du.

But for any u ∈ S^{d-1}, we have ‖R_u g ⋆ φ‖₂ ≤ C‖g‖₂ and ‖R_u g ⋆ ψ_j‖₂ ≤ C‖g‖₂ for some
positive constant C; see for example Candès (1996). Thus the inequality becomes

    ‖g‖₂ ≤ C ∫ { ‖R_u g ⋆ φ‖₁ + Σ_{j≥0} 2^{j(d-1)} 2^{j/2} ‖R_u g ⋆ ψ_j‖₁ } du.

The last inequality holds for any g satisfying the conditions specified above and, therefore,
we can conclude to the validity of (4.11).
Step 2. We finish the proof by density. Suppose g is in R^s_{11}(R^d) and that supp g ⊂ 2𝔹_d;
we show that

So, let Δ be a non-negative, radial, infinitely differentiable and compactly supported func-
tion on R^d such that ∫Δ = 1. Further, let τ be its Radon transform (because of spherical
symmetry, the Radon transform does not depend on u). We define Δ_ε to be ε^{-d} Δ(·/ε) and
similarly τ_ε = ε^{-1} τ(·/ε). We first show that

We have R_u(g ⋆ Δ_ε) ⋆ ψ_j = R_u g ⋆ τ_ε ⋆ ψ_j (and similarly for the coarse scale, i.e. the convolution
with φ). Now, it is immediate to check that for any j, u and b, (R_u g ⋆ τ_ε ⋆ ψ_j)(b) →
(R_u g ⋆ ψ_j)(b) as ε → 0 (and similarly when ψ_j is replaced by φ). Further, we show that
lim_{ε→0} ‖g ⋆ Δ_ε‖_{R^s_{11}(R^d)} = ‖g‖_{R^s_{11}(R^d)}; in other words
    lim_{ε→0} Ave_u { ‖R_u g ⋆ τ_ε ⋆ φ‖₁ + Σ_{j≥0} 2^{js} 2^{j(d-1)/2} ‖R_u g ⋆ τ_ε ⋆ ψ_j‖₁ }
      = Ave_u { ‖R_u g ⋆ φ‖₁ + Σ_{j≥0} 2^{js} 2^{j(d-1)/2} ‖R_u g ⋆ ψ_j‖₁ }.    (4.14)
By Fatou's lemma, we get that lim inf_{ε→0} ‖g ⋆ Δ_ε‖_{R^s_{11}(R^d)} ≥ ‖g‖_{R^s_{11}(R^d)}. Conversely, since

    ‖g ⋆ Δ_ε‖₂ ≤ C ‖g ⋆ Δ_ε‖_{R^s_{11}},

we deduce that {g ⋆ Δ_ε} is a Cauchy sequence in L² and therefore converges in
L². On the other hand, it is trivial that g ⋆ Δ_ε converges to g in L¹. Then g ⋆ Δ_ε converges
to g in L² and, therefore, we have proved (4.12). The lemma follows.
This lemma has a rather useful corollary.
Corollary 3 The space R^s_{11}(𝔹_d) equipped with the norm

    ‖f‖_{R^s_{11}(𝔹_d)} = inf{ ‖g‖_{R^s_{11}(R^d)} : g ∈ R^s_{11}(R^d) with g|_{𝔹_d} = f }    (4.15)

is a Banach space.
Proof of Corollary. For compactly supported functions, the L² norm dominates the L¹ norm.
Now, from the previous lemma it is clear that the norm (4.15) and the one of Proposition 8
are equivalent. This proves the claim.

Actually, it is not hard to see that one can extend Lemma 6 and its corollary. For p ≤ 2,
we have R^s_{pq}(𝔹_d) ⊂ L²(𝔹_d) as long as s > 1/p - 1/2 (the argument being essentially the
same as the one spelled out in the proof of Lemma 6), with continuous injection. Moreover,
for p > 2, the same is true as long as, this time, s > 0. Although we do not prove these
results here, we will use them in Chapter 5.
Proof of Lemma. Let w be in C_0^∞(2𝔹_d) such that its restriction to the unit ball is 1
(w|_{𝔹_d} = 1). It is enough to show that, say, the set {w(x)φ(u·x - b), w(x) 2^{-js} 2^{-j(d-1)/2} ψ_j(u·x - b)}
is bounded in R^s_{11}(R^d). The analysis is greatly simplified if one replaces the window w by
the d-dimensional Gaussian density γ_d; in fact, it is sufficient to prove the lemma with the
Gaussian window. Again, the reason is that the multiplication of an element of R^s_{11}(R^d)
by a fixed C_0^∞ function is a bounded operation.

Lemma 8 Suppose that f ∈ R^s_{11} and let ω be a C^∞ function which is bounded together with its
derivatives. Then there exists a constant c such that

We postpone the proof of this intuitive lemma. As we shall see in the next chapter, we can
prove Lemma 7 directly without the help of the intermediate Lemma 8.
End of proof of Lemma 7. The norm ‖γ_d(x) ψ_j(u·x - b)‖_{R^s_{11}} being clearly invariant under
rotation, we will without loss of generality assume that u = e₁. Finally, g will denote
R_{u′}{γ_d(·) ψ_j(u·x - b)}. Following the discussion presented in the last chapter, it is sufficient to show that
there exists a positive constant C such that

In the remainder of the proof, σ will stand for the quantity s + (d - 1)/2. Reproducing the
calculations from the last chapter, we rapidly get

    g(t) = γ(t) ∫ γ(y) 2^j ψ(2^j(u₁ t - √(1 - u₁²) y - b)) dy.

We recall that v denotes the ratio u₁/√(1 - u₁²), so that √(1 - u₁²) = (1 + v²)^{-1/2}.

Case 1. a = 2^j(1 + v²)^{-1/2} ≤ 1. Then g can be written as a convolution, and Young's
inequality gives

    ‖f ⋆ g‖_{B^σ_{11}} ≤ ‖f‖_{B^σ_{11}} ‖g‖₁.
This yields ‖g‖_{B^σ_{11}} ≤ C ‖ψ‖_{B^σ_{11}} ≤ C, and ‖g‖_{B^σ_{11}} ≤ C (1 + v²)^{1/2} (1 + v^{-1}).
We have

    |g^{(k)}(t)| ≤ C ( ∫_{|t-y|>|t|/2} γ(y)(1 + |t - y|)^{-m} dy + ∫_{|t-y|≤|t|/2} γ(y)(1 + |t - y|)^{-m} dy )
      ≤ C ( ∫_{|t-y|>|t|/2} γ(y)(1 + |t|/2)^{-m} dy + ∫_{|t-y|≤|t|/2} γ(y) dy )
      ≤ C ( (1 + |t|/2)^{-m} + ∫_{|y|>|t|/2} γ(y) dy ).

In the case where a ≤ 1, the last expression implies that |g^{(k)}(t)| ≤ C(1 + |t|)^{-m}. Therefore,
g ∈ S(R) and is bounded in almost every known space, and in particular in B^σ_{11}.
Now, the rescaling properties of Besov spaces give that for σ > 0, there exists some
positive constant C such that for |λ| ≥ 1,

    ‖f(λ·)‖_{B^σ_{11}} ≤ C |λ|^σ ‖f‖_{B^σ_{11}}

for any f ∈ B^σ_{11}. It is then obvious that, from g(t) = 2^j γ(t) g̃(2^j u₁ t), we can deduce
‖g‖_{B^σ_{11}} ≤ C 2^{j(σ+1)}. Hence,

    ≤ C ( ∫_{|v|≤2^j} ‖g‖_{B^σ_{11}} dv/(1 + v²)^{d/2} + ∫_{|v|>2^j} ‖g‖_{B^σ_{11}} dv/(1 + v²)^{d/2} )
    ≤ C ( ∫_{|v|≤2^j} (1 + |v|)^{σ-1} dv/(1 + v²)^{(d-1)/2} + 2^{jσ} ∫_{|v|>2^j} dv/(1 + v²)^{d/2} )
    ≤ C ( 2^{jσ} 2^{-j(d-1)} + 2^{jσ} 2^{-j(d-1)} ).

Now if we recall that σ = s + (d - 1)/2, we have proven (4.16). The story is the same for
the coarse-scale ridgelets φ(u·x - b), and thus the proof of the lemma is complete.
Theorem 8 (Atomic Decomposition of R^s_{11}(𝔹_d)) Suppose that f ∈ R^s_{11}(𝔹_d) for s ≥
d/2. Then there exist numerical sequences (λ_k), (j_k), (u_k), (b_k) such that

    f(x) = Σ_{k=1}^∞ λ_k 2^{-j_k s} 2^{-j_k (d-1)/2} ψ_{j_k}(u_k·x - b_k)    (4.17)

with convergence in R^s_{11}. Moreover,

    Σ_{k=1}^∞ |λ_k| ≤ 2(2π)^{-d} ‖f‖_{R^s_{11}}.

(We recall the convention we adopted in section 3.6.1: for j_k = -1, ψ_{j_k}(u_k·x - b_k) stands
for φ(u_k·x - b_k).)
We follow Frazier, Jawerth, and Weiss (1991), since our proof is a minor adaptation of
their argument. In order to prove our claim, we shall use a lemma from functional analysis
that can be obtained from the Hahn-Banach theorem:

Lemma 9 Let K be a closed, convex and bounded subset of a Banach space B over the
reals. If x does not belong to K, then there exists a continuous real-valued linear functional l
such that sup_{w∈K} l(w) < l(x).
Proof of the Theorem. So, let f be in R^s_{11}(𝔹_d) for s ≥ d/2. We then know that
f ∈ L¹ ∩ L²(𝔹_d) and that we have

    f = ∫∫ ⟨f(x), φ(u·x - b)⟩ φ(u·x - b) du db + Σ_{j≥0} 2^{j(d-1)} ∫∫ ⟨f(x), ψ_j(u·x - b)⟩ ψ_j(u·x - b) du db    (4.18)

where the equality has the following sense: S_0(f)(x) = ∫∫ ⟨f(x), φ(u·x - b)⟩ φ(u·x - b) du db ∈ L²;
for any j ≥ 0, Δ_j(f)(x) = 2^{j(d-1)} ∫∫ ⟨f(x), ψ_j(u·x - b)⟩ ψ_j(u·x - b) du db ∈ L²;
and

    S_J(f) = S_0(f) + Σ_{0≤j≤J-1} Δ_j(f) → f in L² as J → ∞.
In addition, the convergence also takes place in R^s_{11}(𝔹_d). To see that, suppose g ∈
R^s_{11}(R^d) is such that g|_{𝔹_d} = f. With the same notation as above, we have

    g = S_0(g) + Σ_{j≥0} Δ_j(g),

and

    ‖Δ_j(g)‖_{R^s_{11}(𝔹_d)} ≤ C 2^{js} 2^{j(d-1)/2} ∫∫ |w_j(g)(u, b)| du db,
where the last inequality is a consequence of Lemma 7. We then immediately have that

    ‖g - S_J(g)‖_{R^s_{11}(𝔹_d)} ≤ C Σ_{j>J} 2^{js} 2^{j(d-1)/2} ∫∫ |w_j(g)(u, b)| du db → 0 as J → ∞

by definition of the norm ‖·‖_{R^s_{11}(R^d)}. Since g (resp. S_J(g)) coincides with f (resp. S_J(f))
on the unit ball, the convergence is established.
We are now in position to finish up the proof of the theorem. Let A be the set defined
by

Now that we have been able to establish the atomic decomposition formula, the rest of the
proof is almost identical to the one of Donoho (1993). We reproduce it here for the sake of
completeness. We first suppose -1/2 < α ≤ 0. Define

    η = max( sup_{t≠0} sup_{0≤m≤R} |d^m ψ/dt^m (t)| ((m + ⌈α⌉)!)^{-1} |t|^{m-α}, ‖ψ‖₁ ).    (4.20)
Similarly, write φ(t) = η₀ σ₀(t), where again σ₀ is a normalized singularity of degree α so
that ‖σ₀‖ ≤ 1. From (4.19), one can now check that we may write

    f = Σ_k a_k σ_k(u_k·x - b_k)

where a_k = η λ_k if j_k ≥ 0 and a_k = η₀ λ_k if j_k = -1. Now

    Σ_k |a_k| = η₀ Σ_{k: j_k=-1} |λ_k| + η Σ_{k: j_k≥0} |λ_k| ≤ max(η, η₀) 2(2π)^{-d} ‖f‖_{R^{1+α+(d-1)/2}_{11}}.
    Ave_u ‖R_u(σγ_d) ⋆ φ‖₁ ≤ C  and  sup_{j≥0} 2^{j(α+d)} Ave_u ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C,

where σγ_d denotes the function σ(x₁ - b) γ_d(x).
The basic calculation that we have already used gives

    R_u(σγ_d)(t) = γ(t) ∫ γ(y) σ(u₁ t - (1 - u₁²)^{1/2} y - b) dy
                 = γ(t)(1 - u₁²)^{α/2} ∫ γ(y) σ̃( u₁ t/(1 - u₁²)^{1/2} - y - b/(1 - u₁²)^{1/2} ) dy
                 = γ(t)(1 + v²)^{-α/2} (γ ⋆ σ̃)( v t - (1 + v²)^{1/2} b ),

where we recall that v = u₁/(1 - u₁²)^{1/2}. Here, σ̃ is of course a normalized singularity of
degree α. Now, a calculation in Donoho (1993) shows that the homogeneous Besov norm ‖·‖_{Ḃ^{1+α}_{11}} is
uniformly bounded over all the normalized singularities of degree α. Hence, for |v| ≤ 2^j
we have

    ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C 2^{-j(1+α)}.

For |v| ≥ 2^j, the picture is slightly different. Let s be greater than α + d, e.g. take s = 1 + α + d,
which implies

    ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C 2^{-j(α+1+d)} (1 + v²)^{d/2}
because of a previous observation. Averaging our two estimates yields (recall that the
density of v is proportional to (1 + v²)^{-d/2})

    Ave_u ‖R_u(σγ_d) ⋆ ψ_j‖₁ ≤ C ( 2^{-j(1+α)} 2^{-j(d-1)} + 2^{-j(α+1+d)} 2^j ) ≤ C 2^{-j(α+d)},
which is what needed to be shown. Of course, we also have for the coarse scale (γ denoting
the one-dimensional Gaussian density)

    Ave_u ‖R_u(σγ_d) ⋆ φ‖₁ ≤ Ave_u ‖R_u(σγ_d)‖₁ ‖φ‖₁ ≤ Ave_u ‖σγ_d‖₁ ‖φ‖₁
                            = Ave_u ‖σ(· - b)γ‖₁ ‖φ‖₁ ≤ C ‖σ(· - b)γ‖₁ ≤ C.
Theorem 7 is now completely proved.
Chapter 5
Approximation
This chapter shows how we can apply the ridgelet transform to derive new approximation
bounds: that is, we derive both a lower bound and an upper bound of approximation for
our new functional classes. These two bounds essentially match and show that, in a
sense, the ridgelet dictionary is optimal for approximating functions from these classes.
might be represented as

    f = Σ_{γ∈Γ_d} ⟨f, ψ_γ⟩ ψ̃_γ = Σ_{γ∈Γ_d} ⟨f, ψ̃_γ⟩ ψ_γ.    (5.3)

To simplify, α_γ will denote the ridgelet coefficient ⟨f, ψ_γ⟩. Finally, f̃_N will denote the N-
term dual-ridgelet expansion where one only keeps the dual ridgelets corresponding to the
N largest ridgelet coefficients. That is,

    f̃_N = Σ_{γ∈Γ_d} α_γ 1_{|α_γ| ≥ |α|_{(N)}} ψ̃_γ.    (5.4)
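The thresholding rule (5.4) acts purely on the coefficient sequence. The sketch below (a hypothetical helper with generic, made-up coefficients) illustrates it, together with the way a power-law decay of the sorted coefficients translates into decay of the squared L² error of the N-term approximation.

```python
def n_term_threshold(coeffs, N):
    """Keep the N largest coefficients in magnitude and zero the rest,
    as in the N-term dual-ridgelet approximation (5.4)."""
    order = sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))
    keep = set(order[:N])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

# coefficients with power-law decay |alpha|_(k) ~ 1/k: the squared error
# of the N-term approximation is then the tail sum of 1/k^2 beyond N
coeffs = [(-1) ** k / (k + 1) for k in range(1000)]
approx = n_term_threshold(coeffs, 10)
err2 = sum((a - b) ** 2 for a, b in zip(coeffs, approx))
```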
f with a singleton: d₁(f, D) = 0." Thus when we say "reasonable dictionary," we have in
mind that one considers only sequences of dictionaries whose size grows polynomially in the
number of terms to be kept in the approximation (5.1)-(5.2).
Remark. In Chapter 4, we considered a very natural class SH^α of objects having a special
kind of singularities (Definition 7) and showed that they almost correspond to balls in R^s_{pq}
(Theorem 7); it follows from Theorems 7 and 9 that these classes have the same lower and
upper bounds, as the L² approximation rate (5.5)-(5.6) does not depend on the parameters
p and q. Further, our result implies that thresholding the dual-ridgelet expansion is optimal
for approximating objects from these classes.
    d_N(F, D^{(N)}) ≥ K′ N^{-r}.
In our situation, the construction of embedded hypercubes involves properties of the
frame. Roughly speaking, for a fixed scale j, a subsequence of the frame (ψ_γ)_{γ∈Γ_j}, properly
rescaled, is a hypercube of the desired dimension. In order to prove this fact, we first need
to establish a number of key estimates about the decay of the kernel ⟨ψ_γ, ψ_{γ′}⟩. We will
notice that some of these key estimates support a number of claims that were made
in Chapter 4.
(Recall the convention about our notations introduced in section 3.6.1: ψ_{-1}(u·x - b) stands
for φ(u·x - b).) Let Q be an orthogonal change of coordinates (x = Qx′) such that

    u₀·x = x′₁ and u·x = (u·u₀) x′₁ - √(1 - (u·u₀)²) x′₂.

Finally, we set v to be u·u₀ / √(1 - (u·u₀)²). (We recall that for a fixed u₀, and a uniform
distribution of u on the sphere, the density of v is proportional to (1 + v²)^{-d/2}.) With our
notations, the kernel (5.7) can be rewritten as

    K_w(γ, γ′) = ∫ ψ_j((1 + v²)^{-1/2}(v x′₁ - x′₂) - b) ψ_{j′}(x′₁ - b′) w(Qx′) dx′
               = ∫ ψ_j((1 + v²)^{-1/2}(v x′₁ - x′₂) - b) ψ_{j′}(x′₁ - b′) w_Q(x′₁, x′₂) dx′₁ dx′₂,

where w_Q(x′₁, x′₂) is of course ∫ w(Qx′) dx′₃ … dx′_d. It is trivial to see that w_Q belongs to
S(R²) and that for all Q, and any n₁, n₂, there is a constant C (depending on w, n₁, n₂,
and m) with

    |∂^{n₁+n₂} w_Q(x₁, x₂) / ∂x₁^{n₁} ∂x₂^{n₂}| ≤ C (1 + |x₁| + |x₂|)^{-2m} ≤ C (1 + |x₁|)^{-m} (1 + |x₂|)^{-m}.
In this section we will assume that φ and ψ satisfy a few standard conditions: namely,
both φ and ψ are R times differentiable, and for every nonnegative integer m and every
n ≤ R there is a constant C (depending only upon m and n) so that

    |d^n ψ(t)/dt^n| ≤ C (1 + |t|)^{-m},

and similarly for φ. Further, we will suppose that ψ has vanishing moments through order D.
Lemma 10 Assume j′ ≥ j and suppose n < min(R, D). Then, for each n, m ≥ 0, there is
a constant C (depending on n and m) so that:

(i) Suppose 2^j(1 + v²)^{-1/2} ≤ 1; then

    |K_w(γ, γ′)| ≤ C 2^{-j′n} 2^{-jn} (1 + v²)^{n+1/2} (1 + |b′|)^{-m} (1 + |v b′ - (1 + v²)^{1/2} b|)^{-m}.
Proof of Lemma.

Case (i). The change of variables x′₁ = x₁, x′₂ = u₁x₁ - u₂x₂ allows rewriting the kernel
as

    K_w(γ, γ′) = ∫_{R²} ψ_j(x′₂ - b) ψ_{j′}(x′₁ - b′) w_Q(x′₁, v x′₁ - (1 + v²)^{1/2} x′₂) dx′₁ dx′₂.

Now let w̃_Q be the function defined by w̃_Q(x′₁, x′₂) = w_Q(x′₁, v x′₁ - (1 + v²)^{1/2} x′₂). It
is fairly clear that

    |∂^{n₁+n₂} w̃_Q(x′₁, x′₂) / ∂x′₁^{n₁} ∂x′₂^{n₂}| ≤ C (1 + v²)^{(n₁+n₂)/2} (1 + |x′₁| + |v x′₁ - (1 + v²)^{1/2} x′₂|)^{-2m},
where again, the constant C does not depend on Q. Let n be an integer. By assumption,
the wavelet ψ is of class C^R for R > n and has at least n vanishing moments. Then, from
standard wavelet estimates, it follows that one can find a constant C such that

    |K_w(γ, γ′)| ≤ C 2^{-j′n} 2^{-jn} (1 + v²)^{n+1/2} (1 + |b′| + |v b′ - (1 + v²)^{1/2} b|)^{-2m},

so that K_w(γ, γ′) = ∫ ψ(2^{j′}(x₁ - b′)) g(x₁) dx₁. For any n, there is a constant C such that
for any ℓ ≤ n
    |d^ℓ/dx₁^ℓ ψ_j(u₁x₁ - u₂x₂ - b)| ≤ C 2^{j(ℓ+1)} (1 + 2^j |u₁x₁ - u₂x₂ - b|)^{-m}

and

    |∂^{n-ℓ} w(x₁, x₂) / ∂x₁^{n-ℓ}| ≤ C (1 + |x₁| + |x₂|)^{-2m} ≤ C (1 + |x₁|)^{-m} (1 + |x₂|)^{-m}.

Now the result from Chapter 4 guarantees that

    |d^n g(x₁)/dx₁^n| ≤ C 2^{j(n+1)} (1 + |x₁|)^{-m} (1 + 2^j |u₁x₁ - b|)^{-m}.
And again standard wavelet estimates give in this case:

    |K_w(γ, γ′)| = | ∫ ψ_{j′}(x₁ - b′) g(x₁) dx₁ |.
Proof of Corollary. The proof is absolutely straightforward, as one just needs to integrate
the upper estimates obtained in Lemma 10.

Corollary 5 Let j′ ≥ j and suppose n < min(R, D). For 2n > d/p - 1, there is a constant
C (depending on n and p) so that

    ( ∫∫ |K(γ, γ_{j′,u′,b′})|^p db′ du′ )^{1/p} ≤ C 2^{-(j′-j)n} 2^j 2^{-jd/p}  and
    ( ∫∫ |K(γ_{j′,u′,b′}, γ)|^p db′ du′ )^{1/p} ≤ C 2^{-(j′-j)n} 2^j 2^{-jd/p},

which establishes the first result. The second result is similar in every point.
From Corollary 5, one can deduce the result announced in Chapter 4 (Lemma 7).

Proposition 9 Let s > 0 and 0 < p, q ≤ ∞. For any j ≥ 0, u ∈ S^{d-1}, b ∈ R, if min(R, D)
is large enough, then

    ‖w(x) ψ_j(u·x - b)‖_{R^s_{pq}} = ( ∫∫ |K_w(γ, γ_{-1,u′,b′})|^p db′ du′ )^{1/p}
        + ‖ { 2^{j′s} 2^{j′(d-1)/2} ( ∫∫ |K_w(γ, γ_{j′,u′,b′})|^p db′ du′ )^{1/p} }_{j′≥0} ‖_{ℓ^q}
at these points don't overlap? The maximum number of points we can distribute is called
the packing number. Again, there is a considerable literature on this matter (Conway and
Sloane, 1988) to which the reader can refer. In the sequel, we shall only make use of trivial
facts about this packing problem.

The purpose of this section is to establish a technical fact guaranteeing that
a certain functional class cannot be approximated by finite linear combinations of a given
dictionary D faster than a certain rate. It is important to note that the proof of this fact
(or, in other words, of the existence of lower bounds of approximation) does not need to be
constructive. This observation greatly simplifies our argument.
For a fixed $j$, let $S_j$ be a set of points $\{u_i\}$ on the sphere satisfying the following
properties.

(i) $\forall\, u_i, u_{i'} \in S_j$, $\|u_i - u_{i'}\| \ge 2^{-(j-j_0)}$.

(ii) $|S_j| \ge B_1\, 2^{(j-j_0)(d-1)}$.

(iii) For any $u \in S^{d-1}$, and all $0 \le \ell \le j - j_0$,
\[
\bigl| \{ u_i : 2^{\ell-1} \le |u \cdot u_i|\, (1 - (u \cdot u_i)^2)^{-1/2} \le 2^{\ell} \} \bigr| \le B_2\, 2^{(j-j_0)(d-1)}.
\]
In the above expression, the constants $B_1$ and $B_2$ can be chosen not to depend on $j$ and $j_0$.
We remark that the first property (i) implies that $|v_{ii'}| \ge 2^{j-j_0}$ only if $i = i'$.
Indeed, let $v_{ii'} = u_i \cdot u_{i'}\, (1 - (u_i \cdot u_{i'})^2)^{-1/2}$. Suppose for instance that $v_{ii'} \ge 2^{j-j_0}$. We have
\[
\|u_{i'} - u_i\|^2 = 2 \Bigl( 1 - \frac{v_{ii'}}{(1 + v_{ii'}^2)^{1/2}} \Bigr)
= 2\, \frac{1}{(1 + v_{ii'}^2)^{1/2}\, \bigl(v_{ii'} + (1 + v_{ii'}^2)^{1/2}\bigr)}.
\]
Therefore, $v_{ii'} \ge 2^{j-j_0}$ implies $\|u_{i'} - u_i\| < 2^{-(j-j_0)}$. From (i), it follows that this is
equivalent to $i = i'$. The argument is identical in the case $v_{ii'} \le -2^{j-j_0}$.
To simplify the analysis, suppose $\psi \in \mathcal{S}(\mathbf{R})$ is compactly supported, $\operatorname{supp} \psi \subset [-1/2, 1/2]$,
and has a sufficiently large number of vanishing moments. We normalize $\psi$ such that
$\|\psi\|_2 = 1$. Further, let $w \in C_0^\infty$ be a radial window such that $0 \le w \le 1$ and $w(x) = 1$
for any $x$ with $\|x\| \le \sqrt{3}/2$. We now consider the set $A_j$ of windowed ridgelets at scale $j$
Proof of Lemma. The norm of $f_{ik}$ being clearly invariant under rotation ($w$ radial), one can
CHAPTER 5. APPROXIMATION 71
\[
\|f_{ik}\|_2^2 \ge \int_{|x_1| \le \sqrt{2}/2} 2^{j}\, \psi^2\bigl(2^{j} (x_1 - k 2^{-j})\bigr)\, dx_1 \int_{x_2^2 + \cdots + x_d^2 \le (1/2)^2} dx_2 \cdots dx_d
= \|\psi\|_2^2\, c_d = c_d,
\]
where $c_d$ might be chosen to be the volume of a $d-1$ dimensional ball of radius $1/2$. This
proves (i).
Before proceeding further, observe that if $0 < \delta \le 1$, $x \in \mathbf{R}$, $y \in \mathbf{R}$, and $\epsilon > 0$, we
have
\[
\sum_{k \in \mathbf{Z}} (1 + |x - \delta k|)^{-1-\epsilon}\, (1 + |y - \delta k|)^{-1-\epsilon} \le C\, \delta^{-1}\, (1 + |y - x|)^{-1-\epsilon}. \tag{5.11}
\]
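The discrete convolution inequality (5.11) can be checked numerically. The sketch below (with $\epsilon = 1/2$ and an arbitrary truncation of the sum over $k$) verifies that the ratio of the two sides stays bounded uniformly over a few values of $x$, $y$ and $\delta$:

```python
import numpy as np

def conv_sum(x, y, delta, eps=0.5, K=200000):
    """Left-hand side of (5.11): sum over k of the two localized factors."""
    t = delta * np.arange(-K, K + 1)
    return np.sum((1 + np.abs(x - t)) ** (-1 - eps) * (1 + np.abs(y - t)) ** (-1 - eps))

# LHS * delta * (1 + |y - x|)^{1+eps} should stay bounded, uniformly in
# x, y and delta, exactly as the inequality predicts.
ratios = [conv_sum(x, y, delta) * delta * (1 + abs(y - x)) ** 1.5
          for x in (-7.3, 0.0, 4.1)
          for y in (-2.0, 5.5)
          for delta in (1.0, 0.1, 0.01)]
print(max(ratios))
```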
for some new constant $C_2(d)$, depending only on $d$ and $w$. Summing over $u_{i'}$ ($u_{i'} \ne u_i$)
where again $C_4(d)$ is a new constant $C(d, w)$. (Notice that we have sacrificed exactness
for synthetic notation: in the second line of the array, read $|\{u_{i'} : 0 \le |v_{ii'}| \le 1\}|$ instead
of $|\{u_{i'} : 2^{\ell-1} \le |v_{ii'}| \le 2^{\ell}\}|$ when the index $\ell$ equals $0$.) Therefore, by choosing $j_0$ large
enough, one can make sure that the quantity $C_d\, 2^{-j_0}\, 2^{d}$ is dominated by $c_d$, which proves
(ii).
Lemma 12 Let $\mathcal{C}$ be the parallelepiped defined by
\[
\mathcal{C} = \Bigl\{ f : f = \sum_{ik} \epsilon_{ik} f_{ik}, \quad |\epsilon_{ik}| \le 1 \Bigr\}. \tag{5.12}
\]
Finally, so as to simplify notation, we show the result for $p = 1$, as the argument for $p \ne 1$ is
absolutely parallel; we will just indicate where to modify the proof to handle this case. We
have
\[
|R_{j'}(u', b')(f w)| = \Bigl| \sum_{ik} \epsilon_{ik} \int f_{ik}(x)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, dx \Bigr|
\le \sum_{ik} |\epsilon_{ik}|\, \Bigl| \int 2^{j/2} \psi\bigl(2^{j} (u_i \cdot x - k 2^{-j})\bigr)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, w(x)\, dx \Bigr|
\]
\[
\le \sum_{ik} \Bigl| \int 2^{j/2} \psi\bigl(2^{j} (u_i \cdot x - k 2^{-j})\bigr)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, w(x)\, dx \Bigr|.
\]
To be consistent with the preceding section and to simplify the notation, let $K_w(\lambda_{ik}, \lambda')$
be defined as
\[
K_w(\lambda_{ik}, \lambda') = \int 2^{j} \psi\bigl(2^{j} (u_i \cdot x - k 2^{-j})\bigr)\, 2^{j'} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr)\, w(x)\, dx.
\]
(For consistency, we want to bring to the reader's attention that the
normalization has changed, i.e. $2^{j/2}$ has been replaced by $2^{j}$.) Once again we apply our
fundamental estimates. Let $v_i$ be $u_i \cdot u' / (1 - (u_i \cdot u')^2)^{1/2}$. Suppose for now that $j' \ge j$; we
have, for $v_i \le 2^{j}$,
\[
|K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-j'n}\, 2^{-jn}\, (1 + v_i^2)^{n+1/2}\, (1 + |b'|)^{-2}\, \bigl(1 + |v_i b' - (1 + v_i^2)^{1/2} k 2^{-j}|\bigr)^{-2}
\]
and, similarly, for the interval $0 \le |v| \le 1$. Repeating the argument of the previous lemma
gives
\[
\sum_{i : |v_i| \le 2^{j-j_0}} \sum_k |K_w(\lambda_{ik}, \lambda')|
\le C\, 2^{-j'n}\, 2^{-j(n-1)}\, B\, 2^{(j-j_0)(d-1)} \sum_{\ell=0}^{j-j_0} 2^{\ell(2n-(d-1))}\, (1 + |b'|)^{-2}
\le C\, 2^{-j_0 n}\, 2^{jn}\, 2^{j}\, (1 + |b'|)^{-2}.
\]
Now for $v_i \ge 2^{j-j_0}$, since our estimate for $v_i \ge 2^{j}$ dominates the one for $v_i \le 2^{j}$, one can
check that
\[
|K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-j'n}\, 2^{j(n+1)}\, (1 + |b'|)^{-2}\, \Bigl( 1 + 2^{j} \Bigl| b' - \frac{(1 + v_i^2)^{1/2}}{v_i}\, k 2^{-j} \Bigr| \Bigr)^{-2}.
\]
Now, we know that $|v_i| \ge 2^{j-j_0}$ implies $\|u' - u_i\| \le 2^{-(j-j_0)}$ and, similarly, when the index
$i$ is replaced by $i'$. Suppose $S = \{ i : |v_i| \ge 2^{j-j_0} \} \ne \emptyset$ and let $i_0$ be one of its elements.
Then, since by the triangle inequality $|v_i| \ge 2^{j-j_0} \Longrightarrow \|u_{i_0} - u_i\| \le 2 \cdot 2^{-(j-j_0)}$, it follows that
the cardinality of $S$ can be bounded by a constant depending at most on the dimension $d$.
Therefore, we have
\[
\sum_{i : |v_i| \ge 2^{j-j_0}} \sum_k |K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-j'n}\, 2^{j(n+1)}\, (1 + |b'|)^{-2},
\]
giving
\[
\sum_k |K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-jn}\, 2^{j'(n+1)}\, (1 + |b'|)^{-2}.
\]
Again, summing over the $i$'s yields
\[
\sum_{i : |v_i| \le 2^{j_0}} \sum_k |K_w(\lambda_{ik}, \lambda')| \le C\, 2^{-(j-j')(n-d)}\, 2^{j}\, (1 + |b'|)^{-2}.
\]
In the case $p < 1$, we just need to replace $(1 + |b'|)^{-2}$ with $(1 + |b'|)^{-2/p}$ at each of its
occurrences in (5.13), as it follows from the same argument using, however, Lemma 10 with
sharper bounds. A direct consequence of the above inequalities together with the preceding
by $\delta^{-1}$. Recall the identity $G^{-1/2} = \pi^{-1} \int_0^\infty (G + \eta I)^{-1}\, \eta^{-1/2}\, d\eta$,
which gives
\[
\|(G + \eta I)^{-1}\|_{(\ell_1, \ell_1)} \le (1 + \eta)^{-1} \Bigl( \|I\|_{(\ell_1, \ell_1)} + \sum_{k \ge 1} (1 + \eta)^{-k}\, \|F^k\|_{(\ell_1, \ell_1)} \Bigr)
\le (1 + \eta)^{-1} \Bigl( 1 + \sum_{k \ge 1} (1 + \eta)^{-k}\, \|F\|^{k}_{(\ell_1, \ell_1)} \Bigr)
\le (1 + \eta)^{-1}\, \frac{1}{1 - \|F\|_{(\ell_1, \ell_1)}}.
\]
Finally,
\[
\|H\|_{(\ell_1, \ell_1)} \le \frac{1}{\pi} \int_0^\infty \bigl(1 - \|F\|_{(\ell_1, \ell_1)}\bigr)^{-1}\, (1 + \eta)^{-1}\, \eta^{-1/2}\, d\eta = \bigl(1 - \|F\|_{(\ell_1, \ell_1)}\bigr)^{-1}.
\]
By assumption we have $\|F\|_{(\ell_1, \ell_1)} \le 1 - \delta$, implying $\|H\|_{(\ell_1, \ell_1)} \le \delta^{-1}$, which is precisely
what needed to be proved.
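The bound just proved has a simple finite-dimensional analogue that can be checked numerically, here in spectral norm rather than the $(\ell_1, \ell_1)$ operator norm of the lemma; the dimension and $\delta$ below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta = 60, 0.2
# Symmetric perturbation F with ||F|| = 1 - delta < 1, so that G = I + F
# is positive definite with eigenvalues in [delta, 2 - delta].
A = rng.standard_normal((n, n))
F = (A + A.T) / 2
F *= (1 - delta) / np.linalg.norm(F, 2)
G = np.eye(n) + F
w, V = np.linalg.eigh(G)
H = V @ np.diag(w ** -0.5) @ V.T      # H = G^{-1/2}
bound = 1.0 / delta                   # the lemma's bound (1 - ||F||)^{-1}
print(np.linalg.norm(H, 2), bound)
```

In spectral norm the bound is even generous: the norm of $H$ is at most $\delta^{-1/2}$.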
We are now in a position to state the main theorem of this section.

Theorem 11 Suppose $1 \le p \le q \le \infty$. Let $R^s_{pq}(1)$ be the unit ball of $R^s_{pq}$. Then there
exists a hypercube of sidelength $\delta$ and dimension $m(\delta)$ embedded in $R^s_{pq}(1)$. That is, we
can find $m(\delta)$ orthogonal functions $g_i^m$, $i = 1, \ldots, m(\delta)$, with $\|g_i^m\|_2 = \delta$, such that
\[
\mathcal{H}(\delta, \{g_i\}) = \Bigl\{ f : f = \sum_i \theta_i g_i^m, \quad |\theta_i| \le 1 \Bigr\} \subset R^s_{pq}(1)
\]
and
\[
m(\delta) \ge K\, \delta^{-\frac{1}{s/d + 1/2}}.
\]
Proof of Theorem. The proof is a mere consequence of the three preceding preparatory
lemmas. In view of Theorem 10, this is exactly what needed to be shown to prove the first
part of Theorem 9.
The norm (5.14) is the discretized version of the continuous norm (4.3).
For compactly supported functions, we have the following lemma.

Lemma 14 There is a constant $C$, possibly depending on $s, p, q$, such that
\[
\|\gamma\|_{r^s_{pq}} \le C\, \|f\|_{R^s_{pq}}. \tag{5.15}
\]
We prove the lemma for the subset of triplets $(s, p, q)$ defined in the first paragraph of this
section.
Proof of Lemma. Without loss of generality, we may assume that $f$ is compactly supported. For a
frame $(\psi_\lambda)$ of $L^2$, let $A$ be the analysis operator defined by $Af = (\langle f, \psi_\lambda \rangle)_{\lambda \in \Lambda_d}$. To prove the
lemma, one writes
\[
\gamma_\lambda = \langle f w, \psi_\lambda \rangle = \int R_{-1}(u', b')(f)\, \langle \varphi(u' \cdot x - b'),\, w \psi_\lambda \rangle\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd} \int R_{j'}(u', b')(f)\, \bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w \psi_\lambda \bigr\rangle\, du'\, db'.
\]
In view of the Parseval relationship (Proposition 2), the above relationship is fully justified,
as membership to $R^s_{pq}$ for $s > d(1/p - 1/2)_+$ implies (for compactly supported functions)
membership to $L^2$ (see the last paragraph of section 4.2.1). Then, observe that
\[
\sum_{\lambda \in \Lambda_j} |\gamma_\lambda| \le \int |R_{-1}(u', b')(f)| \sum_{\lambda \in \Lambda_j} |\langle \varphi(u' \cdot x - b'),\, w \psi_\lambda \rangle|\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd} \int |R_{j'}(u', b')(f)| \sum_{\lambda \in \Lambda_j} \bigl|\bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w \psi_\lambda \bigr\rangle\bigr|\, du'\, db',
\]
where the terms corresponding to $-1 \le j' \le j$ and to $j' > j$ are handled separately; for the latter, one gets
\[
\sum_{j' > j} 2^{j'd}\, 2^{-(j'-j)(n+1/2)} \int |R_{j'}(u', b')(f)|\, du'\, db'.
\]
Then,
\[
2^{js}\, 2^{-jd/2} \sum_{\lambda \in \Lambda_j} |\gamma_\lambda|
\le \sum_{-1 \le j' \le j} 2^{-(j-j')(n - d/2 - s - 1/2)}\, 2^{j's}\, 2^{j'd/2} \int |R_{j'}(u', b')(f)|\, du'\, db'
\]
\[
+ \sum_{j' > j} 2^{-(j'-j)(n + d/2 - s - 1/2)}\, 2^{j's}\, 2^{j'd/2} \int |R_{j'}(u', b')(f)|\, du'\, db',
\]
and from there, it is then easy to see that for any $q > 0$
\[
\|\gamma\|_{r^s_{1q}} = \Bigl\| \Bigl( 2^{js}\, 2^{-jd/2} \sum_{\lambda \in \Lambda_j} |\gamma_\lambda| \Bigr)_j \Bigr\|_{\ell^q}
\le C\, \Bigl\| \Bigl( 2^{j's}\, 2^{j'd/2} \int |R_{j'}(u', b')(f)|\, du'\, db' \Bigr)_{j'} \Bigr\|_{\ell^q} = \|f\|_{R^s_{1q}}.
\]
(The cases where $q \le 1$ and $q = \infty$ are clear, and interpolation does the rest.)
Case $p = \infty$. In this case, we have
\[
|\gamma_\lambda| \le \int |R_{-1}(u', b')(f)|\, |\langle \varphi(u' \cdot x - b'),\, w(x) \psi_\lambda(x) \rangle|\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd} \int |R_{j'}(u', b')(f)|\, \bigl|\bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w(x) \psi_\lambda(x) \bigr\rangle\bigr|\, du'\, db'
\]
\[
\le \|R_{-1}(f)\|_{\infty} \int |\langle \varphi(u' \cdot x - b'),\, w(x) \psi_\lambda(x) \rangle|\, du'\, db'
+ \sum_{j' \ge 0} 2^{j'd}\, \|R_{j'}(f)\|_{\infty} \int \bigl|\bigl\langle 2^{j'/2} \psi\bigl(2^{j'}(u' \cdot x - b')\bigr),\, w(x) \psi_\lambda(x) \bigr\rangle\bigr|\, du'\, db'.
\]
Now again, it is clear that the above inequality gives for any $q > 0$
\[
\|\gamma\|_{r^s_{\infty q}} = \Bigl\| \Bigl( 2^{js}\, 2^{jd/2} \sup_{\lambda \in \Lambda_j} |\gamma_\lambda| \Bigr)_j \Bigr\|_{\ell^q}
\le C\, \Bigl\| \Bigl( 2^{j's}\, 2^{j'd/2}\, \|R_{j'}(f)\|_{\infty} \Bigr)_{j'} \Bigr\|_{\ell^q} = \|f\|_{R^s_{\infty q}}. \tag{5.16}
\]
Case p = q = 2. This case is a direct consequence of Theorem 6.
Extension. It seems reasonable to suspect that one can extend Lemma 14 by inter-
polation. By definition, the space $R^s_{pq}$ is a weighted space of the type $\ell^q(L^p)$: using the
notations of the lemma, one can rewrite the $R^s_{pq}$ norm of an object as
\[
\|f\|_{R^s_{pq}} = \bigl\| 2^{js}\, 2^{jd/2}\, \|R_j(f)\|_{L^p(S^{d-1} \times \mathbf{R})} \bigr\|_{\ell^q}
= \Biggl( \sum_{j \ge -1} \Bigl( 2^{j(s + d/2)} \Bigl( \int_{S^{d-1} \times \mathbf{R}} |R_j(u, b)|^p\, du\, db \Bigr)^{1/p} \Bigr)^{q} \Biggr)^{1/q}
\]
with the obvious modification in the case $p = \infty$ or/and $q = \infty$. (Again, there is a full analogy
with the Besov scale: replace $R_j(f)$ with $f * \varphi_j$ and the weights $2^{js}\, 2^{jd/2}$ with $2^{js}$.) Similarly,
the sequence space $r^s_{pq}$ is a weighted space of the type $\ell^q(\ell^p)$. From Lemma 14, we know
that for any $s, q > 0$, the operator $A$ is bounded from $R^s_{1q}$ (respectively, $R^s_{\infty q}$) to $r^s_{1q}$
(respectively, $r^s_{\infty q}$). Now, the sequence spaces $r^s_{pq}$ are clearly interpolation spaces with
well-known properties, and one expects the same interpolation properties to be true for the
spaces $R^s_{pq}$. If that were true, one would be able to prove Lemma 14 for the full range of
parameters $s > 0$, $1 \le p \le \infty$, $q > 0$ (at least when $R^s_{pq} \subset L^2$), since bounds at
the "corners" are already established. Work in this direction is in progress and we hope to
report on it shortly.

Further, we have learned that in the case where $p = q = 2$, the norms $\|\cdot\|_{r^s_{22}}$ and $\|\cdot\|_{R^s_{22}}$
are actually equivalent; therefore, this favors the conjecture we have just brought up and
leads to a more ambitious one: namely, the norm equivalence between $\|\cdot\|_{r^s_{pq}}$ and $\|\cdot\|_{R^s_{pq}}$.
Lemma 14 has a very interesting corollary. Let us consider a frame like in section 3.6.1
such that both $\varphi$ and $\psi$ are compactly supported and that $\psi$ has enough vanishing moments
so that Lemma 14 holds. The assumption of compact support will simplify the proof of the
corollary but is not a necessary hypothesis.
Corollary 6 Assume $s > d(1/p - 1/2)_+$ and let $p^*$ be defined by $1/p^* = s/d + 1/2$; then,
for $f$ in $R^s_{pq}$, we have
\[
\|\gamma\|_{w\ell^{p^*}} \le C\, \|f\|_{R^s_{pq}}.
\]
To simplify, $\gamma_\lambda$ will denote the ridgelet coefficient $\langle f, \psi_\lambda \rangle$. Finally, $\tilde{f}_N$ will denote the $N$-
term ridgelet expansion where one only keeps the terms corresponding to the $N$ largest
coefficients. That is,
\[
\tilde{f}_N(x) = \sum_{\lambda \in \Lambda_d} \gamma_\lambda\, 1_{\{|\gamma_\lambda| \ge |\gamma|_{(N)}\}}\, \tilde{\psi}_\lambda. \tag{5.19}
\]
We recall that for a sequence $(\theta_n)$, the weak-$\ell^p$ or Marcinkiewicz quasi-norm is defined
as follows. Let $|\theta|_{(n)}$ be the $n$th largest entry in the sequence $(|\theta_n|)$; we set
\[
|\theta|_{w\ell^p} = \sup_{n > 0}\, n^{1/p}\, |\theta|_{(n)}. \tag{5.20}
\]
Lemma 15 Let $0 < p < 2$ and suppose that the sequence of coefficients $(\gamma_\lambda)_{\lambda \in \Lambda_d}$ has a
bounded weak-$\ell^p$ norm. Then there is a constant $C$ (depending at most on $p$) such that,
with $r = 1/p - 1/2$,
\[
\|f - \tilde{f}_N\|_2 \le C\, N^{-r}\, |\gamma|_{w\ell^p}.
\]
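Lemma 15 is easy to illustrate numerically in the orthonormal case, where the $N$-term error is exactly the $\ell^2$ norm of the discarded coefficients. The sketch below uses a synthetic coefficient sequence of our choosing, with $p = 1$ and hence $r = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 1.0, 0.5                      # r = 1/p - 1/2
M = 20000
# Sequence with n-th largest entry exactly n^{-1/p}, in random order.
theta = rng.permutation(np.arange(1, M + 1) ** (-1.0 / p))

def weak_lp(seq, p):
    s = np.sort(np.abs(seq))[::-1]
    return np.max(np.arange(1, len(s) + 1) ** (1.0 / p) * s)

wl = weak_lp(theta, p)
s = np.sort(np.abs(theta))[::-1]
# Error of keeping the N largest coefficients decays like N^{-r}.
errs = {N: np.sqrt(np.sum(s[N:] ** 2)) for N in (10, 100, 1000)}
for N, err in errs.items():
    print(N, err, wl * N ** (-r))
```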
Then, the frame decomposition (6.7) tells us that, first, $SA$ is the identity of $L^2$, and,
second, $\tilde{K} = AS$ is the orthogonal projector onto the range of $A$. Again, let $\gamma = (\langle f, \psi_\lambda \rangle)$
be the ridgelet coefficients of $f$ and $\gamma^{(N)}$ be the truncated sequence of the $N$ largest
coefficients, i.e. $\gamma^{(N)}_\lambda = \gamma_\lambda\, 1_{\{|\gamma_\lambda| \ge |\gamma|_{(N)}\}}$. Then, of course, $\tilde{f}_N = S \gamma^{(N)}$. Now the frame
property gives
\[
\|f - \tilde{f}_N\|_2^2 \le B\, \|\gamma - \tilde{K} \gamma^{(N)}\|^2_{\ell_2(\Lambda_d)} \le B\, \|\gamma - \gamma^{(N)}\|^2_{\ell_2(\Lambda_d)}
\]
since the norm of $\tilde{K}$ is $1$. Now, Lemma 1 in Donoho (1993) allows us to conclude that
\[
|R_\lambda(f)| \le C\, 2^{-j(n+1/2)}\, (1 + v^2)^{(n-\alpha)/2}\, \bigl(1 + |(1 + v^2)^{1/2} k 2^{-j}|\bigr)^{-n},
\]
\[
|R_\lambda(f)| \le C\, 2^{-j(\alpha+1/2)}\, (1 + |k|)^{-n}.
\]
We are going to show that, for a nice frame, the ridgelet coefficients of $f$ are in $\ell^p$ for
any $p > 0$. So, let $p > 0$ be fixed. Our estimates imply that for $(1 + v^2)^{1/2} \ge 2^{j}$,
\[
\sum_k |R_\lambda(f)|^p \le C^p\, 2^{-jp(n+1/2)}\, (1 + v^2)^{p(n-\alpha)/2}\, 2^{j} (1 + v^2)^{-1/2}
\]
if $n$ is chosen large enough so that $(\alpha - n) p < -1$. Next we recall that for $s \ge 0$
\[
\sum_{i \in I_j : (1 + v_i^2)^{1/2} \le 2^{j}} (1 + v_i^2)^{s/2} \le C\, \max\bigl(2^{js}, 2^{j(d-1)}\bigr).
\]
(Note that for $\alpha \le -1/2$, $f$ is not square integrable.) Various extensions of results of this
kind are, of course, possible.
Chapter 6
The Case of Radial Functions
In Chapter 5 we have seen that ridgelets are optimal to represent functions that may
be singular across hyperplanes when there may be an arbitrary number of hyperplanes
with any orientations and locations. A natural question one may ask is: can we curve these
hyperplanes (singular sets)? In other words, how good are ridgelets at representing objects
that are singular across curved manifolds? To answer this question, the study of radial objects
is particularly attractive, for it is both enlightening and simple. We will give accurate degrees
of approximation of radial functions by ridgelets. Finally, we will establish a remarkable
result: the rates of approximation by ridgelets to radial functions which are smooth away
from spheres are identical to the rates achieved by wavelets.

The analysis will illustrate why ridge functions (neural networks) are not free from the
curse of dimensionality.

Again, we shall restrict our attention to the case of compactly supported objects.
so that $R_u f(t) = |S^{d-2}|\, (T_d \varphi)(t)$. The operator $T_d$ is an integral operator whose kernel is
$r (r^2 - t^2)^{(d-3)/2}$; from (6.2) one sees easily that for $t \ge 0$, we have
\[
(T_d \varphi)(\sqrt{t}) = \frac{1}{2} \int_t^1 (r - t)^{(d-3)/2}\, \varphi(\sqrt{r})\, dr.
\]
Therefore, if we let $U_d$ be the convolution operator defined by
\[
(U_d g)(t) = \int_t^1 (r - t)^{(d-3)/2}\, g(r)\, dr \tag{6.3}
\]
and $C$ be the change of coordinates $g(t) \mapsto g(\sqrt{t})$ ($C^{-1}$ corresponding of course to $g(t) \mapsto g(t^2)$), we have
\[
T_d \varphi = \tfrac{1}{2}\, C^{-1} U_d\, C \varphi.
\]
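The factorization $T_d = \tfrac{1}{2} C^{-1} U_d C$ can be sanity-checked numerically. For $d = 3$ and $\varphi \equiv 1$ (so that $f$ is the indicator of the unit ball in $\mathbf{R}^3$), the Radon transform at offset $t$ must be the area of the slicing disk, $\pi(1 - t^2)$; the quadrature scheme below is our own sketch, not a construction from the text:

```python
import numpy as np

def U(d, g, t, m=4000):
    """(U_d g)(t) = integral from t to 1 of (r - t)^((d-3)/2) g(r) dr (midpoint rule)."""
    r = np.linspace(t, 1.0, m + 1)
    mid = (r[:-1] + r[1:]) / 2
    return np.sum((mid - t) ** ((d - 3) / 2) * g(mid)) * (1.0 - t) / m

def T(d, phi, t):
    # T_d phi = (1/2) C^{-1} U_d C phi, with (C g)(t) = g(sqrt(t))
    return 0.5 * U(d, lambda r: phi(np.sqrt(r)), t * t)

# For d = 3, |S^{d-2}| = |S^1| = 2*pi.
radon = {t: 2 * np.pi * T(3, lambda r: np.ones_like(r), t) for t in (0.0, 0.3, 0.7)}
print(radon)
```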
Now, suppose that we first look at functions vanishing in a neighborhood of the origin.
For instance, let $\eta > 0$ be such that $\varphi|_{[0, \eta]} = 0$. We have for $1 \le p, q \le \infty$
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}} \le C\, \|\varphi\|_{B^s_{pq}},
\]
where, in addition to $\eta$, the constant $C$ depends on $s, p, q$. To see this, observe that for
$\operatorname{supp} g \subset [\eta, 1]$, the operator $C : g(t) \mapsto g(\sqrt{t})$ is bounded from $B^s_{pq}(0, 1)$ to itself for any
choice of $s, p, q$. In addition, $C^{-1} : g(t) \mapsto g(t^2)$ is bounded from $B^s_{pq}(0, 1)$ to $B^s_{pq}(-1, 1)$,
again for any possible triplet $s, p, q$. The only thing remaining to prove is that $U_d$ is bounded
from $B^s_{pq}(0, 1)$ to $B^{s + (d-1)/2}_{pq}(0, 1)$. This is not very difficult and is shown in the Appendix.

As a matter of fact, we probably do not need the additional assumption that $\varphi$ vanish in
a neighborhood of the origin.
Proposition 11 Suppose $\varphi$ is an even function such that $\varphi \in B^s_{pq}(-1, 1)$. Then $T_d \varphi \in B^{s + (d-1)/2}_{pq}(-1, 1)$ and
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}(-1, 1)} \le C\, \|\varphi\|_{B^s_{pq}(-1, 1)} \tag{6.4}
\]
for some constant $C$ depending at most on $s, p, q$ and the dimension $d$. Moreover, suppose
in addition that $\varphi$ vanishes in $[-\eta, \eta]$; then we have the reverse inequality
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}(-1, 1)} \ge C\, \|\varphi\|_{B^s_{pq}(-1, 1)} \tag{6.5}
\]
and, therefore,
\[
\|T_d \varphi\|_{B^{s + (d-1)/2}_{pq}(-1, 1)} \asymp \|\varphi\|_{B^s_{pq}(-1, 1)}
\]
in the sense of equivalent norms. Now suppose further that $f$ is compactly supported, say,
in the interval $[-1, 1]$. For $p \ge 1$, we have the Poincaré inequality, which states that
\[
\|f\|_p \le C(p)\, \|f'\|_p,
\]
\[
\|T_3 \varphi\|_{B^{s+1}_{pq}} \le C\, \|\varphi\|_{B^s_{pq}},
\]
\[
\frac{d}{dt}\, (T_{d+2} \varphi)(t) = -t\, (T_d \varphi)(t) \quad \text{for } t \ne 0.
\]
Therefore, the same argument as the one for $d = 3$ yields (6.4) for $d + 2$.
We turn to the proof of the lower bound (6.5). Let $\chi$ be a $C^\infty$ function such that
$0 \le \chi \le 1$, $\chi = 1$ whenever $|t| \ge 1$, and $\chi = 0$ for $|t| \le 1/2$ (the notation $\chi_\eta$ will
stand for $\chi(\cdot / \eta)$). Now, there are well-known inversion formulas for the Radon transform.
In particular, when the function $f$ is radial ($f(x) = \varphi(\|x\|)$), it is easily established that
(Natterer, 1986) the inversion involves iterates of an operator $L$ of the form $Lg = -t^{-1}\, \frac{d}{dt} g$, for which
\[
\|L g\|_{B^{\sigma - 1}_{pq}} \le C\, \|g\|_{B^{\sigma}_{pq}}.
\]
To see this, recall that the differentiation operator $\frac{d}{dt}$ is a bounded map from $B^{\sigma}_{pq}$ to
$B^{\sigma - 1}_{pq}$ for any $-\infty < \sigma < \infty$, as can be obtained as a corollary of two theorems from
Triebel (1983, pages 57-61). Further, the multiplication by $1/t$ is of course a bounded
operation in $B^{\sigma - 1}_{pq}$ restricted to its elements supported in, say, $\{\eta/4 \le |t| \le 1\} \supset \operatorname{supp} g$.
By induction, we get that
But of course, it follows from (6.6) that for $|t| \ge \eta/2$, we have $L^{(d-1)/2} g(t) = \varphi(t)$, implying
$\varphi = L^{(d-1)/2} g$. Therefore, to summarize, we have
where the last inequality comes from the first part of the proposition. This completes the
proof of the proposition.
supported. Following Chapter 2, we will denote by $\psi_\lambda$ the elements of the frame, for $\lambda \in \Lambda_d$.
We know that any element of, say, $L^2$ might be represented as
\[
f = \sum_{\lambda \in \Lambda_d} \langle f, \psi_\lambda \rangle\, \tilde{\psi}_\lambda = \sum_{\lambda \in \Lambda_d} \langle f, \tilde{\psi}_\lambda \rangle\, \psi_\lambda. \tag{6.7}
\]
To simplify, $\gamma_\lambda$ will denote the ridgelet coefficient $\langle f, \psi_\lambda \rangle$. Finally, let $\tilde{f}_N$ be the $N$-term
ridgelet expansion where one only keeps the terms corresponding to the $N$ largest coefficients
(5.19). We have the following theorem:
Theorem 12 Let $f$ be a radial function supported in the unit ball, $f(x) = \varphi(\|x\|)$, such that
$\varphi \in B^s_{pq}$. Suppose that $s > d(1/p - 1/2)_+$. Then we have the $L^2$ error of approximation
\[
\|f - \tilde{f}_N\|_2 \le C\, N^{-s/d}\, \|\varphi\|_{B^s_{pq}}. \tag{6.8}
\]
Remark 1. In fact, the result (6.8) is probably sharp: that is, for any $s' > s$, there must
be a function $\varphi \in B^s_{pq}$ with, say, $\|\varphi\|_{B^s_{pq}} = 1$ such that
\[
\|f - \tilde{f}_N\|_2 \ge c\, N^{-s'/d} \quad \text{for infinitely many } N. \tag{6.9}
\]
At the end of this section, we will argue that there is some strong evidence supporting this
conjecture. This suggests that neural networks are also subject to the curse of dimension-
ality: to obtain approximation to $f$ at rates $N^{-r}$, we must assume that $f$ is $r \cdot d$ times
differentiable.

Remark 2. We remark that the rate of convergence is identical to the one obtained
using truncated wavelet expansions. This remarkable fact will be examined further in the
discussion section.
Proof of Theorem. We start by proving the upper bound. Notice that by construction
the collection is a frame of $L^2[-1, 1]$. Now, for a univariate function $g : \mathbf{R} \to \mathbf{R}$, let $\beta_{jk} = \langle g, \psi_{jk} \rangle$ and
$\alpha_k = \langle g, \varphi_k \rangle$. Although we do not give a proof here, the quantity
\[
\|g\|_{b^s_{pq}} = \|(\alpha_k)_k\|_{\ell^p} + \Bigl\| \Bigl( 2^{js}\, 2^{j(1/2 - 1/p)} \Bigl( \sum_k |\beta_{jk}|^p \Bigr)^{1/p} \Bigr)_{j \ge 0} \Bigr\|_{\ell^q} \tag{6.10}
\]
is an equivalent norm to the norm of $g$ in $B^s_{pq}$, for any $g$ supported in $[-1, 1]$.

As we suppose that $f(x) = g(\|x\|)$ is supported in the unit ball, we recall that,
for a $g$ in $B^s_{pq}$, its image $T_d g$ is in $B^{s + (d-1)/2}_{pq}$; in addition, its norm
$\|T_d g\|_{b^{s + (d-1)/2}_{pq}}$ is bounded by $\|g\|_{B^s_{pq}}$. Further, it is fairly clear that for any $q > 0$,
\[
\|T_d g\|_{b^{s + (d-1)/2}_{p\infty}} \le \|T_d g\|_{b^{s + (d-1)/2}_{pq}} \le C\, \|g\|_{b^s_{pq}}.
\]
Following Donoho (1993), one may bound the number of coefficients, at a given scale $j$,
whose absolute values exceed a certain threshold $\epsilon > 0$ as displayed below:
Thus, the number of ridgelet coefficients at scale $j$ that exceed $\epsilon > 0$ in absolute value is
bounded as follows:
since the number of directions $u_{ji}$ is bounded by $C\, 2^{j(d-1)}$ for any scale $j$. Summing (6.11)
over scales $j$, a bit of algebra shows that (provided $s > d(1/p - 1/2)_+$)
where $C$ possibly depends upon $d, s, p$ and the sampling resolution $b_0$. Let $p^*$ be defined
by $1/p^* = s/d + 1/2$. The inequality (6.12) merely states that the weak-$\ell^{p^*}$ norm of $\gamma$
is bounded by $\|T_d g\|_{b^{s + (d-1)/2}_{p\infty}}$ up to a multiplicative constant. The supporting Lemma 15
leads to the desired conclusion.
Of course, one would like to have a converse statement or, in other words, give lower
bounds on the squared error of approximation $\|f - \tilde{f}_N\|^2$. In the remarks directly following
the statement of our theorem, we have articulated a reasonable conjecture (6.9) expressing
the sharpness of the bound (6.8). Unfortunately, we are not able to prove this conjecture
yet. However, we have available a recipe for constructing radial functions whose ridgelet
coefficient sequences are not sparse. For instance, let $p_0$ be a real number such that $1/p_0 > 1/p^* =
s/d + 1/2$. Making use of Proposition 11 and Lemma 2 in Donoho (1993), one can find a
$\varphi \in B^s_{pq}$ ($\|\varphi\|_{B^s_{pq}} = 1$) such that the sequence of ridgelet coefficients is not in $w\ell^{p_0}$. Now,
the problem is that we don't have a converse to Lemma 15: that is, if we know something
about the rate of approximation by finite linear combinations of ridgelets (or their duals),
does it imply that the ridgelet coefficients have a certain decay? Let us be even more precise.
Suppose $f$ is such that there is a sequence $f_N = \sum_{i=1}^{N} \alpha_{iN}\, \psi_{\lambda_{iN}}$ with the property
\[
\|f - f_N\|_2 \le C\, N^{-r}.
\]
Does it imply that $\{\langle f, \psi_\lambda \rangle\} \in w\ell^p(\Lambda_d)$ with $r = (1/p - 1/2)_+$? In Approximation Theory,
this is often referred to as a Bernstein type of inequality, whereas Lemma 15 is a kind of
abstract Jackson inequality. Suppose the ridgelets were orthogonal; then the Bernstein
inequality would be trivial. The delicate issue here is, of course, that the ridgelets are
possibly linearly dependent. Work in this direction is in progress.
6.3 Examples

In this section, we consider a class of radial functions $(f_\alpha)$ defined by $f_\alpha(x) = (1 - \|x\|^2)_+^{\alpha}$.
Away from the sphere of radius $r = 1$, these functions are smooth, but they are singular across
this same sphere. Here, the index parameter $\alpha$ is simply the degree of the singularity (see
Definition 6). In all that follows, we will suppose $\alpha > -1/2$, so that $f_\alpha$ is square integrable.
First, a simple calculation shows that
\[
R_u f_\alpha(t) = |S^{d-2}| \int_t^1 (r^2 - t^2)^{(d-3)/2}\, (1 - r^2)^{\alpha}\, r\, dr
= c_d\, (1 - t^2)_+^{\alpha + (d-1)/2}.
\]
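The closed form for $R_u f_\alpha$ is easy to verify numerically in dimension $d = 2$, where the line integral can be computed directly; the value $c_2 = \sqrt{\pi}\, \Gamma(\alpha+1)/\Gamma(\alpha+3/2)$ used below is our own evaluation of the constant, not quoted from the text:

```python
import numpy as np
from math import gamma, sqrt, pi

alpha, t = 1.5, 0.4
h = sqrt(1 - t * t)
# Direct line integral of f_alpha(x) = (1 - |x|^2)_+^alpha over x . u = t in R^2
y = np.linspace(-h, h, 200001)
vals = (1 - t * t - y * y) ** alpha
direct = np.sum((vals[1:] + vals[:-1]) / 2) * (2 * h / (len(y) - 1))
# Closed form c_2 (1 - t^2)^{alpha + (d-1)/2} with d = 2
c2 = sqrt(pi) * gamma(alpha + 1) / gamma(alpha + 1.5)
closed = c2 * (1 - t * t) ** (alpha + 0.5)
print(direct, closed)
```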
We now check the sparsity of the representation of $f_\alpha$ in a ridgelet frame. Again, suppose
that we are using ridgelets such that their profiles $\varphi$ and $\psi$ have compact supports. We
will assume that $\varphi$ and $\psi$ are $R$ times differentiable and that $\psi$ has vanishing moments
through order $D$. Finally, let $p^*$ be defined by $1/p^* = 1/2 + (\alpha + 1/2)/(d - 1)$. We
show that the coefficients are in $w\ell^{p^*}$. If $\min(R, D)$ is large enough ($\min(R, D) \ge
\max(2/p^*, 2 + \alpha + (d-1)/2)$ suffices), we have
\[
\gamma_\lambda = \langle f_\alpha, \psi_\lambda \rangle = c_d \int (1 - t^2)_+^{\alpha + (d-1)/2}\, 2^{j/2}\, \psi(2^{j} t - k b_0)\, dt,
\]
\[
|\gamma_\lambda| \le C_d\, 2^{-j(1/2 + \alpha + (d-1)/2)}\, \bigl(1 + \bigl| |k b_0| - 2^{j} \bigr| \bigr)^{-2/p^*}
= C_d\, 2^{-j(d-1)/p^*}\, \bigl(1 + \bigl| |k b_0| - 2^{j} \bigr| \bigr)^{-2/p^*}. \tag{6.13}
\]
The proof of the first inequality is an application of integration by parts, and we omit it.
So, let $\epsilon > 0$ and let us bound the number of coefficients that are greater than or equal to $\epsilon$
in absolute value. For a given scale $j$ and orientation $u_{ji}$, it follows from (6.13) that
where the constants may depend on the parameter $b_0$. Moreover, (6.13) implies that
$\#\{k : |\gamma_\lambda| \ge \epsilon\} = 0$ whenever $j$ verifies $C_d\, 2^{-j(d-1)/p^*} < \epsilon$. Summing over scales and directions,
\[
\#\{\lambda : |\gamma_\lambda| \ge \epsilon\} = O(\epsilon^{-p^*}).
\]
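This counting can be mimicked numerically with model coefficients obeying exactly the bound (6.13); the dimension, $\alpha$ and $b_0$ below are arbitrary choices of ours:

```python
import numpy as np

d, alpha, b0 = 2, 0.5, 1.0
pstar = 1.0 / (0.5 + (alpha + 0.5) / (d - 1))     # here p* = 2/3

def count_exceedances(eps, jmax=40):
    """#{lambda : |gamma_lambda| >= eps} for coefficients saturating (6.13)."""
    total = 0
    for j in range(jmax):
        A = 2.0 ** (-j * (d - 1) / pstar)          # scale-j amplitude
        if A < eps:
            continue
        ndir = 2 ** (j * (d - 1))                  # ~ C 2^{j(d-1)} directions
        reach = (A / eps) ** (pstar / 2) - 1       # | |k b0| - 2^j | <= reach
        total += ndir * (2 * int(reach / b0) + 1)
    return total

# count * eps^{p*} should stay roughly constant across decades of eps.
ratios = [count_exceedances(e) * e ** pstar for e in (1e-2, 1e-3, 1e-4)]
print(ratios)
```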
Suppose one were to use wavelets to approximate the singular function $f_\alpha$ from the above
proposition. So let $(\varphi_k, \psi_{jk})$ ($\lambda = (j, k)$, $j \ge 0$, $k \in \mathbf{Z}^d$) be a nice isotropic wavelet basis in
$\mathbf{R}^d$ with compact support, enough regularity and vanishing moments. Standard techniques
show that for any $\alpha > 0$ the wavelet
approximation converges at the rate $O(N^{-(\alpha + 1/2)/(d-1)})$. In this case, it can be shown that
this rate cannot be improved.
6.4 Discussion

In this chapter, we have demonstrated in several places that ridgelets and multi-dimensional
wavelets are equally as good, at least asymptotically, at approximating radial functions. For
instance, in the previous section, we have shown explicitly that the rate of approximation
of, say, $f_\alpha(x) = (1 - \|x\|^2)_+^{\alpha}$, for $\alpha > -1/2$, using ridgelets coincides with the one obtained
using multi-dimensional wavelets, regardless of the degree $\alpha$ of the singularity and/or of
the dimension $d$. In fact, this conclusion is not limited to the case of radial functions. For
instance, one could consider a simple extension of our radial example: let $F$ be the set of
all diffeomorphisms $h : \mathbf{R}^d \to \mathbf{R}^d$ with, say, $h \in C^r$ for some $r > 0$ and such that all
the partial derivatives of $h$ up to order $r$ are bounded by some constant $C$. Now, for any
$h \in F$, consider $g = f_\alpha \circ h$. So, $g$ is a smooth function away from a hypersurface $\Sigma$ with
a minimum of curvature, and singular across $\Sigma$ (roughly speaking, one sees a singularity of
degree $\alpha$ when one crosses $\Sigma$ along a normal vector to the surface). It then turns out that if
$r$ is large enough, the conclusion of Proposition 12 remains unchanged for this larger class
of functions. That is, the rate of approximation of $g$ by ridgelets is $O(N^{-(\alpha+1/2)/(d-1)})$ and
so is the rate of convergence of wavelet approximations.
In the author's opinion, this result is most remarkable and surprising. The two dis-
cussed approximation schemes are, of course, very different: one builds up an
approximation using oscillatory bumps at various scales, whereas the other uses ridge
functions localized around strips of various scales. And yet, both seem to represent func-
tions that are smooth away from curved hypersurfaces, but that may be singular across
these hypersurfaces, with the same degree of accuracy.
We close this chapter by pointing out that our findings seem to contradict, at least
at a superficial level, the results obtained by Donoho and Johnstone (1989). In fact, the
model they consider is somewhat different, as they study the squared error of approximation
with respect to the Gaussian measure, whereas we are concerned with compactly supported
functions. Hence, their results are not inconsistent with ours. First, the behavior at infinity
of the functions they are treating may account for the divergence of our conclusions. Second,
their study only concerns the case of $\mathbf{R}^2$ and, perhaps, the effects they describe are specific
to this case.
Chapter 7
Concluding Remarks
7.1 Ridgelets and Traditional Neural Networks

The purpose of this section is to show that ridgelet approximations offer decisive superiority
over traditional neural networks. We recall that in the latter case one considers finite
approximations from $D_{NN} = \{\sigma(k \cdot x - b),\ k \in \mathbf{R}^d,\ b \in \mathbf{R}\}$, where $\sigma$ is the univariate
sigmoid. As we have already shown, neural networks cannot outperform ridgelets over the
classes $R^s_{pq}(C)$. However, outside of this paradigm, it is of interest to compare, for any
function $f$, the rates of approximation using either $D_{NN}$ or $D_{Ridgelets}$.

In Chapter 5, we defined the approximation error of a function $f$ by $N$ elements of
the dictionary $D$:
\[
d_N(f, D) = \inf_{(\alpha_i)_{i=1}^N}\ \inf_{(g_i)_{i=1}^N \subset D} \Bigl\| f - \sum_{i=1}^{N} \alpha_i g_i \Bigr\|_H.
\]
This measure is, indeed, insensitive to log-like factors. Now, one would like to know if there
exists a function $f$ for which
The answer to this question is generally negative. In order to give a precise statement,
we only consider finite linear approximations with bounded coefficients: i.e., we restrict the
minimization (7.1) to $|\alpha_i| \le C$ for some positive $C$. In practice, this restriction makes perfect
sense, since one is not going to store unbounded coefficients. Also, note that the relaxed
greedy algorithm (1.2)-(1.3) produces coefficients bounded by $1$. With this restriction, one
can show

Theorem 13 For any given $f$,
\[
r^*(f, D_{NN}) \le r^*(f, D_{Ridgelets}). \tag{7.2}
\]
in the sense of $L^2$. From the exact series displayed above, extract, as usual, the
finite sum $\sigma^0_N$ corresponding to the $N$ largest coefficients. The first part of the proof
consists of proving that
for any $s > 0$. To prove the claim, we will try to use a maximum number of results proved
in the previous chapters.

Let us consider for a moment the Meyer wavelet basis (Daubechies, 1992) $(\tilde{\varphi}_{0k}, \tilde{\psi}_{jk})$
on the real line. For scalars $a_0 > 0$ and $b_0$, $\sigma_{a_0 b_0}$ will denote the univariate function
$\sigma(a_0 (t - b_0))$. As one can guess, the wavelet coefficient sequence of $\sigma_{a_0 b_0} w$ is extremely
sparse. We claim that
\[
|\langle \sigma_{a_0 b_0} w,\, \tilde{\psi}_{jk} \rangle| \le
\begin{cases}
C\, 2^{-j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell}, & a_0 \ge 2^{j}, \\
C\, a_0\, 2^{-j}\, 2^{-j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell}, & a_0 \le 2^{j}.
\end{cases}
\]
Here the decay exponent $\ell$ can be taken arbitrarily large.
First, notice that $\sigma(t) = (1 + e^{-t})^{-1} = 1/2 + \varsigma(t)$ where $\varsigma$, according to Definition 6, is a
singularity of degree $1$ (up to a renormalizing constant). To see this, remark that $|\varsigma(t)|$
is obviously bounded by $|t|$ and an induction argument shows that for $n \ge 1$,
\[
\Bigl| \frac{d^n \varsigma}{dt^n}(t) \Bigr| \le C(n)\, e^{-|t|},
\]
which suffices to establish the singularity property. Then, for any positive $a_0$, $a_0^{-1} \varsigma(a_0 (t - b_0))$
is also a singularity of degree $1$. For a singularity of degree $1$, we know that
\[
|\langle a_0^{-1} \varsigma(a_0 (t - b_0)),\, \tilde{\psi}_{jk} \rangle| \le C\, 2^{-3j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell},
\]
where the constant $C$ may depend on $\ell$. Then, of course, $\tilde{\psi}_{jk}$ is orthogonal to the constant
function; hence,
which is the part of the result corresponding to the case $a_0 \le 2^{j}$. As far as the case $a_0 \ge 2^{j}$
is concerned, we observe that
\[
\frac{d}{dt}\, \sigma_{a_0 b_0}(t) = a_0\, \sigma'(a_0 (t - b_0))
\]
and, therefore, the above derivative is bounded in absolute value by $a_0\, e^{-a_0 |t - b_0|}$. Now, since
$\tilde{\psi}$ has one vanishing moment, a simple integration by parts gives
\[
|\langle \sigma_{a_0 b_0} w,\, \tilde{\psi}_{jk} \rangle| \le C\, 2^{-j/2}\, (1 + |k - 2^{j} b_0|)^{-\ell},
\]
which proves the remaining part of the claim. It follows from the previous inequalities that, for
every choice of $p$, $\|\beta\|_{\ell^p}$ is finite and uniformly bounded (over all the possible choices of $a_0$
and $b_0$). Furthermore, it is trivial that $\alpha_k = \langle \sigma_{a_0 b_0}, \tilde{\varphi}_{0k} \rangle$ satisfies $|\alpha_k| \le \|\tilde{\varphi}\|_1$.
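The two ingredients used above, $|\varsigma(t)| \le |t|$ and the exponential decay of the derivatives of $\sigma$, can be verified numerically from the closed forms of the first three derivatives; the constants in the assertions are our own choices:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-30, 30, 6001)
s = sigma(t)
d1 = s * (1 - s)                        # sigma'
d2 = d1 * (1 - 2 * s)                   # sigma''
d3 = d2 * (1 - 2 * s) - 2 * d1 * d1     # sigma'''
decay = np.exp(-np.abs(t))
print(np.max(np.abs(d2) / decay), np.max(np.abs(d3) / decay))
```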
Now, back in terms of our frame decomposition, the property we have just shown is very
interesting, since it implies that the sequence $(\beta^0_\lambda)_{\lambda \in \Lambda_d}$ in (7.3) is also uniformly bounded
(over all choices of $\sigma^0$) in $\ell^p$. This merely follows from Chapter 5 (section 5.2.1), as it is a
consequence of Lemma 10. Then, Lemma 15 allows us to conclude and establish the validity
of (7.4).
Step 2. We are now in a position to finish up the proof of the theorem. We show that, for
any $r < r^*(f, D_{NN})$, there is a sequence $f_N$ of $N$-term ridgelet expansions with the property
\[
\|f - f_N\| = O(N^{-r}).
\]
It is not difficult to construct such a sequence, since by assumption we know the existence
of a sequence $g_m(x) = \sum_{i=1}^{m} \alpha_{im}\, \sigma(k_{im} \cdot x - b_{im})$ (that we shall write $g_m = \sum_i \alpha_{im} \sigma_{im}$),
such that
\[
\|f - g_m\|_2 \le C\, m^{-r'}
\]
with, say, $r' = (r + r^*(f, D_{NN}))/2$. For a fixed $N$, let $m$ be defined by $m = N^{1-\epsilon}$, where
$(1 - \epsilon) r' = r$. Next, define $f_N$ as follows:
\[
f_N = \sum_{i=1}^{m} \alpha_{im}\, \tilde{\sigma}_{im},
\]
where $\tilde{\sigma}_{im}$ is an $m$-term ridgelet expansion of $\sigma_{im}$ as in the first step. Then
\[
\|f - f_N\| = O(m^{-r'}) + m\, O(m^{-s}) = O(N^{-(1-\epsilon) r'}) + O(N^{-(1-\epsilon)(s-1)}) = O(N^{-r}),
\]
since $s$ may be chosen arbitrarily large (in particular $s - 1 > r'$). This finishes the proof of
the theorem.
Remark. In fact, the proof of the theorem shows that the assumption requiring the
coefficients $\alpha_i$ to be bounded may be dropped. The same result is true if we restrict the
coefficients to grow polynomially: that is, in the best approximation (7.1) we only require
$|\alpha_i| \le B N^{\beta}$ for some constants $B$ and $\beta$.
In this section, we will refer to this class as the Barron class $B(C)$ and to $\|\cdot\|_B$ as the
Barron norm. His work shows that
where the Fourier dictionary is, of course, $D_{Fourier} = \{e^{i 2\pi n \cdot x},\ n \in \mathbf{Z}^d\}$. Moreover, thresh-
olding the Fourier expansion (keeping the $N$ terms corresponding to the $N$ largest Fourier
coefficients) gives the optimal rate of approximation. In the next paragraph, we will show
how to use the tools developed in this thesis to prove that (7.7) is indeed a lower bound. As
far as the upper bound is concerned, observe that the Barron norm is actually equivalent
to
\[
\|f\|_B \asymp \sum_{n \in \mathbf{Z}^d} \|n\|\, |\hat{f}(2\pi n)|,
\]
as a simple consequence of a famous theorem due to Plancherel and Pólya (1938). Next,
a bound on the right-hand side of the above display implies a bound on the weak-$\ell^p$
quasi-norm (5.20), with $1/p = 1 + 1/d$, of the sequence $(|\hat{f}(2\pi n)|)_{n \in \mathbf{Z}^d}$. Thus, Fourier series outperform neural
networks over the $B(C)$ model of functions.
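The step from the weighted-$\ell^1$ bound to the weak quasi-norm is a counting argument that can be sketched numerically in $d = 2$: if $\sum \|n\|\, |c_n| = B$, then among the $k$ largest coefficients at least half the indices satisfy $\|n\| \gtrsim k^{1/2}$, forcing $|c|_{(k)} \le C\, B\, k^{-3/2} = C\, B\, k^{-1-1/d}$. The lattice truncation, the test coefficients and the constant below are ours:

```python
import numpy as np

# Truncated lattice Z^2 \ {0}; coefficients c_n = |n|^{-3} give a finite
# Barron-type sum B = sum over n of |n| * |c_n|.
R = 60
ii, jj = np.meshgrid(np.arange(-R, R + 1), np.arange(-R, R + 1))
norms = np.hypot(ii, jj).ravel()
norms = norms[norms > 0]
c = norms ** -3.0
B = np.sum(norms * c)
# Counting argument: |c|_(k) <= C * B * k^{-3/2}; C = 10 is safely enough here.
s = np.sort(c)[::-1]
k = np.arange(1, len(s) + 1)
print(np.max(s * k ** 1.5), 10 * B)
```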
In his analysis, Barron shows that the rate (7.6) may be obtained with expansions
of the form $f_N(x) = \sum_{i=1}^{N} \alpha_i\, \sigma(k_i \cdot x - b_i)$, where the $\alpha_i$'s may be restricted to satisfy
$\sum_{i=1}^{N} |\alpha_i| \le C'$ for some constant $C'$. Hence, a straightforward application of Theorem
13 gives at the minimum $d_N(B(C), D_{Dual\text{-}Ridge}) = O(N^{-s})$ for any $s < 1/2$. Moreover,
a bit of ridgelet analysis shows that the rate of approximation of the Barron class by any
reasonable dictionary (see Chapter 5, section 5.1) is bounded below by $O(N^{-1/2 - 1/d})$. The
reason is the following: let us consider a pair $(\varphi, \psi)$ satisfying the conditions (i)-(iv) listed
at the very beginning of Chapter 4 together with (2.7), namely,
\[
\frac{|\hat{\varphi}(\lambda)|^2}{\lambda^{d-1}} + \sum_{j \ge 0} \frac{|\hat{\psi}(2^{-j} \lambda)|^2}{|2^{-j} \lambda|^{d-1}} = 1.
\]
Then we have
\[
\|f\|_B = \int_{\mathbf{R}^d} \|\xi\|\, |\hat{f}(\xi)|\, d\xi
= \int_{\lambda > 0} \int_{S^{d-1}} |\hat{f}(\lambda u)|\, \lambda^{d}\, d\lambda\, du
\]
\[
= \int_{\lambda > 0} \int_{S^{d-1}} |\hat{f}(\lambda u)|\, |\hat{\varphi}(\lambda)|^2\, \lambda\, d\lambda\, du
+ \sum_{j \ge 0} 2^{j(d-1)} \int_{\lambda > 0} \int_{S^{d-1}} |\hat{f}(\lambda u)|\, |\hat{\psi}(2^{-j} \lambda)|^2\, \lambda\, d\lambda\, du
\]
\[
\le \Bigl( \int\!\!\int |\hat{\varphi}(\lambda)|^2 \lambda^2\, d\lambda\, du \Bigr)^{1/2} \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\varphi}(\lambda)|^2\, d\lambda\, du \Bigr)^{1/2}
+ \sum_{j \ge 0} 2^{j(d-1)} \Bigl( \int\!\!\int |\hat{\psi}(2^{-j} \lambda)|^2 \lambda^2\, d\lambda\, du \Bigr)^{1/2} \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\psi}(2^{-j} \lambda)|^2\, d\lambda\, du \Bigr)^{1/2}
\]
\[
\le C \Biggl( \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\varphi}(\lambda)|^2\, d\lambda\, du \Bigr)^{1/2}
+ \sum_{j \ge 0} 2^{j(d + 1/2)} \Bigl( \int\!\!\int |\hat{f}(\lambda u)|^2\, |\hat{\psi}(2^{-j} \lambda)|^2\, d\lambda\, du \Bigr)^{1/2} \Biggr)
= C\, \|f\|_{R^{d/2+1}_{2,1}}. \tag{7.8}
\]
Then, it follows from (7.8) that the Barron class $B(C)$ contains $R^{d/2+1}_{2,1}(C_1)$ for some con-
stant $C_1$, and since the rate of approximation of the latter class is bounded below by
$O(N^{-1/2 - 1/d})$ (Theorem 9), so is that of $B(C)$. We hope that this example emphasizes further the
use and the potential power of the ridgelet analysis that we have developed in this thesis.
where (respectively ~ ) stand for the ridgelet coe cients hf i (respectively the dual
ridgelet coe cientshf ~ i). A natural question is whether or not these two approaches
generally give the same results. That is, for a given f are the approximation errors kf ; f~N k
and kf ; fN k of about the same order? We know that
and, thus, the problem unquestionably takes place at the coefficient level. In this thesis, we have mainly studied dual-ridgelet expansions because one has direct access to the ridgelet coefficients, since the ridgelets are known explicitly. It is much more delicate to work out similar estimates for the dual coefficient sequence.
In this direction, it would be interesting to know if both sequences have the same structure; that is, let $\|\cdot\|$ be a norm on the sequence space, like (5.14) for example. Is the norm based on the ridgelet coefficient sequence equivalent to the one based on the dual sequence? In particular, in view of the previous display, one would like to know, for example, if, for a fixed $p > 0$,
$$\|\alpha\|_{\ell^p} \;\asymp\; \|\tilde\alpha\|_{\ell^p}, \qquad\mathrm{(7.9)}$$
the aim being to transfer knowledge about the size and/or organization of $\alpha$ to $\tilde\alpha$. In fact, one direction of (7.9) turns out to be easy. To fix ideas, suppose we are given a frame $(\psi_\gamma)_{\gamma\in\Gamma_d}$ of $L^2([0,2]^d)$ and a function $f$ in $L^2([0,1]^d)$ such that $\tilde\alpha \in \ell^p$. Let $0 \le w \le 1$ be a $C^\infty$ window, supported in $[0,2]^d$, whose restriction to $[0,1]^d$ is 1. From the frame decomposition
$$f \;=\; fw \;=\; \sum_{\gamma\in\Gamma_d} \alpha_\gamma\, \tilde\psi_\gamma \;=\; \sum_{\gamma\in\Gamma_d} \tilde\alpha_\gamma\, \psi_\gamma,$$
it follows that
$$\alpha_\gamma = \sum_{\gamma'\in\Gamma_d} \langle w\,\psi_{\gamma'}, \psi_\gamma\rangle\, \tilde\alpha_{\gamma'}
\qquad\text{and}\qquad
\tilde\alpha_\gamma = \sum_{\gamma'\in\Gamma_d} \langle w\,\tilde\psi_{\gamma'}, \tilde\psi_\gamma\rangle\, \alpha_{\gamma'}.$$
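The interplay between a frame and its dual in the display above can be checked numerically in finite dimensions. The following sketch is generic linear algebra, with the frame stored as rows of a matrix (all names are ours): it builds the canonical dual $\tilde\psi_\gamma = S^{-1}\psi_\gamma$ and verifies that both expansions reconstruct $f$.

```python
import numpy as np

rng = np.random.default_rng(1)
Psi = rng.standard_normal((10, 3))      # 10 frame vectors in R^3, as rows
S = Psi.T @ Psi                         # frame operator S = sum_g psi_g psi_g^T
Psi_dual = Psi @ np.linalg.inv(S)       # canonical dual frame, row by row

f = rng.standard_normal(3)
alpha = Psi @ f                         # coefficients <f, psi_g>
alpha_dual = Psi_dual @ f               # dual coefficients <f, psi~_g>

# f = sum_g <f, psi_g> psi~_g = sum_g <f, psi~_g> psi_g
print(np.allclose(Psi_dual.T @ alpha, f), np.allclose(Psi.T @ alpha_dual, f))
```

Both reconstructions succeed because $\sum_\gamma \langle f,\psi_\gamma\rangle \tilde\psi_\gamma = S^{-1} S f = f$.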
Now, the estimates obtained in section 5.2.1 imply that for any $p > 0$,
$$\|\alpha\|_{\ell^p} \;\le\; C_p\, \|\tilde\alpha\|_{\ell^p},$$
provided that $\psi$ has a sufficiently large number of derivatives and vanishing moments.
establish the converse, it would be su cient to show that, say,
X
w ~ = 0 0
0
with
X X
sup
0
j 0 jp _ sup0
j 0 jp Cp
which expresses that the dual frame has a sparse representation in the original one. The
author has proven some results in this direction but they are too fragmentary to be reported.
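The sup-row/sup-column condition above is a Schur-type test. A small numerical sketch (our own illustration, exercised for $p \ge 1$ only) shows why such a bound controls the action of a matrix on every $\ell^p$: for a matrix with geometric off-diagonal decay, the larger of the two sup-sums dominates the operator norm on $\ell^1$, $\ell^2$, and $\ell^\infty$.

```python
import numpy as np

def schur_bound(M):
    """Max of sup-column and sup-row absolute sums of M; by the Schur test
    (and interpolation) this bounds ||M||_{l^p -> l^p} for every p in [1, inf]."""
    A = np.abs(M)
    return max(A.sum(axis=0).max(), A.sum(axis=1).max())

n = 50
i, j = np.indices((n, n))
M = 0.5 ** np.abs(i - j)                 # geometric off-diagonal decay
x = np.random.default_rng(2).standard_normal(n)
for p in (1, 2, np.inf):
    assert np.linalg.norm(M @ x, p) <= schur_bound(M) * np.linalg.norm(x, p)
```

The $p=1$ and $p=\infty$ cases are the two sup-sums themselves; intermediate $p$ follow by interpolation, which is the mechanism invoked throughout this chapter.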
Following ideas developed in Donoho, Johnstone, Kerkyacharian, and Picard (1995) and Donoho and Johnstone (1995), we expect to derive the asymptotics of $R^*(\epsilon)$ (as $\epsilon \to 0$) for the classes $\mathcal{R}^s_{pq}$. More importantly, consider estimators of the form
$$\hat f \;=\; \sum_{\gamma\in\Gamma_d} \delta_\gamma\big(\langle \psi_\gamma, Y\rangle\big)\, \tilde\psi_\gamma,$$
where $(\psi_\gamma)_{\gamma\in\Gamma_d}$ is a nice ridgelet frame and where the functions $\delta_\gamma$ are scalar nonlinearities (hard/soft-thresholding, etc.) depending upon $\gamma$ and the parameters $s, p, q$. We then expect such estimators to be adaptively nearly minimax, possibly within log-like factors, for estimating objects from these classes. In short, thresholding the noisy ridgelet coefficients should be nearly optimal for the estimation of objects exhibiting the spatial inhomogeneities described in Chapter 4.
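The thresholding rules mentioned above are elementary to state in code. The sketch below is a generic illustration in a sequence model, not the thesis's estimator: it applies soft thresholding at the universal level $t = \sigma\sqrt{2\log n}$ to noisy coefficients of a sparse vector and compares mean squared errors.

```python
import numpy as np

def soft(y, t):   # soft-thresholding: shrink toward 0 by t, kill |y| <= t
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def hard(y, t):   # hard-thresholding: keep y where |y| > t, else 0
    return np.where(np.abs(y) > t, y, 0.0)

rng = np.random.default_rng(3)
theta = np.zeros(1024); theta[:8] = 5.0              # sparse "coefficients"
sigma = 0.1
y = theta + sigma * rng.standard_normal(theta.size)  # noisy observations
t = sigma * np.sqrt(2 * np.log(theta.size))          # universal threshold
print(np.mean((soft(y, t) - theta) ** 2) < np.mean((y - theta) ** 2))  # True
```

Because most coordinates are pure noise, thresholding wipes them out while paying only a small shrinkage price on the few large coefficients, which is the heuristic behind near-minimaxity over sparsity classes.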
APPENDIX A. PROOFS AND RESULTS 108
same flavor.
$$\Big| \int_{S^{d-1}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, \epsilon\, du \;-\; \int_{S^{d-1}} \int_{[\sigma_0, 2\sigma_0]} \sum_j |\hat f(\sigma u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma\, du \Big| \qquad\mathrm{(A.2)}$$
$$\le\; \int_{S^{d-1}} \sum_j \Big| \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, \epsilon \;-\; \int_{[\sigma_0, 2\sigma_0]} |\hat f(\sigma u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big|\, du$$
$$\le\; \epsilon \int_{S^{d-1}} \sum_j \Big( \int_{[\sigma_0, 2\sigma_0]} |\sigma|\, |\hat f(\sigma u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{[\sigma_0, 2\sigma_0]} \|\nabla \hat f(\sigma u)\|^2\, |\sigma|\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} du$$
$$\le\; \epsilon \int_{S^{d-1}} \Big( \int_{[\sigma_0, 2\sigma_0]} |\sigma|\, |\hat f(\sigma u)|^2 \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{[\sigma_0, 2\sigma_0]} \|\nabla \hat f(\sigma u)\|^2\, |\sigma| \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} du$$
$$\le\; \epsilon\, \Big( \int_{S^{d-1}}\!\int_{[\sigma_0, 2\sigma_0]} |\sigma|\, |\hat f(\sigma u)|^2 \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma\, du \Big)^{1/2} \Big( \int_{S^{d-1}}\!\int_{[\sigma_0, 2\sigma_0]} \|\nabla \hat f(\sigma u)\|^2\, |\sigma| \sum_j |\hat\psi(2^{-j}\sigma)|^2\, d\sigma\, du \Big)^{1/2}.$$
Adding the coarse scale term amounts to substituting $\sum_j |\hat\psi(2^{-j}\sigma)|^2$ by $|\hat\varphi(\sigma)|^2 + \sum_{j\ge 0} |\hat\psi(2^{-j}\sigma)|^2$ in the last inequality. Recall that by assumption the pair $(\varphi, \psi)$ obeys (2.7), and suppose
$$\sup_\sigma\; \Big( |\hat\varphi(\sigma)|^2 + \sum_{j\ge 0} |\hat\psi(2^{-j}\sigma)|^2 \Big) \;\le\; B.$$
The last inequality of (A.2), together with the conditions on $\varphi$ and $\psi$, implies
$$\Big| \int_{S^{d-1}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, \epsilon\, du \;-\; \|\hat f\|_2^2 \Big|
\;\le\; \epsilon\, B\, \Big( \int\!\!\int |\sigma|\, |\hat f(\sigma u)|^2\, d\sigma\, du \Big)^{1/2} \Big( \int\!\!\int \|\nabla \hat f(\sigma u)\|^2\, |\sigma|\, d\sigma\, du \Big)^{1/2}$$
$$\le\; \epsilon\, B\sqrt{2}\, \|\hat f\|_2 \cdot 2\, \|\nabla \hat f\|_2 \;=\; 2\sqrt{2}\,\epsilon\, B\, \|\hat f\|_2\, \|\nabla \hat f\|_2. \qquad\mathrm{(A.3)}$$
$$\sum_j \sum_i \tfrac{1}{2} \Big( \int_{\mathbb{R}} |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{\mathbb{R}} |\hat f(\sigma_{ij} u)|^2\, |2^{-j}\sigma|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2}$$
$$\le\; \tfrac{1}{2} \Big( \int_{\mathbb{R}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2} \Big( \int_{\mathbb{R}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |2^{-j}\sigma|^2\, |\hat\psi(2^{-j}\sigma)|^2\, d\sigma \Big)^{1/2}$$
for some constant $B'$ depending only on $\psi$. Therefore, we can deduce from all this that
$$\Big| \sum_\gamma |\alpha_\gamma|^2 \;-\; \frac{1}{2b_0} \int_{\mathbb{R}} \sum_j \sum_i |\hat f(\sigma_{ij} u)|^2\, |\hat\psi(2^{-j}\sigma_{ij})|^2\, d\sigma \Big| \;\le\; \frac{C'}{b_0}\, \|\hat f\|_2^2 \qquad\mathrm{(A.4)}$$
for some constant $C'$.
Finally, combining (A.3) and (A.4) gives
$$\Big| \sum_\gamma |\alpha_\gamma|^2 \;-\; \frac{(2\pi)^d}{b_0}\, \|\hat f\|_2^2 \Big| \;\le\; \frac{2\sqrt{2}\,\epsilon\, C_0}{b_0}\, \|\hat f\|_2^2 \;+\; \frac{C'}{b_0}\, \|\hat f\|_2^2.$$
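The conclusion above is a frame-bound statement: the coefficient energy $\sum_\gamma |\alpha_\gamma|^2$ is trapped between $A\|f\|^2$ and $B\|f\|^2$. In finite dimensions this is easy to verify directly, the sharp bounds being the extreme eigenvalues of the frame operator. The sketch below is a generic check of that fact, not tied to the ridgelet frame.

```python
import numpy as np

rng = np.random.default_rng(0)
Psi = rng.standard_normal((12, 4))        # 12 frame vectors in R^4, as rows
eig = np.linalg.eigvalsh(Psi.T @ Psi)     # eigenvalues of the frame operator
A, B = eig[0], eig[-1]                    # sharp frame bounds

f = rng.standard_normal(4)
energy = np.sum((Psi @ f) ** 2)           # sum_g |<f, psi_g>|^2
print(A * (f @ f) <= energy <= B * (f @ f))  # True
```

The proof above establishes exactly this kind of two-sided control, with bounds degrading gracefully in the sampling parameters $\epsilon$ and $b_0$.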
Chapter 4
Proof of Proposition 8. The fact that $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ is a normed space is trivial. We now prove the completeness. Let $\{f_n\}_{n\ge 0}$ be a Cauchy sequence in $\mathcal{R}^s_{pq}(\mathbb{R}^d)$. Then, of course, $\{f_n\}_{n\ge 0}$ is a Cauchy sequence in $L^1(\mathbb{R}^d)$ and, therefore, converges to a limit $f$ in $L^1$. Let $w_j(f)(u,b)$ (resp. $v(f)(u,b)$) be $(R_u f * \psi_j)(b)$ (resp. $(R_u f * \varphi)(b)$). It follows that $w_j(f_n)(u,b)$ converges to $w_j(f)(u,b)$ a.e. (and similarly for $v(f_n)(u,b)$). On the other hand, $\{w_j(f_n)(u,b)\}_{n\ge 0}$ is a Cauchy sequence in $L^p(S^{d-1} \times \mathbb{R})$ and therefore the limiting function, which coincides with $w_j(f)(u,b)$, is in $L^p(S^{d-1} \times \mathbb{R})$. Now, it follows by standard arguments that $f$ belongs to $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ and that $f_n$ converges in $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ to $f$. Hence, $\mathcal{R}^s_{pq}(\mathbb{R}^d)$ is complete.
Chapter 6
Lemma 16 The operator $U_d$ defined by
$$(U_d\, g)(t) \;=\; \int_0^1 (r - t)_+^{(d-3)/2}\, g(r)\, dr$$
is bounded from $B^s_{pq}(0,1)$ to $B^{s+(d-1)/2}_{pq}(0,1)$.
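The fractional-integral operator $U_d$ is easy to explore numerically. The sketch below is our own quadrature illustration (names and grid size are ours); for $d = 3$ the kernel is the indicator of $r > t$, so $U_3 g(t)$ reduces to $\int_t^1 g$.

```python
import numpy as np

def U_d(g, t, d, n=4001):
    """Trapezoid-rule approximation of (U_d g)(t) = int_0^1 (r-t)_+^((d-3)/2) g(r) dr."""
    r = np.linspace(0.0, 1.0, n)
    expo = (d - 3) / 2.0
    kernel = np.where(r > t, np.maximum(r - t, 0.0) ** expo, 0.0)
    vals = kernel * g(r)
    h = r[1] - r[0]
    return float(np.sum((vals[:-1] + vals[1:]) * h / 2.0))

one = lambda r: np.ones_like(r)
print(round(U_d(one, 0.5, 3), 3))   # int_{1/2}^1 1 dr -> 0.5
```

Averaging against the smooth kernel $(r-t)_+^{(d-3)/2}$ is what produces the $(d-1)/2$ gain of smoothness asserted by the lemma.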
Proof of Lemma. Let $(\varphi_{j_0 k})_k, (\psi_{jk})_{jk}$, $0 \le k < 2^j$, $j \ge j_0$, be a wavelet basis on the interval $(0,1)$. We will assume that the basis functions are $R$ times differentiable and of compact support (support width $\le C\,2^{-j}$). In addition, we will suppose that the $\psi_{jk}$'s have vanishing moments through order $D$. We will denote by $\alpha_{j_0 k} = \langle f, \varphi_{j_0 k}\rangle$ and $\beta_{jk} = \langle f, \psi_{jk}\rangle$ the wavelet coefficients of $f$. We recall that each Besov space $B^s_{pq}(0,1)$ with $(1/p - 1)_+ < s < \min(R, D)$ and $0 < p, q \le \infty$ is characterized by its coefficients, in the sense that
$$\|f\|_{B^s_{pq}(0,1)} \;\asymp\; \big\|(\alpha_{j_0 k})_k\big\|_{\ell^p} \;+\; \Big\| \Big( 2^{js}\, 2^{j(1/2 - 1/p)} \big( \sum_k |\beta_{jk}|^p \big)^{1/p} \Big)_{j \ge j_0} \Big\|_{\ell^q}, \qquad\mathrm{(A.5)}$$
the right-hand side of (A.5) being an equivalent norm to the norm of $B^s_{pq}(0,1)$.
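The sequence norm (A.5) is straightforward to evaluate for finitely many levels. In the sketch below (our own helper; the coefficient layout and names are assumptions, and only $p, q \ge 1$ are exercised), level $j$ carries the weight $2^{j(s + 1/2 - 1/p)}$:

```python
import numpy as np

def besov_seq_norm(alpha0, betas, s, p, q, j0=0):
    """||(alpha_{j0 k})_k||_p + || ( 2^{js} 2^{j(1/2-1/p)} ||beta_j||_p )_{j>=j0} ||_q."""
    coarse = np.linalg.norm(np.asarray(alpha0, dtype=float), ord=p)
    weights = [2.0 ** (j * (s + 0.5 - 1.0 / p)) *
               np.linalg.norm(np.asarray(b, dtype=float), ord=p)
               for j, b in enumerate(betas, start=j0)]
    return float(coarse + np.linalg.norm(np.asarray(weights), ord=q))

print(besov_seq_norm([1.0], [[1.0], [1.0]], s=1, p=2, q=1))  # -> 4.0
```

Bounding an operator in this norm amounts to bounding a weighted matrix acting on the coefficient arrays, which is how the proof proceeds.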
Further, suppose one wants to bound the norm of an operator $T : B^s_{pq}(0,1) \to B^{s'}_{pq}(0,1)$ for all possible choices of $1 \le p, q \le \infty$. (The norm being defined as $\sup \|Tf\|_{B^{s'}_{pq}} / \|f\|_{B^s_{pq}}$.) The theory of interpolation tells us that it is sufficient to check the boundedness for the cases $(p,q) \in \{(1,1), (1,\infty), (\infty,1), (\infty,\infty)\}$. That is what we shall verify for $U_d$. Now let
$g_{j_0}(t) = \int (y - t)_+^{(d-3)/2}\, \varphi_{j_0 0}(y)\, dy$ and $g_j(t) = \int (y - t)_+^{(d-3)/2}\, \psi_{j 0}(y)\, dy$. We have, for $t \in \mathbb{R}$,
$$g_{j_0}(t) \;=\; \int (y - t)_+^{(d-3)/2}\, 2^{j_0/2}\, \varphi(2^{j_0} y)\, dy.$$
Moreover, for $\ell \le \min(R, D)$ we have
$$g_j(t) \;=\; \int (y - t)_+^{(d-3)/2}\, 2^{j/2}\, \psi(2^{j} y)\, dy.$$
Then, of course,
$$|\gamma(\lambda, \lambda')| \;\le\; C\, 2^{-\min(j,j')(d-1)/2}\; 2^{-|j-j'|(n+1/2)}\; \Big( 1 + 2^{\min(j,j')}\, \big| k 2^{-j} - k' 2^{-j'} \big| \Big)^{-2},$$
where $\lambda = (j,k)$ and $\lambda' = (j',k')$. From this last inequality, it follows that
$$\sup_{k' \le 2^{j'}} \sum_{k \le 2^{j}} |\gamma(\lambda, \lambda')| \;\le\;
\begin{cases} C\, 2^{-j'(d-1)/2}\, 2^{-(j-j')(n-1/2)}, & j \ge j', \\ C\, 2^{-j(d-1)/2}\, 2^{-(j'-j)(n+1/2)}, & j' \ge j, \end{cases} \qquad\mathrm{(A.6)}$$
and
$$\sup_{k \le 2^{j}} \sum_{k' \le 2^{j'}} |\gamma(\lambda, \lambda')| \;\le\;
\begin{cases} C\, 2^{-j'(d-1)/2}\, 2^{-(j-j')(n+1/2)}, & j \ge j', \\ C\, 2^{-j(d-1)/2}\, 2^{-(j'-j)(n-1/2)}, & j' \ge j. \end{cases} \qquad\mathrm{(A.7)}$$
But (A.7) yields
$$\sup_{\lambda'} \sum_{\lambda} 2^{j(s+(d-1)/2-1/2)}\, 2^{-j'(s-1/2)}\, |\gamma(\lambda, \lambda')|
\;\le\; \sup_{j'} \Big( \sum_{j \le j'} 2^{-(j'-j)(s-1/2)}\, 2^{-(j'-j)(n+1/2)} \;+\; \sum_{j > j'} 2^{(j-j')(s+(d-1)/2-1/2)}\, 2^{-(j-j')(n-1/2)} \Big) \;\le\; C,$$
where the last inequality holds as long as $n > s + (d-1)/2$. This fact implies the boundedness for $(1,1)$.
Case $(\infty, \infty)$. This time,
$$\sup_{\lambda} \sum_{\lambda'} 2^{j(s+(d-1)/2-1/2)}\, 2^{-j'(s-1/2)}\, |\gamma(\lambda, \lambda')|
\;\le\; \sup_{j} \Big( \sum_{j' > j} 2^{(j-j')(s-1/2)}\, 2^{-(j'-j)(n-1/2)} \;+\; \sum_{j' \le j} 2^{(j-j')(s+(d-1)/2-1/2)}\, 2^{-(j-j')(n+1/2)} \Big) \;\le\; C,$$
where again, the last inequality is verified if $n > s + (d-3)/2$. The cases $(1,\infty)$ and $(\infty,1)$ are analogous. The lemma is proved.
References
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39, 930–945.
Benveniste, A., and Zhang, Q. (1992). Wavelet networks. IEEE Transactions on Neural Networks, 3, 889–898.
Bernier, D., and Taylor, K. F. (1996). Wavelets from square-integrable representations. SIAM J. Math. Anal., 27, 594–608.
Boas, R. P., Jr. (1952). Entire functions. New York: Academic Press.
Candes, E. J. (1996). Harmonic analysis of neural networks. To appear in Applied and Computational Harmonic Analysis.
Cheng, B., and Titterington, D. M. (1994). Neural networks: a review from a statistical perspective. With comments and a rejoinder by the authors. Stat. Sci., 9, 2–54.
Conway, J. H., and Sloane, N. J. A. (1988). Sphere packings, lattices and groups. New York: Springer-Verlag.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2, 303–314.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Daubechies, I., Grossmann, A., and Meyer, Y. (1986). Painless nonorthogonal expansions. J. Math. Phys., 27, 1271–1283.
Deans, S. R. (1983). The Radon transform and some of its applications. John Wiley & Sons.
DeVore, R. A., Oskolkov, K. I., and Petrushev, P. P. (1997). Approximation by feed-forward neural networks. Ann. Numer. Math., 4, 261–287.
DeVore, R. A., and Temlyakov, V. N. (1996). Some remarks on greedy algorithms. Adv. Comput. Math., 5, 173–187.
Donoho, D. L. (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Applied and Computational Harmonic Analysis, 1, 100–115.
Donoho, D. L. (1996). Unconditional bases and bit-level compression. Applied and Computational Harmonic Analysis, 3, 388–392.
Donoho, D. L. (1997). Fast ridgelet transform in two dimensions (Tech. Rep.). Department of Statistics, Stanford CA 94305-4065: Stanford University.
Donoho, D. L., and Johnstone, I. M. (1989). Projection-based approximation and a duality with kernel methods. Ann. Statist., 17, 58–106.
Donoho, D. L., and Johnstone, I. M. (1995). Empirical atomic decomposition.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: asymptopia? J. Roy. Statist. Soc. Ser. B, 57, 301–369.
Duffin, R. J., and Schaeffer, A. C. (1952). A class of nonharmonic Fourier series. Trans. Amer. Math. Soc., 72, 341–366.
Duflo, M., and Moore, C. C. (1976). On the regular representation of a nonunimodular locally compact group. J. Functional Analysis, 21, 209–243.
Feichtinger, H. G., and Gröchenig, K. (1988). A unified approach to atomic decompositions via integrable group representations. In Function spaces and applications (Lund, 1986). Berlin-New York: Springer.
Frazier, M., Jawerth, B., and Weiss, G. (1991). Littlewood-Paley theory and the study of function spaces (Vol. 79). Providence, RI: American Math. Soc.
Friedman, J. H., and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc., 76, 817–823.
Härdle, W. (1990). Applied nonparametric regression. Cambridge, England: Cambridge University Press.
Hasminskii, R., and Ibragimov, I. (1990). On density estimation in the view of Kolmogorov's ideas in approximation theory. Ann. Statist., 18, 999–1010.
Holschneider, M. (1991). Inverse Radon transforms through inverse wavelet transforms. Inverse Problems, 7, 853–861.
Huber, P. J. (1985). Projection pursuit, with discussion. Ann. Statist., 13, 435–525.
Jones, L. K. (1992a). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20, 608–613.
Jones, L. K. (1992b). On a conjecture of Huber concerning the convergence of projection pursuit regression. Ann. Statist., 15, 880–882.
Katznelson, Y. (1968). An introduction to harmonic analysis. New York: Wiley.
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Mallat, S., and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397–3415.
Meyer, Y. (1992). Wavelets and operators. Cambridge University Press.
Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8, 164–177.
Mhaskar, H. N., and Micchelli, C. A. (1995). Degree of approximation by neural and translation networks with a single hidden layer. Adv. in Appl. Math., 16, 151–183.
Montgomery, H. L. (1978). The analytic principle of the large sieve. Bull. Amer. Math. Soc., 84, 547–567.