OPERATOR LEARNING ALGORITHMS AND ANALYSIS
1. Introduction
This paper overviews algorithms and analysis related to the subject of operator
learning: finding approximations of maps between Banach spaces, from data. Our
focus is primarily on neural operators, which leverage the success of neural networks
in finite dimensions; but we also cover the related literature specific to
learning linear operators, and the use of Gaussian processes and random features.
In subsection 1.1 we discuss the motivation for our specific perspective on operator
learning. Subsection 1.2 contains a literature review. Subsection 1.3 overviews the
remainder of the paper.
1.1. High Dimensional Vectors Versus Functions. Many tasks in machine
learning require operations on high dimensional tensors1 arising, for example, from
pixellation of images or from discretization of a real-valued mapping defined over a
subset of R^d. The main idea underlying the work that we overview in this paper is
that it can be beneficial, when designing and analyzing algorithms in this context,
to view these high dimensional vectors as functions u : D → R^c defined on a
domain D ⊂ R^d. For example, (c, d) = (3, 2) for RGB images and (c, d) = (1, 3) for
a scalar field such as temperature in a room. Pixellation, or discretization, of D
will lead to a tensor with size scaling like N , the number of pixels or discretization
points; N will be large and hence the tensor will be of high dimension. Working
with data-driven algorithms designed to act on the function u, rather than the high
dimensional tensor, captures intrinsic properties of the problem, and not details
related to a specific pixellation or discretization; as a consequence, models learned
from data can be transferred from one pixellation or discretization level to another.
Consider the image shown in Figure 1a, at four levels of resolution. As an RGB
image it may be viewed as a vector in R^{3N} where N is the number of pixels. However,
by the time we reach the highest resolution (bottom right) it is more instructive to
view it as a function mapping D := [0, 1]² ⊂ R² into R³. This idea is summarized
in Figure 1b. Even if the original machine learning task presents as acting on a high
dimensional tensor of dimension proportional to N, it is worth considering whether
it may be formulated in the continuum limit N → ∞, conceiving of algorithms
in this setting, and only then approximating to finite dimension again to obtain
practical algorithms. These ideas are illustrated in Figures 2a and 2b.
Figure 2. (a) Directly design algorithm at fixed resolution N. (b) Design algorithm at the limit of infinite resolution.
1.2. Literature Review. We give a brief overview of the literature in this field;
greater depth, and more citations, are given in subsequent sections.
2. Operator Learning
In subsection 2.1 we define supervised learning, followed in subsection 2.2 by
discussion of the topic in the specific case of operator learning. Subsection 2.3 is
devoted to explaining how the approximate operator is found from data (training)
and how it is evaluated (testing). Subsection 2.4 describes how latent structure can
be built into operator approximation, and learned from data. Subsection 2.5 con-
tains an example from parametric partial differential equations (PDEs) describing
flow in a porous medium.
2.1. Supervised Learning. The objective of supervised learning is to determine
an underlying mapping Ψ† : U → V from samples2
(2.1)    {u_n, Ψ†(u_n)}_{n=1}^N,    u_n ∼ µ.
Here the probability measure µ is supported on U. Often supervised learning is
formulated by use of the data model
(2.2)    {u_n, v_n}_{n=1}^N,    (u_n, v_n) ∼ π,
where the probability measure π is supported on U × V. The data model (2.1) is a
special case which is sufficient for the exposition in this article.
In the original applications of supervised learning U = R^{d_x} and V = R^{d_y} (regression)
or V = {1, . . . , K} (classification). We now go beyond this setting.
2.2. Supervised Learning of Operators. In many applications arising in science
and engineering it is desirable to consider a generalization of the finite-dimensional
setting to separable Banach spaces U, V of vector-valued functions:
U = {u : D_x → R^{d_i}},    D_x ⊆ R^{d_x},
V = {v : D_y → R^{d_o}},    D_y ⊆ R^{d_y}.
2Note that from now on N denotes the data volume (and not the size of a finite dimensional
problem as in subsection 1.1).
2.3. Training and Testing. The data (2.1) is used to train the model Ψ; that
is, to determine a choice of θ. To this end we introduce an error, or relative error,
measure r : V ′ × V ′ → R+ . Here V ′ is another Banach space containing the range
of Ψ† and Ψ( · ; θ). Typical choices for r include the error
(2.3)    r(v_1, v_2) = ∥v_1 − v_2∥_{V′}
and, for ε ∈ (0, ∞), one of the relative errors
(2.4)    r(v_1, v_2) = ∥v_1 − v_2∥_{V′} / (ε + ∥v_1∥_{V′}),    or    r(v_1, v_2) = ∥v_1 − v_2∥_{V′} / max{ε, ∥v_1∥_{V′}}.
Now let µ^N be the empirical measure
µ^N = (1/N) Σ_{n=1}^{N} δ_{u_n}.
Then the parameter θ⋆ is determined from
θ⋆ = argmin_θ R_N(θ),    R_N(θ) := E^{u∼µ^N} [ r(Ψ†(u), Ψ(u; θ))^q ],
for some positive q, typically q = 1. The function R_N(θ) is known as the empirical risk;
also of interest is the expected (or population) risk
R_∞(θ) := E^{u∼µ} [ r(Ψ†(u), Ψ(u; θ))^q ].
Note that R∞ (θ) requires knowledge of data in the form of the entire probability
measure µ.
Once trained, models are typically tested by evaluating the error
error := E^{u∼µ′} [ r(Ψ†(u), Ψ(u; θ))^q ].
Here µ′ is defined on the support of µ. For computational purposes the measure µ′
is often chosen as another empirical approximation of µ, independently of µ^N; other
empirical measures may also be used. For theoretical analyses µ′ may be chosen
equal to µ, but other choices may also be of interest in determining the robustness
of the learned model; see, for example, [13].
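To make the training and testing quantities above concrete, the following sketch (Python/NumPy, with a toy antiderivative operator and hypothetical names such as Psi_model standing in for any parametrized architecture) evaluates the relative error (2.4), the empirical risk R_N, and a held-out test error for functions represented by their values on a grid.

import numpy as np

def rel_error(v1, v2, h, eps=1e-8):
    # relative error r(v1, v2) = ||v1 - v2|| / (eps + ||v1||), with the L2 norm on the
    # grid approximated by a simple quadrature rule with weight h (the grid spacing)
    num = np.sqrt(h * np.sum((v1 - v2) ** 2))
    den = eps + np.sqrt(h * np.sum(v1 ** 2))
    return num / den

def risk(Psi_true, Psi_model, theta, samples, h, q=1):
    # Monte Carlo estimate of E[ r(Psi_true(u), Psi_model(u; theta))^q ] over the samples;
    # with the training samples this is the empirical risk R_N(theta)
    return np.mean([rel_error(Psi_true(u), Psi_model(u, theta), h) ** q for u in samples])

# toy example: the target operator maps u to its antiderivative on [0, 1]
M = 64
x = np.linspace(0.0, 1.0, M)
h = x[1] - x[0]
Psi_true = lambda u: h * np.cumsum(u)          # crude quadrature of the antiderivative
Psi_model = lambda u, theta: theta @ u         # linear surrogate model (hypothetical)

rng = np.random.default_rng(0)
train = [rng.standard_normal(M) for _ in range(32)]   # draws standing in for u_n ~ mu
test = [rng.standard_normal(M) for _ in range(32)]    # independent draws (mu')

theta = h * np.tril(np.ones((M, M)))           # parameters set by hand, for illustration only
print("empirical risk:", risk(Psi_true, Psi_model, theta, train, h))
print("test error    :", risk(Psi_true, Psi_model, theta, test, h))

Here θ is fixed by hand purely for illustration; in practice it would be found by (approximately) minimizing the empirical risk, as described above.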
2.4. Finding Latent Structure. Behind many neural operators is the extraction
of latent finite dimensional structure, as illustrated in Figure 3. Here we have two
encoder/decoder pairs on U and V, namely
G_U ∘ F_U ≈ I_U,    G_V ∘ F_V ≈ I_V,
where I_U, I_V are the identity maps on U and V respectively. Then φ is chosen so
that
G_V ∘ φ ∘ F_U ≈ Ψ†.
The map F_U maps the input Banach space to a finite dimensional latent space, while
the map G_V maps from a second finite dimensional latent space back to the output
Banach space. These encoder/decoder pairs can be learned, reducing
the operator approximation to a finite dimensional problem.
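A minimal sketch of this encode/map/decode structure, under the assumption of fixed (not learned) linear encoder/decoder pairs built from trigonometric bases and a stand-in map φ, is the following; all names and sizes are illustrative.

import numpy as np

M, dU, dV = 128, 16, 16
x = np.linspace(0.0, 1.0, M)

# fixed orthonormal bases used to define linear encoder/decoder pairs (illustrative choice)
basis_U = np.stack([np.sqrt(2) * np.sin(np.pi * (k + 1) * x) for k in range(dU)])  # (dU, M)
basis_V = np.stack([np.sqrt(2) * np.cos(np.pi * (k + 1) * x) for k in range(dV)])  # (dV, M)

F_U = lambda u: basis_U @ u / M     # encoder U -> R^{dU} (quadrature of basis coefficients)
G_U = lambda a: basis_U.T @ a       # decoder R^{dU} -> U, so that G_U(F_U(u)) ≈ u on the span
F_V = lambda v: basis_V @ v / M
G_V = lambda b: basis_V.T @ b

def phi(a):
    # stand-in for the learned finite-dimensional map R^{dU} -> R^{dV}
    return np.tanh(a)

def Psi_approx(u):
    # the approximation G_V o phi o F_U ≈ Psi^dagger
    return G_V(phi(F_U(u)))

u = np.sin(2.0 * np.pi * x)
print(np.linalg.norm(u - G_U(F_U(u))) / np.linalg.norm(u))   # encoding error for this u
print(Psi_approx(u).shape)                                   # (M,)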
3We use the notation a for input functions here, because it is a commonly adopted notation
in applications to porous medium flow.
sensor points x_ℓ. Another alternative of note is active subspaces [147], which
combine information about the input distribution and the forward mapping through
its gradient. This approach has been explored for function encoding in scientific
ML in [93, 102].
The decoder in the DeepONet architecture is given by expansion with respect to
a neural network basis. Given a neural network ψ : D_y × Θ_ψ → R^{d_V}, which defines
a parametrized function from the domain D_y of the output functions to R^{d_V}, the
DeepONet decoder is defined as
G_V : R^{d_V} → V,    G_V(α) = Σ_{j=1}^{d_V} α_j ψ_j.
The above encoder and decoders on U and V are combined with a finite-dimensional
neural network α : R^{d_U} × Θ_α → R^{d_V}, to define the parametrized DeepONet,
Ψ_DEEP(u; θ)(y) = Σ_{j=1}^{d_V} α_j(Lu; θ_α) ψ_j(y; θ_ψ),    ∀ u ∈ U, y ∈ D_y.
This architecture is specified by choice of the linear encoding L, and choice of neural
network architectures for α and ψ. Following [92] the network α is conventionally
referred to as the “branch-net” (often denoted b or β), while ψ is referred to as the
“trunk-net” (often denoted t or τ). The combined parameters θ = {θ_α, θ_ψ} of these
neural networks are learned from data of input- and output-functions.
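The following sketch assembles an (untrained) DeepONet from random branch and trunk networks, with the linear encoding L given by point evaluation at d_U sensor locations; widths, weights and sensor placement are illustrative choices made here, not those of [92].

import numpy as np

rng = np.random.default_rng(0)

def mlp(widths, rng):
    # random (untrained) fully connected network parameters; tanh activations
    return [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
            for n, m in zip(widths[:-1], widths[1:])]

def forward(params, z):
    for i, (W, b) in enumerate(params):
        z = W @ z + b
        if i < len(params) - 1:
            z = np.tanh(z)
    return z

dU, dV = 32, 20
branch = mlp([dU, 64, 64, dV], rng)     # alpha: R^{dU} -> R^{dV}  ("branch net")
trunk = mlp([1, 64, 64, dV], rng)       # psi:   D_y   -> R^{dV}  ("trunk net")

def deeponet(u_sensors, y):
    # Psi_DEEP(u)(y) = sum_j alpha_j(L u) psi_j(y)
    alpha = forward(branch, u_sensors)
    psi = forward(trunk, np.atleast_1d(y))
    return float(alpha @ psi)

# linear encoder L: point evaluations of u at dU sensor locations
x_sensors = np.linspace(0.0, 1.0, dU)
u = np.sin(2.0 * np.pi * x_sensors)     # the encoded input Lu
print(deeponet(u, 0.3))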
3.3. FNO. The Fourier Neural Operator (FNO) architecture was introduced in
[68,83]. In contrast to the PCA-Net and DeepONet architectures above, FNO is not
based on an approach that combines an encoding/decoding to a finite-dimensional
latent space with a finite-dimensional neural network. Instead, neural operators
such as the FNO generalize the structure of finite-dimensional neural networks to a
function space setting, as summarized below. We will assume that the domain of the
input and output functions Dx = [0, 2π]d can be identified with the d-dimensional
periodic torus.
FNO is an operator architecture of the form,
Ψ_FNO(u; θ) = Q ∘ L_L ∘ · · · ∘ L_2 ∘ L_1 ∘ R(u),    ∀ u ∈ U,
L_ℓ(v)(x; θ) = σ( W_ℓ v(x) + b_ℓ + K(v)(x; γ_ℓ) ).
It comprises input and output layers Q, R, given by a pointwise composition
with either a shallow neural network or a linear transformation, and several hidden
layers L_ℓ.
Upon specification of a “channel width” d_c, the ℓ-th hidden layer takes as
input a vector-valued function v : x ↦ v(x) ∈ R^{d_c} and outputs another vector-
valued function L_ℓ(v) : x ↦ L_ℓ(v)(x) ∈ R^{d_c}.4 Each hidden layer involves an affine
transformation
v(x) ↦ w(x) := W_ℓ v(x) + b_ℓ + K(v)(x; γ_ℓ),
and a pointwise composition with a standard activation function
w(x) ↦ σ(w(x)),
where σ could e.g. be the rectified linear unit or a smooth variant thereof.
In the affine transformation, the matrix-vector pair (W_ℓ, b_ℓ) defines a pointwise
affine transformation of the input v(x), i.e. multiplication by the matrix W_ℓ and
addition of the bias b_ℓ.
4Channel width can change from layer to layer; we simplify the exposition by fixing it.
Here, the inner sum represents the action of κ̂ = F(κ) on F(v), and the outer sum
is the inverse Fourier transform F^{-1}. The Fourier coefficients κ̂_{k,ij} represent the
tunable parameters of the convolutional operator. In a practical implementation,
a Fourier cut-off k_max is introduced and the sum over k is restricted to Fourier
wavenumbers |k|_{ℓ∞} ≤ k_max, with |·|_{ℓ∞} the ℓ∞-norm, resulting in a finite number
of tunable parameters γ_ℓ = {κ̂_{k,ij} : |k|_{ℓ∞} ≤ k_max, i, j = 1, . . . , d_c}.
To summarize, the FNO architecture is defined by
(1) an input layer R, given by pointwise composition of the input function with
a shallow neural network or a linear transformation,
(2) hidden layers L_1, . . . , L_L involving, for each ℓ = 1, . . . , L, a matrix W_ℓ, bias
b_ℓ and convolutional operator K( · ; γ_ℓ) with parameters γ_ℓ identified with
the corresponding Fourier multipliers κ̂_{k,ij},
(3) an output layer Q, given by pointwise composition with a shallow neural
network or a linear transformation.
The composition of these layers defines a parametrized operator u ↦ Ψ_FNO(u; θ),
where θ collects parameters from (1), (2) and (3). The parameters contained in
θ are to be trained from data. The hyperparameters of FNO include the channel
width dc , the Fourier cut-off kmax , the depth L and additional hyperparameters
specifying the input and output layers R and Q.
In theory, the FNO is formulated directly on function space and does not involve
a reduction to a finite-dimensional latent space. In a practical implementation,
it is usually discretized by identifying the input and output functions with their
point-values on an equidistant grid. In this case, the discrete Fourier transform can
be conveniently evaluated using the fast Fourier transform algorithm (FFT), and
straightforward implementation in popular deep learning libraries is possible.
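As an illustration of this discrete picture, the sketch below implements a single Fourier layer on a one-dimensional periodic grid using NumPy's FFT; shapes and parameter values are hypothetical, and a production implementation would typically use the real FFT, enforce conjugate symmetry of the multipliers, and batch over samples.

import numpy as np

def spectral_conv(v, kappa_hat, k_max):
    # Action of the convolution K(v) on a periodic 1d grid: forward FFT, multiplication of
    # each retained Fourier mode by its d_c x d_c multiplier, inverse FFT.
    # v: (n, d_c) grid values; kappa_hat: (2*k_max + 1, d_c, d_c) complex multipliers.
    v_hat = np.fft.fft(v, axis=0)
    out_hat = np.zeros_like(v_hat)
    for i, k in enumerate(range(-k_max, k_max + 1)):
        out_hat[k] = kappa_hat[i] @ v_hat[k]   # negative k wraps around, matching FFT ordering
    # taking the real part stands in for enforcing conjugate symmetry of the multipliers
    return np.real(np.fft.ifft(out_hat, axis=0))

def fno_layer(v, W, b, kappa_hat, k_max):
    # L(v)(x) = sigma(W v(x) + b + K(v)(x)), here with sigma = ReLU
    return np.maximum(0.0, v @ W.T + b + spectral_conv(v, kappa_hat, k_max))

rng = np.random.default_rng(0)
n, d_c, k_max = 128, 8, 12
v = rng.standard_normal((n, d_c))
W = rng.standard_normal((d_c, d_c)) / d_c
b = np.zeros(d_c)
kappa_hat = (rng.standard_normal((2 * k_max + 1, d_c, d_c))
             + 1j * rng.standard_normal((2 * k_max + 1, d_c, d_c))) / d_c
print(fno_layer(v, W, b, kappa_hat, k_max).shape)   # (128, 8)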
3.4. Random Features Method. The operator learning architectures above are
usually trained from data using stochastic gradient descent. Whilst this shows
great empirical success, the inability to analyze the optimization algorithms used
by practitioners makes it difficult to make definitive statements about the networks
that are trained in practice. The authors of [101] have proposed a randomized
alternative, by extending the random features methodology [114] to a function space
setting; this methodology has the advantage of being trainable through solution of
a quadratic optimization problem.
The random feature model (RFM) requires specification of a parametrized op-
erator ψ : U × Γ → V with parameter set Γ, and a probability measure ν on the
3.5. Discussion. The architectures above can be roughly divided into two cate-
gories, depending on how the underlying ideas from deep learning are leveraged to
define a parametrized class of mappings on function space.
Encoder-Decoder Network Structure. The first approach, which we refer
to as encoder-decoder-net and which includes the PCA-Net and DeepONet ar-
chitectures, involves three steps: first, the input function is encoded by a finite-
dimensional vector; second, an ordinary neural network, such as a fully connected
or convolutional neural network, is employed to map the encoded input to a finite-
dimensional output; third, a decoder maps the finite-dimensional output to an
output function in the infinite-dimensional function space.
This approach is very natural from a numerical analysis point of view, sharing
the basic structure of many numerical schemes, such as finite element methods
(FEM), finite volume methods (FVM), and finite difference methods (FDM), as
illustrated in Table 1. From this point of view, encoder-decoder-nets mainly dif-
fer from standard numerical schemes by replacing the hand-crafted algorithm and
choice of numerical discretization by a data-driven algorithm encoded in the weights
and biases of a neural network, and the possibility for a data-driven encoding and
reconstruction. While appealing, such structure yields approximations within a
fixed, finite-dimensional, linear subspace of V. In particular, each output function
from the approximate operator belongs to this linear subspace independently of the
input function. Therefore these methods fall within the category of linear approx-
imation, while methods for which outputs lie on a nonlinear manifold in V lead
to what is known as nonlinear approximation. The benefits of nonlinear approximation
are well understood in the context of functions [38]; for operators, however,
results are still sparse, though benefits have been observed in some specific cases
[27, 69, 75, 79]. The FNO [83] and random features [101] are concrete ex-
amples of operator learning methodologies for which the outputs lie on a nonlinear
manifold in V.
Neural Operators Generalizing Neural Network Structure. A second ap-
proach to defining a parametrized class of operators on function space, distinct from
encoder-decoder-nets, is illustrated by FNO. Following this approach, the structure
of neural networks, which consist of an alternating composition of affine and nonlin-
ear layers, is retained and generalized to function space. Nonlinearity is introduced
via composition with a standard activation function, such as the rectified linear unit
or smooth variants thereof. The affine layers are obtained by integrating the in-
put function against an integral kernel; this introduces nonlocality which is clearly
needed if the architecture is to benefit from universal approximation.
Optimization and Randomization. The random features method [101] can
in principle be combined with any operator learning architecture. The random
features approach opens up a less explored direction of combining optimization with
randomization in operator learning. In contrast to optimization of all parameters
(by gradient descent) within a neural operator approach, the RFM allows for in-
depth analysis resulting in error and convergence guarantees that take into account
the finite number of samples, the finite number of parameters and the optimization
[76]. One interesting, and largely unresolved, question is how to design efficient
random features for the operator learning setting.
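The following sketch illustrates the random features idea in a fully discretized setting: random, untrained feature maps are fixed, and only the linear coefficients are found by solving a least-squares (quadratic) problem. The tanh-filter features and the toy target operator are illustrative choices made here, not the features proposed in [101].

import numpy as np

rng = np.random.default_rng(0)
n_grid, n_feat, n_train = 64, 200, 100

# random feature maps psi(.; gamma_j): random linear filters followed by a pointwise tanh
gammas = [rng.standard_normal((n_grid, n_grid)) / np.sqrt(n_grid) for _ in range(n_feat)]
def feature(u, gamma):
    return np.tanh(gamma @ u)                    # R^{n_grid} -> R^{n_grid}

Psi_true = lambda u: np.cumsum(u) / n_grid       # toy target operator (antiderivative)

U_train = [rng.standard_normal(n_grid) for _ in range(n_train)]
V_train = [Psi_true(u) for u in U_train]

# only the coefficients c are trained: solve the least-squares (quadratic) problem
#   sum_j c_j psi(u_n; gamma_j) ~ Psi_true(u_n)  for all training samples n
A = np.vstack([np.column_stack([feature(u, g) for g in gammas]) for u in U_train])
b = np.concatenate(V_train)
c = np.linalg.lstsq(A, b, rcond=None)[0]

def rfm(u):
    return sum(cj * feature(u, g) for cj, g in zip(c, gammas))

u_test = rng.standard_normal(n_grid)
rel = np.linalg.norm(rfm(u_test) - Psi_true(u_test)) / np.linalg.norm(Psi_true(u_test))
print("relative test error:", rel)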
The RFM is closely related to kernel methods which have a long pedigree in
machine learning. In this context, we mention a related kernel-based approach
proposed in [12], which employs kernel methods for operator learning within the
encoder-decoder-net paradigm. This approach has been shown to be competitive on
several benchmark operator learning problems, and has been analyzed in [12].
Other Approaches. This review is mostly focused on methods that fall into
one of the neural network-based approaches above, but it should be emphasized
that other approaches are being actively pursued with success. Without aiming to
present an exhaustive list, we mention nonlinear reduced-order modeling [79,87,111,
112, 134], approaches based on the theory of Koopman operators [81, 99, 108, 145],
work aiming to augment and speed up numerical solvers [131], or work on data-
driven closure modeling [58, 91, 139–141], to name just a few examples. For a
broader overview of other approaches to machine learning for PDEs, we refer to the
recent review [20]. While most “operator learning” is focused on operators mapping
between functions with spatial or spatio-temporal dependence and often arising
in connection with PDEs, we note that problems involving time-series represent
another important avenue of machine learning research, which can also be viewed
from, and may benefit from, the continuous viewpoint [77].
4. Universal Approximation
The goal of the methodologies summarized in the last section is to approximate
operators mapping between infinite-dimensional Banach spaces of functions. The
first theoretical question to be addressed is whether these methods can, even in
principle, achieve this task. The goal of this line of research is to identify classes of
operators for which operator learning methodologies possess a universal approximation
property, i.e. the ability to approximate a wide class of operators to any
given accuracy, in the absence of any constraints on model size, the number of
data samples, or the optimization.
Theorem 4.1. Suppose that U, V are separable Banach spaces with the approx-
imation property. Let Ψ† : U → V be a continuous operator. Fix a compact set
K ⊂ U. Then for any ϵ > 0, there exist positive integers d_U, d_V, bounded linear
maps F_U : U → R^{d_U}, G_V : R^{d_V} → V, and a function α ∈ C(R^{d_U}; R^{d_V}), such that
sup_{u∈K} ∥Ψ†(u) − (G_V ∘ α ∘ F_U)(u)∥_V ≤ ϵ.
♢
A Banach space is said to have the approximation property if, over any compact
set, the identity map can be resolved as the limit of finite rank operators [86]. Al-
though it is a fundamental property useful in many areas in approximation theory,
it is not satisfied by all separable Banach spaces [43]. However, many of the Ba-
nach spaces used in PDE theory and numerical analysis such as Lebesgue spaces,
Sobolev spaces, Besov spaces, and spaces of continuously differentiable functions
all possess the approximation property [68, Lemma 26]. The above therefore covers
a large range of scenarios in which encoder-decoder-nets can be used. We now give
examples of encoder-decoder-nets where we fix the exact functional form of FU and
GV and show that universal approximation continues to hold.
4.1.1. Operator Network and DeepONet. The specific form of operator networks as
introduced and analysed by Chen and Chen in [24] focuses on scalar-valued input
and output functions. We will state the main result of [24] in this setting for
♢
Here, we can identify the linear encoder Lu = (u(x_1), . . . , u(x_{d_U})), the shallow
branch-net α, with components
α_k(Lu) = Σ_{i=1}^{N} c_{ki} σ( Σ_{j=1}^{d_U} ξ^k_{ij} u(x_j) + b_i ),
4.2. Neural Operators. The Fourier neural operator (FNO) is a specific instance
of a more general notion of neural operators [68]. The general structure of such
neural operators is identical to that of the FNO, i.e. a composition
(4.2)    Ψ_NO(u; θ) = Q ∘ L_L ∘ · · · ∘ L_2 ∘ L_1 ∘ R(u),    ∀ u ∈ U,
         L_ℓ(v)(x; θ) = σ( W_ℓ v(x) + b_ℓ + K(v)(x; γ_ℓ) ),
except that the convolutional operator in each layer is replaced by a more general
integral operator,
K(v)(x; γ) = ∫_D κ(x, y; γ) v(y) dy.
Here, the integral kernel κ(x, y; γ) is a matrix-valued function of x and y, parametrized
by γ. Additional nonlinear dependency on the input function is possible yielding,
for example, κ = κ(x, y, u(x), u(y); γ); this structure is present in transformer mod-
els [136]. Different concrete implementations of such neural operators mostly differ
in their choice of the parametrized integral kernel. For example, [83] uses a Fourier
basis, [135] a wavelet basis, and [16] spherical harmonics. Other approaches
restrict the support of κ [82] or assume it decays quickly away from its diago-
nal [71, 84]. For a more thorough review of this methodology, we refer to [68].
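For concreteness, a direct quadrature discretization of such a kernel integral layer might look as follows; the Gaussian kernel below is only a placeholder for whatever parametrized family is chosen, and all sizes are illustrative.

import numpy as np

def kernel(x, y, gamma):
    # matrix-valued kernel kappa(x, y; gamma) in R^{d_c x d_c}; gamma = (A, length_scale)
    A, ell = gamma
    return np.exp(-((x - y) ** 2) / (2.0 * ell ** 2)) * A

def integral_operator(v, grid, gamma):
    # K(v)(x) = ∫_D kappa(x, y; gamma) v(y) dy, approximated by a uniform quadrature rule
    # v: (n, d_c) values of the input function on the grid; returns K(v) on the same grid
    n, d_c = v.shape
    h = grid[1] - grid[0]                       # quadrature weight (uniform grid)
    out = np.zeros_like(v)
    for i, x in enumerate(grid):
        out[i] = h * sum(kernel(x, y, gamma) @ v[j] for j, y in enumerate(grid))
    return out

rng = np.random.default_rng(0)
n, d_c = 64, 4
grid = np.linspace(0.0, 1.0, n)
gamma = (rng.standard_normal((d_c, d_c)) / d_c, 0.1)
v = rng.standard_normal((n, d_c))
print(integral_operator(v, grid, gamma).shape)   # (64, 4)

The O(n²) cost of this direct evaluation is one motivation for structured choices such as the Fourier multipliers of the FNO, whose action can be computed in O(n log n) via the FFT.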
The universality of the FNO was first established in [67], using ideas from
Fourier analysis and, in particular, building on the density of Fourier series to show
that FNO can approximate a wide variety of operators. Given the great variety
of possible alternative neural operator architectures, which differ from the FNO
essentially only in their choice of the parametrized kernel, a proof of universality
that does not explicitly rely on Fourier analysis, and which applies to a wide range
of choices for the integral kernel, is desirable. This has been accomplished in [73],
where the authors remove from the FNO all non-essential components, from the
perspective of universal approximation, yielding a minimal architecture termed the
“averaging neural operator” (ANO).
♢
As an immediate consequence, we have the following corollary which implies
universality of neural operators for a wide range of choices:
Corollary 4.6. Consider any neural operator architecture of the form (4.2) with
parametrized integral kernel κ(x, y; γ). If for any channel dimension dc and matrix
V ∈ Rdc ×dc , there exists a parameter γV such that κ(x, y; γV ) ≡ V , then the neural
operator architecture is universal in the sense of Theorem 4.5. ♢
Figure 5. The shallow ANO as a composition: the input u(x) ∈ U is mapped via ϕ (and spatial averaging) to α(u) ∈ R^{d_c}, and then via ψ to the output Ψ(u)(x) ∈ V.
The last corollary applies in particular to the FNO. Universality of the FNO was
first established in [67], there restricting attention to a periodic setting but allowing
for operators mapping between general Sobolev spaces. The idea of the averaging
neural operator can be found in [73], where it was used to prove universality for a
wide range of neural operator architectures, and for a class of operators mapping
between general Sobolev spaces, or spaces of continuously differentiable functions.
Intuition. We would like to provide further intuition for the universality of ANO.
A simple special case of the ANO is the two-layer ANO obtained as follows. We
consider neural operators which can be written as a composition of two shallow
neural networks ϕ : R^{d_i} × D → R^{d_c}, ψ : R^{d_c} × D → R^{d_o} and an additional integral
(average):
α(u) = ∫_D ϕ(u(y), y) dy,
Ψ(u)(x) = ψ(α(u), x),
where
ϕ(w, x) = C_1 σ(A_1 w + B_1 x + d_1),
and
ψ(z, x) = C_2 σ(A_2 z + B_2 x + d_2).
Note that the above composition, i.e.
Ψ(u)(x) = ψ( ∫_D ϕ(u(y), y) dy, x ),
is a special case of the neural operator form (4.2); in particular, the output layer is
Q(v)(x) = Qv(x),    Q = C_2.
As depicted in Figure 5, such a shallow ANO can be interpreted as a composition
of a nonlinear encoder α : U → R^{d_c}, u ↦ α(u), defined via spatial averaging
of ϕ and mapping the input function to a finite-dimensional latent space, and a
nonlinear decoder ψ : R^{d_c} → V, α ↦ ψ(α, · ). This interpretation opens up a path
for analysis, based on which universality can be established for the ANO, and any
neural operator architecture that contains such ANO as a special case, such as the
Fourier neural operator.
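A sketch of this two-layer ANO, with random untrained weights and the integral over D = [0, 1] replaced by a grid average, is given below; all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d_i, d_o, d_c, n = 1, 1, 16, 128
x = np.linspace(0.0, 1.0, n)                  # grid on D = [0, 1]

sigma = np.tanh
A1 = rng.standard_normal((d_c, d_i)); B1 = rng.standard_normal((d_c, 1)); d1 = rng.standard_normal(d_c)
C1 = rng.standard_normal((d_c, d_c)) / d_c
A2 = rng.standard_normal((d_c, d_c)) / d_c; B2 = rng.standard_normal((d_c, 1)); d2 = rng.standard_normal(d_c)
C2 = rng.standard_normal((d_o, d_c)) / d_c

def phi(w, y):   # phi(w, y) = C1 sigma(A1 w + B1 y + d1)
    return C1 @ sigma(A1 @ np.atleast_1d(w) + (B1 * y).ravel() + d1)

def psi(z, y):   # psi(z, y) = C2 sigma(A2 z + B2 y + d2)
    return C2 @ sigma(A2 @ z + (B2 * y).ravel() + d2)

def ano(u_vals, y):
    # alpha(u) = integral over D of phi(u(y'), y') dy', approximated by a grid average (|D| = 1)
    alpha = np.stack([phi(u_vals[j], x[j]) for j in range(n)]).mean(axis=0)
    return psi(alpha, y)                      # Psi(u)(y) = psi(alpha(u), y)

u_vals = np.sin(2.0 * np.pi * x)              # input function sampled on the grid
print(ano(u_vals, 0.25))                      # the output function evaluated at y = 0.25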
for some s ∈ R. Then the following theorem holds [36, Theorem 1.3].
Theorem 5.1. Suppose that for some α > 1/2 and p ∈ R, we have ⟨φ_j, Γφ_j⟩_U ≍
j^{−2α} and σ_j² ≍ j^{−2p}. Let α′ ∈ [0, α + 1/2) and assume that min{α, α′} + min{p −
1/2, s} > 0. Then, as N → ∞, we have
E Σ_{j=1}^{∞} j^{−2α′} |l_j − l_j†|² ≲ N^{−(α′ + min{p−1/2, s})/(α+p)},
where the expectation is taken over the product measure defining the input data
and noise. ♢
Theorem 5.1 quantifies the amount of data needed, on average, for the estimator
{l_j} to achieve ϵ-error in approximating {l_j†}, measured in a squared weighted ℓ²
norm. In particular, we have
N ∼ ϵ^{−(α+p)/(α′ + min{p−1/2, s})}.
The exact dependence of this rate on the parameters defining the smoothness of
the truth, the input, the prior, and the error metric elucidates the optimal design
choices for the estimator and sheds light on which pieces are most important for
the learning process. We refer to [36] for an in-depth discussion.
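As a purely illustrative reading of the theorem, the following snippet evaluates the error rate and the implied data requirement for some hypothetical values of (α, α′, p, s) satisfying the assumptions.

# hypothetical parameter values satisfying the assumptions of Theorem 5.1
alpha, p, alpha_prime, s = 2.0, 1.0, 1.0, 1.0

rate = (alpha_prime + min(p - 0.5, s)) / (alpha + p)       # error decays like N^{-rate}
for eps in (1e-1, 1e-2, 1e-3):
    N = eps ** (-(alpha + p) / (alpha_prime + min(p - 0.5, s)))
    print(f"eps = {eps:.0e}:  error ~ N^(-{rate:.2f}),  data requirement N ~ {N:.1e}")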
Within machine learning and functional data analysis, many works have focused
on learning integral kernel operators [1, 33, 59, 119, 138]. The operator learning prob-
lem can then be reduced to approximation of the kernel function and is typically
studied in a Reproducing Kernel Hilbert Space setting. In numerical PDEs, some
recent works have studied the problem of learning the Green’s function arising from
some elliptic, parabolic and hyperbolic problems [17–19, 137].
5.2. Holomorphic Operators. Going beyond the linear case, holomorphic op-
erators represent a very general class of operators for which efficient quantita-
tive error and complexity estimates can be established. We mention the influ-
ential work by Cohen, DeVore and Schwab [29, 30], as well as further develop-
ments [3–7, 25, 26, 52, 94, 103, 124, 125]. A detailed review, from 2015, can be found
in [28]. We will only describe a few main ideas. We mention in passing also the
works [70, 80], which study the neural network approximation of parametric oper-
ators in a related setting.
Holomorphic operators have mostly been studied in a parametrized setting,
where the input functions can be identified with the coefficients in a suitable ba-
sis (frame) expansion. A prototypical example is the elliptic Darcy flow equa-
tion (2.5), where the coefficient field a = a( · ; y) is parametrized by a sequence
y = (y_1, y_2, . . . ) ∈ [−1, 1]^ℕ, e.g. in the form of a linear expansion,
a(x; y) = ā(x) + Σ_{j=1}^{∞} y_j φ_j(x),    x ∈ D,
In the above prototypical setting, and assuming that the sequence b with coefficients
b_j = ∥φ_j∥_V decays sufficiently fast, it can be shown [28] that F possesses a holomorphic
extension to a subset of the infinite product ℂ^ℕ = ∏_{j=1}^{∞} ℂ = ℂ × ℂ × · · ·
of the complex plane ℂ. More precisely, there exists a holomorphic extension
(5.1)    F : ∏_{j=1}^{∞} E_{ρ_j} → V_ℂ,    z ↦ F(z),
where E_{ρ_j} ⊂ ℂ is the Bernstein ellipse with focal points ±1 and major and minor
semi-axis lengths ½(ρ_j ± ρ_j^{−1}), and V_ℂ is the complexification of the Banach space V.
In general, given a non-negative sequence b ∈ [0, ∞)^ℕ and ε > 0, a parametrized
operator F : [−1, 1]^ℕ → V is called (b, ε)-holomorphic if it possesses a holomorphic
extension (5.1) for any ρ = (ρ_1, ρ_2, . . . ) ∈ [1, ∞)^ℕ such that
(5.2)    Σ_{j=1}^{∞} ( (ρ_j + ρ_j^{−1})/2 − 1 ) b_j ≤ ε.
5.3. General (Lipschitz) Operators. The last two sections provide an overview
of theoretical results on the approximation of linear and holomorphic operators.
While these classes of operators include several operators of practical interest and
allow for the development of general approximation theory, not all operators of rel-
evance are holomorphic (or indeed linear). Examples of non-holomorphic operators
include the solution operator associated with nonlinear hyperbolic conservation
laws such as the compressible Euler equations. Solutions of such equations can
develop shocks in finite time, and it can be shown that the associated solution
operators themselves are not holomorphic. It is therefore of interest to extend the
approximation theory of operator learning frameworks beyond the restrictive class
of holomorphic operators.
A general and natural class of nonlinear operators is that of Lipschitz continuous
operators, the approximation theory of which has been considered from an
operator learning perspective e.g. in [14, 45, 47, 90, 123]. We present a brief outline
of the general approach and mention relevant work on model complexity estimates
in the following subsections 5.3.1–5.3.3. Relevant results on the data complexity of
operator learning in this setting are summarized in subsection 5.3.4.
While the encoder and decoder of these architectures perform dimension reduction,
the neural network ψ : R^{d_U} → R^{d_V} at the core of encoder-decoder-net
architectures is interpreted as approximating the resulting finite-dimensional function
φ : R^{d_U} → R^{d_V}. To summarize, an encoder-decoder-net can conceptually be
interpreted as involving three steps:
(1) Dimension reduction on the input space U ≈ R^{d_U},
(2) Dimension reduction on the output space V ≈ R^{d_V},
(3) Encoding of the operator Ψ†, yielding φ : R^{d_U} → R^{d_V}, approximated by a
neural network ψ : R^{d_U} → R^{d_V}.
Each part of this conceptual decomposition, the encoding of U ≈ R^{d_U}, the decoding
R^{d_V} ≈ V and the approximation ψ ≈ φ, represents a source of error, and the
total encoder-decoder-net approximation error E is bounded by three contributions,
E ≲ E_U + E_ψ + E_V, where
E_U = sup_u ∥u − G_U ∘ F_U(u)∥_U,
quantifies the encoding error, with the supremum taken over the relevant set of input
functions u,
E_V = sup_v ∥v − G_V ∘ F_V(v)∥_V,
quantifies the decoding error, and
(5.3)    E_ψ = sup_α ∥F_V ∘ Ψ† ∘ G_U(α) − ψ(α)∥
is the neural network approximation error.
Given this decomposition, the derivation of error and complexity estimates for
encoder-decoder-net architectures thus boils down to the estimation of the encoding
error E_U, the neural network approximation error E_ψ and the reconstruction error E_V,
respectively.
Encoding and Reconstruction Errors. Encoding and reconstruction errors
are relatively well understood on classical function spaces such as Sobolev and
Besov spaces, by various linear and nonlinear methods of approximation [40].
For linear encoder/decoder pairs, the analysis of encoding and reconstruction
errors amounts to principal component analysis (PCA) when measuring the error
in the Bochner norm L²_µ, or to Kolmogorov n-widths when measuring the error in
the sup-norm over a compact set. Relevant discussion of PCA in the context of
operator learning is given in [14] (see also [74] and [72]).
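A sketch of a PCA encoder/decoder pair and the resulting empirical encoding error, computed from samples of a toy random field, is given below; the sampling model, latent dimension and error metric are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_grid, n_samples, d_latent = 128, 500, 16
x = np.linspace(0.0, 1.0, n_grid)

# draw smooth random functions as samples from the input measure mu (a toy Karhunen-Loeve sum)
def sample_u():
    coeffs = rng.standard_normal(20) / (np.arange(1, 21) ** 2)
    return sum(c * np.sin(np.pi * k * x) for k, c in zip(range(1, 21), coeffs))

U = np.stack([sample_u() for _ in range(n_samples)])     # (n_samples, n_grid)
mean = U.mean(axis=0)

# PCA via the SVD of the centered data matrix; rows of Vt are the principal components
_, _, Vt = np.linalg.svd(U - mean, full_matrices=False)
components = Vt[:d_latent]                               # (d_latent, n_grid)

F_U = lambda u: components @ (u - mean)                  # encoder U -> R^{d_latent}
G_U = lambda a: mean + components.T @ a                  # decoder R^{d_latent} -> U

errs = [np.linalg.norm(u - G_U(F_U(u))) / np.linalg.norm(u) for u in U]
print("mean relative encoding error:", np.mean(errs))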
In certain settings, such as for PDEs with discontinuous output functions, it
has been shown [75] that relying on linear reconstruction imposes fundamental
limitations on the approximation accuracy of operator methodologies, which can be
overcome by methods with nonlinear reconstruction; specifically, it was shown both
theoretically and experimentally in [75] that FNO and shift-DeepONet, a variant
of DeepONet with nonlinear reconstruction, achieve higher accuracy than vanilla
DeepONet for prototypical PDEs with discontinuous solutions. We also mention
closely related work on the nonlinear manifold decoder architecture of [126].
Neural Network Approximation Error. At their core, encoder-decoder-net
architectures employ a neural network to approximate the encoded version FV ◦
Ψ† ◦ GU of the underlying operator Ψ† (cp. (5.3)). The practical success of these
frameworks thus hinges on the ability of ordinary neural networks to approximate
the relevant class of high-dimensional functions in the latent-dimensional spaces,
which are obtained through the encoding of such operators. While the empirical
success of neural networks in high-dimensional approximation tasks is undeniable,
our theoretical understanding and the mathematical foundation underpinning this
empirical success remains incomplete.
Remark 5.3. Note that the relevant dimension in the operator learning context
is the latent dimension d = d_U. Neglecting logarithmic terms, we note that each
component of the function G : R^{d_U} → R^{d_V} can be approximated individually by
a neural network of size at most O(ε^{−d_U/k}), and hence we expect that G can be
approximated to accuracy ε by a neural network ψ of size at most O(d_V ε^{−d_U/k}). ♢
Without aiming to provide a comprehensive overview of this very active research
direction on neural network approximation theory, adjacent to operator learning
theory, we mention that similar error and complexity estimates can also be obtained
on more general Sobolev spaces, e.g. [130, 144]. Lower bounds illuminating the
limitations of neural networks on model classes are for example discussed in [2, 15,
142]. Approximation rates leveraging additional structure beyond smoothness have
also been considered, e.g. compositional structure is explored in [96, 122, 128].
Non-standard Architectures and Hyperexpressive Activations. While the
general research area of neural network approximation theory is too broad to ade-
quately summarize here, we mention relevant work on hyperexpressive activations,
which can formally break the curse of dimensionality observed in Theorem 5.2; it has
been shown that neural networks employing non-standard activations can formally
achieve arbitrarily fast convergence on model function classes [85, 109, 129, 143], when the
complexity is measured in terms of the number of tunable parameters. This is not true
for the ReLU activation [142]. Another way to break the curse of dimensionality is
via architectures with non-standard “three-dimensional” structure [148].
While non-standard architectures can overcome the curse of dimensionality in
the sense that the number of parameters does not grow exponentially with d (or
is independent of d), it should be pointed out that this necessarily comes at the
expense of the number of bits that are required to represent each parameter in a
practical implementation. Indeed, from work on quantized neural networks [15]
(with arbitrary activation function), it can be inferred that the total number of
bits required to store all parameters in such architectures is lower bounded by the
Kolmogorov ε-entropy of the underlying model class; for the specific model class
W^{k,∞}([0, 1]^d), this entropy scales as ε^{−d/k}. Hence, architectures which achieve
error ε with a number of parameters that scales strictly slower than ε^{−d/k} must
do so at the expense of the precision that is required to represent each individual
parameter in a practical implementation, keeping the total number of bits above
the entropy limit. For related discussion, we e.g. refer to [144, section 7] or [130,
discussion on page 5]. Another implication of this fact is that the constructed
non-standard architectures are necessarily very sensitive to minute changes in the
network parameters.
5.3.2. Upper Complexity Bounds. Quantitative error estimates for operator learn-
ing based on the general approach outlined in the last subsection 5.3.1 have been
derived in a number of recent works [14, 45, 47, 56, 90, 95].
Relevant work. The two papers [14, 74], analyzing PCA-Net and DeepONet re-
spectively, both introduce a splitting of the error into encoder, neural network ap-
proximation and reconstruction errors. A similar error analysis is employed in [56]
for the so-called “basis operator network”, a variant of DeepONet. An in-depth analy-
sis of DeepONets with various encoder/decoder pairs, including generalization error
estimates, is given in [90]. Quantitative approximation error estimates for convo-
lution neural networks applied to operator learning are derived in [45]. General
error estimates motivated by infinite-dimensional dynamical systems in stochastic
analysis can be found in [47]. An alternative approach to operator learning with
explicit algorithms for all weights is proposed in [95], including error estimates for
this approach.
Alternative Decompositions. Finally, we point out that while the error de-
composition in encoding, neural network approximation and reconstruction errors
is natural, alternative error decompositions, potentially more fine-grained, are pos-
sible. We mention the work [105] which proposes a mimetic neural operator archi-
tecture inspired by the weak variational form of elliptic PDEs, discretized by the
finite-element method; starting from this idea, the authors arrive at an architecture
that can be viewed as a variant of DeepONet, including a specific mixed nonlin-
ear and linear branch network structure and a nonlinear trunk net. In this work,
an a priori error analysis is conducted, resulting in a splitting of the overall ap-
proximation error into numerical approximation, stability, training and quadrature errors
depending on the data-generation with a numerical scheme (no access to the actual
operator), the Lipschitz stability of the underlying operator, the finite number of
training samples and quadrature errors to approximate integrals, respectively.
5.3.3. Lower Complexity Bounds. Operator learning frameworks are based on neu-
ral networks and provide highly nonlinear approximation [38]. Despite their aston-
ishing approximation capabilities, even highly nonlinear approximation methods
have intrinsic limitations.
Combined Error Analysis for Encoder-Decoder-Nets. To illustrate some
of these intrinsic limitations, we first outline the combined error analysis that results
from the decomposition summarized in the last section. To this end, we combine
the encoding, reconstruction and neural network analysis to derive quantitative
error and complexity estimates within the encoder-decoder-net paradigm.
Firstly, under reasonable assumptions on the input functions, the encoding error
can often be shown to decay at an algebraic rate in d_U, e.g.
(5.4)    E_U ≲ d_U^{−α}.
For example, if we assume that the input functions are defined on a bounded domain
D ⊂ R^d and subject to a smoothness constraint such as a uniform bound on their k-
th derivative, then a decay rate α = k/d can be achieved (depending on the precise
setting); for dimension reduction by principal component analysis, the exponent
α instead relates to the decay rate of the eigenvalues of the covariance operator of
the input distribution.
Under similar assumptions on the set of output functions, depending on the
properties of the underlying operator Ψ† , the reconstruction error on the output
space often also decays algebraically,
(5.5)    E_V ≲ d_V^{−β},
where the decay rate β can e.g. be estimated in terms of the smoothness of the
output functions under Ψ† , or could be related to the decay of the PCA eigenvalues
of the output distribution (push-forward under Ψ† ).
Finally, given latent dimensions d_U and d_V, the size of the neural network ψ that
is required to approximate the encoded operator mapping G : R^{d_U} → R^{d_V}, with
NN approximation error bound
E_ψ ≤ ε,
roughly scales as (cp. Remark 5.3)
(5.6)    size(ψ) ∼ d_V ε^{−d_U/k},
when the only information on the underlying operator is captured by its degree of
smoothness k (k = 1 corresponding to Lipschitz continuity). Note that this is the
scaling consistent with Kolmogorov entropy bounds.
Given the error decomposition E ≲ E_U + E_ψ + E_V, we require each error contribution
individually to be bounded by ε. In view of (5.4) and (5.5), this can be
achieved provided that d_U ∼ ε^{−1/α}, d_V ∼ ε^{−1/β}. Inserting such a choice of d_U, d_V in
(5.6), we arrive at a neural network size of roughly the form
size(ψ) ∼ ε^{−1/β} ε^{−c ε^{−1/α}/k}.
In particular, we note the exponential dependence on ε^{−1}, resulting in a size estimate
(5.7)    size(ψ) ≳ exp( c ε^{−1/α} / k ).
As pointed out after (5.4), when the set of input functions consists of functions
defined on a d-dimensional domain with uniformly bounded s-th derivatives (in a
suitable norm), then we expect a rate α = s/d, in which case we obtain
(5.8)    size(ψ) ≳ exp( c ε^{−d/s} / k ).
For operator learning frameworks, this super-algebraic (even exponential) dependence
of the complexity on ε^{−1} has been termed the “curse of dimensionality”
in [67, 74] or, more recently, the “curse of parametric complexity” in [72]. The latter
term was introduced to avoid confusion, which may arise because in these operator
learning problems there is no fixed dimension to speak of. The curse of
parametric complexity can be viewed as an infinite-dimensional scaling limit of the
finite-dimensional curse of dimensionality, represented by the d_U-dependency of the
bound ε^{−d_U/k}, and arises from the finite-dimensional CoD by observing that the
required latent dimension d_U itself depends on ε, with scaling d_U ∼ ε^{−1/α}. We note
in passing that even if d_U ∼ log(ε^{−1}) were to scale only logarithmically in ε^{−1}, the
complexity bound implied by (5.6) would still be super-algebraic, consistent with
the main result of [72].
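The following back-of-the-envelope snippet evaluates the scaling (5.6) with d_U ∼ ε^{−1/α} and d_V ∼ ε^{−1/β} for some hypothetical parameter values, illustrating the super-algebraic growth discussed above.

import numpy as np

# hypothetical smoothness parameters
alpha, beta, k = 1.0, 1.0, 1.0
for eps in (0.5, 0.2, 0.1):
    d_U = int(np.ceil(eps ** (-1.0 / alpha)))      # d_U ~ eps^{-1/alpha}
    d_V = int(np.ceil(eps ** (-1.0 / beta)))       # d_V ~ eps^{-1/beta}
    size = d_V * eps ** (-d_U / k)                 # size(psi) ~ d_V * eps^{-d_U/k}, cf. (5.6)
    print(f"eps = {eps}:  d_U = {d_U}, d_V = {d_V}, size(psi) ~ {size:.3e}")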
Nonlinear n-width Estimates. The rather pessimistic complexity bound out-
lined in (5.7) is based on an upper bound on the operator approximation error E ,
and is not necessarily tight. One may therefore wonder if more careful estimates
could yield complexity bounds that do not scale exponentially in ε−1 .
In this context, we would like to highlight the early work on operator approx-
imation by Mhaskar and Hahm [97], which presents the first quantitative bounds for
the approximation of nonlinear functionals; most notably, this work identifies the
continuous nonlinear n-widths of spaces of Hölder continuous functionals defined on
L2 -spaces; it is shown that the relevant n-widths decay only (poly-)logarithmically
in n, including both upper and lower bounds.
We will presently state a simplified version of the main result of [97], and refer to
the original work for the general version. To this end, we recall that the continuous
nonlinear n-width d_n(K; ∥ · ∥_X) [39] of a subset K ⊂ X, with (X, ∥ · ∥_X) Banach, is
defined as the optimal reconstruction error
d_n(K; ∥ · ∥_X) = inf_{(a,M)} sup_{f∈K} ∥f − M(a(f))∥_X,
where the infimum is over all encoder/decoder pairs (a, M), consisting of a continuous
map a : K → R^n and a general map M : R^n → X.
To derive lower n-width bounds, we consider spaces of nonlinear Lipschitz functionals
Ψ† ∈ F_d, where d denotes the spatial dimension of the input functions.
More precisely, define
F_d = { Ψ† : L²([−1, 1]^d) → ℝ : ∥Ψ†∥_Lip ≤ 1 },
with
∥Ψ†∥_Lip := sup_{u∈L²} |Ψ†(u)| + sup_{u,v∈L²} |Ψ†(u) − Ψ†(v)| / ∥u − v∥_{L²}.
Given a smoothness parameter s > 0, we consider approximation of Ψ† ∈ F_d,
uniformly over a compact set of input functions K^s ⊂ L²([−1, 1]^d), obtained as
follows: we expand input functions f ∈ L²([−1, 1]^d) in a Legendre expansion,
f(x) = Σ_{k∈ℕ^d} f̂_k P_k(x), and consider functionals defined on the “Sobolev” ball,
K^s := { f ∈ L²([−1, 1]^d) : Σ_{k∈ℕ^d} |k|^{2s} |f̂_k|² ≤ 1 }.
It follows from the main results of [97] that the continuous nonlinear n-widths of
the set of functionals F_d decay only (poly-)logarithmically, as
d_n(F_d)_{C(K^s)} ∼ log(n)^{−s/d}.
In particular, achieving uniform approximation accuracy ε > 0 with a continuous
encoder/decoder pair (a, M) requires at least
(5.9)    n ≳ exp( c ε^{−d/s} )
parameters. This last lower bound should be compared with (5.8) (for k = 1);
the lower n-width bound (5.9) would imply (5.8) under the assumption that the
architecture of ψ = ψ( · ; θ) was fixed and assuming that the weight assignment
Ψ† ↦ θ_{Ψ†}, from the functional Ψ† to the optimal tuning of the neural network parameters
θ_{Ψ†}, was continuous. The latter assumption may not be satisfied if parameters
are optimized using gradient descent.
Curse of Parametric Complexity. Given the result outlined in the previous
paragraph, one may wonder if the pessimistic bounds (5.9) and (5.7), i.e. the “curse
of parametric complexity”, can be broken by (a) a discontinuous weight assignment,
and (b) an adaptive choice of architecture optimized for a specific Ψ†. This question
has been raised in [72, 78].
It turns out that, with operator learning frameworks such as DeepONet, FNO or
PCA-Net, and relying on standard neural network architectures, it is not possible
to overcome the curse of parametric complexity when considering approximation on
the full class of Lipschitz continuous or Fréchet differentiable operators. We mention
the following illustrative result for DeepONet, which follows from [78, Example
2.17]:
Fix α > 2 + k/d. Then for any r ∈ ℕ, there exists an r-times Fréchet differentiable
functional Ψ† : U → ℝ and constants c, ε̄ > 0, such that approximation to accuracy
ε ≤ ε̄ by a DeepONet Ψ : U → ℝ with ReLU activation,
(5.10)    sup_{u∈K} |Ψ†(u) − Ψ(u)| ≤ ε,
with linear encoder E and neural network ψ, implies the complexity bound
(5.11)    size(ψ) ≥ exp( c ε^{−1/(αr)} ).
Here c, ε̄ > 0 are constants depending only on k, α and r. ♢
As mentioned above, analogous lower complexity bounds can be obtained for
PCA-Net, Fourier neural operator and many other architectures, and the more
general version of this lower complexity bound applies to Sobolev input functions
and beyond. We refer to [78] for a detailed discussion.
Remark 5.5. In [78], no attempt was made to optimize the exponent of ε−1 in
(5.11). It would be interesting to know whether the appearance of the degree
of operator Fréchet differentiability, i.e. the parameter r, in the lower bound is
merely an artifact of the proof in [78]. The back-of-the-envelope calculation leading
to (5.7) indicates that r should not appear in the exponent, and that the factor
α = k/d + 2 + δ could be replaced by k/d + δ. ♢
Breaking the Curse of Parametric Complexity with Non-Standard Ar-
chitectures. As mentioned in a previous section, there exist non-standard neural
network architectures which either employ non-standard activations [109, 129, 143],
or impose a non-standard “three-dimensional” connectivity [148] which can over-
come the curse of dimensionality in finite dimensions. In particular, encoder-
decoder-nets based on such non-standard architectures can achieve neural network
approximation error Eψ ≤ ε with a complexity (as measured by the number of
tunable degrees of freedom) that grows much slower than the rough scaling we
considered in (5.6). Based on such architectures, it has recently been shown [123]
that DeepONets can achieve approximation rates for general Lipschitz and Hölder
continuous operators which break the curse of parametric complexity implied by
(5.11). In fact, such architectures achieve algebraic expression rate bounds for
general Lipschitz and Hölder continuous operators.
5.3.4. Sample Complexity Results. There is a rapidly growing body of work on the
approximation theory of operator learning with focus on parametric complexity.
The complementary question of the sample complexity of operator learning, i.e.
how many samples are needed to achieve a given approximation accuracy, has not
received as much attention. The work described in subsection 5.1 addresses this
question in the setting of learning linear operators, and the question is also ad-
dressed in subsection 5.2 for holomorphic operators. We now develop this subject
further. Of particular note in the general Lipschitz setting of this subsection is the
paper [90], as well as related recent work in [23], which studies the nonparametric er-
ror estimation of Lipschitz operators for general encoder-decoder-net architectures.
In [90], non-asymptotic upper bounds for the generalization error of empirical risk
minimizers on suitable classes of operator networks are derived. The results are
stated in a Bochner L2 (µ) setting with input functions drawn from a probability
measure µ, and variants are derived for general (Lipschitz) encoder/decoder pairs,
for fixed basis encoder/decoder pairs, and for PCA encoder/decoder pairs. The
analysis underlying the approximation error estimates in [90] is based on best-approximation
error estimates (parametric complexity), which are combined with statistical
learning theory to derive sample complexity bounds.
For detailed results applicable to more general settings, we refer the reader
to [90]. Here, we restrict attention to a representative result for fixed basis en-
coder/decoder pair [90, Corollary 3], obtained by projection onto a trigonometric
basis.
To state this simplified result, consider a Lipschitz operator Ψ† : U → V, map-
ping between spaces U, V = L²([−1, 1]^d). We assume that there exists a constant
C > 0, such that the probability measure µ ∈ P(U) and its push-forward
Ψ†_# µ ∈ P(V) are supported on periodic, continuously differentiable functions belonging
to the set
K := { u ∈ L²([−1, 1]^d) : u is periodic, ∥u∥_{C^{k,α}} ≤ C }.
the empirical generalization error and the proposed capacity of FNO, in numerical
experiments. Out-of-distribution risk bounds for neural operators with focus on
the Helmholtz equation are discussed in depth in [13].
We finally mention the recent work [100], where a connection is made between the
number of available samples n and the required size of the DeepONet reconstruction
dimension d_V. It is shown that when only noisy measurements are available, a
scaling of the number of trunk basis functions d_V ≳ √n is required to achieve
accurate approximation.
5.4.1. Discussion. Ultimately, the overarching theme behind many of the above
cited results is that neural operators, or neural networks more generally, can effi-
ciently emulate numerical algorithms, which either result from bespoke numerical
methods or are a consequence of explicit representation formulae. The total complexity
of a neural network emulator, as reflected by its size, is composed of the
complexity of the emulated numerical algorithm and an additional overhead cost
of emulating this algorithm by a neural network (translation to neural network
weights). From an approximation-theoretic point of view, it could be conjectured
that, for a suitable definition of “numerical algorithm”, neural networks can effi-
ciently approximate all numerical algorithms, hence implying efficient approxima-
tion of a great variety of operators, excluding only those operators for which no
efficient numerical algorithms exist. Formalizing a suitable notion of numerical
algorithm and proving that neural networks can efficiently emulate any such algo-
rithm would be of interest and could provide a general way for proving algebraic
expression rate bounds for a general class of operators that can be approximated
by a numerical method with algebraic memory and run-time complexity (i.e. any
“reasonable” approximation method).
6. Conclusions
Neural operator architectures employ neural networks to approximate nonlinear
operators mapping between Banach spaces of functions. Such operators often arise
from physical models which are expressed as partial differential equations (PDEs).
Despite their empirical success in a variety of applications, our theoretical under-
standing of neural operators remains incomplete. This review article summarizes
recent progress and the current state of our theoretical understanding of neural
operators, focusing on an approximation theoretic point of view.
The starting point of the theoretical analysis is universal approximation. Very
general universal approximation results are now available for many of the proposed
neural operator architectures. These results demonstrate that, given a sufficient
number of parameters, neural operators can approximate a very wide variety of
infinite-dimensional operators, providing a theoretical underpinning for diverse ap-
plications. Such universal approximation is arguably a necessary but not sufficient
condition for the success of these architectures. In particular, universal approxi-
mation is inherently qualitative and does not guarantee that approximation to a
desired accuracy is feasible at a practically realistic model size.
A number of more recent works thus aim to provide quantitative bounds on the
required model size and the required number of input-/output-samples to achieve
a desired accuracy ϵ. Most such results consider one of three settings: general Lip-
schitz (or Fréchet differentiable) operators, holomorphic operators, or specific PDE
operators. While Lipschitz operators are a natural and general class to consider,
it turns out that approximation to error ϵ with standard architectures requires an
exponential (in ϵ^{−1}) number of tunable parameters, bringing into question whether
operator learning at this level of generality is possible. In contrast, the class of
holomorphic operators allows for complexity bounds that scale only algebraically
in ϵ^{−1}, both in terms of model size and sample complexity. Holomorphic
operators represent a general class of operators of practical interest, for which rigor-
ous approximation theory has been developed building on convergent (generalized)
polynomial expansions.
Going beyond notions of operator smoothness, it has been shown that operator
learning frameworks can leverage intrinsic structure of (PDE-) operators to achieve
algebraic convergence rates in theory; this intrinsic structure is distinct from holo-
morphy. Available results in this direction currently rely on a case-by-case analysis
and often leverage emulation of traditional numerical methods. The authors of the
present article view the development of a general approximation theory, including a
characterization of the relevant structure that can be leveraged by neural operators,
as one of the great challenges of this field.
Acknowledgments. The authors are grateful to Dima Burov and Edo Calvello
for creating Figures 1a and 1b, 2a, 2b respectively. They are also grateful to Nick
Nelsen for reading, and commenting on, a draft of the paper. NBK is grateful to
the NVIDIA Corporation for full time employment. SL is grateful for support from
the Swiss National Science Foundation, Postdoc.Mobility grant P500PT-206737.
AMS is grateful for support from a Department of Defense Vannevar Bush Faculty
Fellowship.
References
[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative fil-
tering: Operator estimation with spectral regularization. The Journal of Machine Learning
Research, 10:803–826, 2009.
[2] E. M. Achour, A. Foucault, S. Gerchinovitz, and F. Malgouyres. A general approximation
lower bound in Lp norm, with applications to feed-forward neural networks. In Advances in
Neural Information Processing Systems, 2022.
[3] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. Near-optimal learning of
Banach-valued, high-dimensional functions via deep neural networks. arXiv preprint
arXiv:2211.12633, 2022.
[4] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. On efficient algorithms for computing
near-best polynomial approximations to high-dimensional, Hilbert-valued functions from
limited samples. arXiv preprint arXiv:2203.13908, 2022.
[30] A. Cohen, R. Devore, and C. Schwab. Analytic regularity and polynomial approximation of
parametric and stochastic elliptic PDEs. Analysis and Applications, 9(01):11–47, 2011.
[31] M. J. Colbrook and A. Townsend. Rigorous data-driven computation of spectral proper-
ties of Koopman operators for dynamical systems. Communications on Pure and Applied
Mathematics, 77(1):221–283, 2024.
[32] S. Cotter, G. Roberts, A. Stuart, and D. White. MCMC methods for functions: Modifying
old algorithms to make them faster. Statistical Science, 28, 02 2012.
[33] C. Crambes and A. Mas. Asymptotics of prediction in functional linear regression with
functional outputs. Bernoulli, 19(5B):2627 – 2651, 2013.
[34] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of con-
trol, signals and systems, 2(4):303–314, 1989.
[35] M. De Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in
operator learning with neural networks. Journal of Machine Learning, 1:3:299–341, 2022.
[36] M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learn-
ing linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification,
11(2):480–513, 2023.
[37] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Convergence rate of Deep-
ONets for learning operators arising from advection-diffusion equations. arXiv preprint
arXiv:2102.10621, 2021.
[38] R. A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
[39] R. A. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation. Manuscripta
mathematica, 63:469–478, 1989.
[40] R. A. DeVore and G. G. Lorentz. Constructive approximation, volume 303. Springer Science
& Business Media, 1993.
[41] W. E, C. Ma, and L. Wu. The Barron space and the flow-induced function spaces for neural
network models. Constructive Approximation, 55(1):369–406, 2022.
[42] W. E and S. Wojtowytsch. Representation formulas and pointwise properties for Barron
functions. Calculus of Variations and Partial Differential Equations, 61(2):1–37, 2022.
[43] P. Enflo. A counterexample to the approximation problem in Banach spaces. Acta Mathe-
matica, 130:309–316, 1973.
[44] L. C. Evans. Partial differential equations, volume 19. American Mathematical Soc., 2010.
[45] N. R. Franco, S. Fresca, A. Manzoni, and P. Zunino. Approximation bounds for convolutional
neural networks in operator learning. Neural Networks, 161:129–141, 2023.
[46] T. Furuya, M. Puthawala, M. Lassas, and M. V. de Hoop. Globally injective and bijective
neural operators. In Thirty-eighth Conference on Neural Information Processing Systems,
2024.
[47] L. Galimberti, A. Kratsios, and G. Livieri. Designing universal causal deep learning models:
The case of infinite-dimensional dynamical systems from stochastic analysis. arXiv preprint
arXiv:2210.13300, 2022.
[48] C. R. Gin, D. E. Shea, S. L. Brunton, and J. N. Kutz. DeepGreen: Deep learning of Green’s
functions for nonlinear boundary value problems. arXiv preprint arXiv:2101.07206, 2020.
[49] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press, 2016.
[50] J. K. Gupta and J. Brandstetter. Towards multi-spatiotemporal-scale generalized PDE mod-
eling. arXiv preprint arXiv:2209.15616, 2022.
[51] M. Hairer, A. M. Stuart, and S. J. Vollmer. Spectral gaps for a Metropolis–Hastings algo-
rithm in infinite dimensions. The Annals of Applied Probability, 24(6):2455–2490, 2014.
[52] L. Herrmann, C. Schwab, and J. Zech. Neural and GPC operator surrogates: Construction
and expression rate bounds. arXiv preprint arXiv:2207.04950, 2022.
[53] J. S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems
using neural networks. Journal of Computational Physics, 363:55–78, 2018.
[54] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE constraints,
volume 23. Springer Science & Business Media, 2008.
[55] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366, 1989.
[56] N. Hua and W. Lu. Basis operator network: A neural network-based model for learning
nonlinear operators via neural basis. Neural Networks, 164:21–37, 2023.
[57] D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-
to-observable maps. arXiv preprint arXiv:2402.06031, 2024.
[58] D. Z. Huang, K. Xu, C. Farhat, and E. Darve. Learning constitutive relations from indirect
observations using deep neural networks. Journal of Computational Physics, 416:109491,
2020.
[59] J. Jin, Y. Lu, J. Blanchet, and L. Ying. Minimax optimal kernel operator learning via mul-
tilevel training. In Eleventh International Conference on Learning Representations, 2023.
[60] P. Jin, S. Meng, and L. Lu. MIONet: Learning multiple-input operators via tensor product.
SIAM Journal on Scientific Computing, 44(6):A3490–A3514, 2022.
[61] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160.
Springer Science & Business Media, 2006.
[62] T. Kim and M. Kang. Bounding the Rademacher complexity of Fourier neural operator.
arXiv preprint arXiv:2209.05150, 2022.
[63] I. Klebanov and T. Sullivan. Aperiodic table of modes and maximum a posteriori estimators.
arXiv preprint arXiv:2306.16278, 2023.
[64] Y. Korolev. Two-layer neural networks with values in a Banach space. SIAM Journal on
Mathematical Analysis, 54(6):6358–6389, 2022.
[65] V. Kostic, P. Novelli, A. Maurer, C. Ciliberto, L. Rosasco, and M. Pontil. Learning dynamical
systems via Koopman operator regression in reproducing kernel Hilbert spaces. In Thirty-
sixth Annual Conference on Neural Information Processing Systems, 2022.
[66] N. B. Kovachki. Machine Learning and Scientific Computing. PhD thesis, California Insti-
tute of Technology, 2022.
[67] N. B. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds
for Fourier neural operators. Journal of Machine Learning Research, 22(1), 2021.
[68] N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and
A. Anandkumar. Neural operator: Learning maps between function spaces with applica-
tions to PDEs. Journal of Machine Learning Research, 24(89), 2023.
[69] B. Kramer, B. Peherstorfer, and K. E. Willcox. Learning nonlinear reduced models from
data with operator inference. Annual Review of Fluid Mechanics, 56(1):521–548, 2024.
[70] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural
networks and parametric PDEs. Constructive Approximation, 55(1):73–125, 2022.
[71] R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri,
T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott,
A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather
forecasting. Science, 382(6677):1416–1421, 2023.
[72] S. Lanthaler. Operator learning with PCA-Net: Upper and lower complexity bounds. Jour-
nal of Machine Learning Research, 24(318), 2023.
[73] S. Lanthaler, Z. Li, and A. M. Stuart. The nonlocal neural operator: Universal approxima-
tion. arXiv preprint arXiv:2304.13221, 2023.
[74] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep
learning framework in infinite dimensions. Transactions of Mathematics and Its Applica-
tions, 6(1), 2022.
[75] S. Lanthaler, R. Molinaro, P. Hadorn, and S. Mishra. Nonlinear reconstruction for operator
learning of PDEs with discontinuities. In Eleventh International Conference on Learning
Representations, 2023.
[76] S. Lanthaler and N. H. Nelsen. Error bounds for learning with vector-valued random features.
In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[77] S. Lanthaler, T. K. Rusch, and S. Mishra. Neural oscillators are universal. In Thirty-seventh
Conference on Neural Information Processing Systems, 2023.
[78] S. Lanthaler and A. M. Stuart. The curse of dimensionality in operator learning. arXiv
preprint arXiv:2306.15924, 2023.
[79] K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds us-
ing deep convolutional autoencoders. Journal of Computational Physics, 404:108973, 2020.
[80] Z. Lei, L. Shi, and C. Zeng. Solving parametric partial differential equations with deep
rectified quadratic unit neural networks. Journal of Scientific Computing, 93(3):80, 2022.
[81] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis. Extended dynamic mode decomposition
with dictionary learning: A data-driven adaptive spectral decomposition of the Koopman
operator. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(10):103111, 2017.
[82] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and
A. Anandkumar. Neural operator: Graph kernel network for partial differential equations.
arXiv preprint arXiv:2003.03485, 2020.
[83] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and
A. Anandkumar. Fourier neural operator for parametric partial differential equations. In
Ninth International Conference on Learning Representations, 2021.
[84] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, A. M. Stuart, K. Bhattacharya, and
A. Anandkumar. Multipole graph neural operator for parametric partial differential equa-
tions. In Thirty-fourth Annual Conference on Neural Information Processing Systems, 2020.
[85] S. Liang, L. Lyu, C. Wang, and H. Yang. Reproducing activation function for deep learning.
arXiv preprint arXiv:2101.04844, 2021.
[86] J. Lindenstrauss and L. Tzafriri. Classical Banach Spaces I: Sequence Spaces. Ergebnisse
der Mathematik und ihrer Grenzgebiete. 2. Folge. Springer Berlin Heidelberg, 2013.
[87] J. Ling, A. Kurzawski, and J. Templeton. Reynolds averaged turbulence modelling using
deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166,
2016.
[88] P. Lippe, B. Veeling, P. Perdikaris, R. Turner, and J. Brandstetter. PDE-refiner: Achieving
accurate long rollouts with neural PDE solvers. In Thirty-seventh Annual Conference on
Neural Information Processing Systems, 2024.
[89] O. Litany, T. Remez, E. Rodola, A. Bronstein, and M. Bronstein. Deep functional maps:
Structured prediction for dense shape correspondence. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 5659–5667, 2017.
[90] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of opera-
tors between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–
67, 2024.
[91] X. Liu, S. Tian, F. Tao, and W. Yu. A review of artificial neural networks in the constitutive
modeling of composite materials. Composites Part B: Engineering, 224:109152, 2021.
[92] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators
via DeepONet based on the universal approximation theorem of operators. Nature Machine
Intelligence, 3(3):218–229, 2021.
[93] D. Luo, T. O’Leary-Roseberry, P. Chen, and O. Ghattas. Efficient PDE-constrained op-
timization under high-dimensional uncertainty using derivative-informed neural operators.
arXiv preprint arXiv:2305.20053, 2023.
[94] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic
partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
[95] H. Mhaskar. Local approximation of operators. Applied and Computational Harmonic Anal-
ysis, 64:194–228, 2023.
[96] H. Mhaskar and T. Poggio. Function approximation by deep networks. Communications on
Pure and Applied Analysis, 19(8):4085–4095, 2020.
[97] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system
identification. Neural Computation, 9(1):143–159, 1997.
[98] M. Mollenhauer, N. Mücke, and T. Sullivan. Learning linear operators: Infinite-dimensional
regression as a well-behaved non-compact inverse problem. arXiv preprint arXiv:2211.08875,
2022.
[99] J. Morton, F. D. Witherden, A. Jameson, and M. J. Kochenderfer. Deep dynamical modeling
and control of unsteady fluid flows. In Proceedings of the 32nd International Conference on
Neural Information Processing Systems, 2018.
[100] A. Mukherjee and A. Roy. Size lowerbounds for deep operator networks. arXiv preprint
arXiv:2308.06338, 2023.
[101] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between
Banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
[102] T. O’Leary-Roseberry, P. Chen, U. Villa, and O. Ghattas. Derivative-informed neural op-
erator: An efficient framework for high-dimensional parametric derivative learning. Journal
of Computational Physics, 496:112555, 2024.
[103] J. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic
maps in high dimension. Constructive Approximation, 55(1):537–582, 2022.
[104] M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas. Functional maps:
A flexible representation of maps between shapes. ACM Transactions on Graphics, 31(4), 2012.
[105] D. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, and A. A. Oberai. Variationally mimetic
operator networks. Computer Methods in Applied Mechanics and Engineering, 419:116536,
2024.
[106] R. G. Patel and O. Desjardins. Nonlinear integro-differential operator regression with neural
networks. arXiv preprint arXiv:1810.08552, 2018.
[107] R. G. Patel, N. A. Trask, M. A. Wood, and E. C. Cyr. A physics-informed operator regression
framework for extracting data-driven continuum models. Computer Methods in Applied
Mechanics and Engineering, 373:113500, 2021.
[108] B. Peherstorfer and K. Willcox. Data-driven operator inference for nonintrusive projection-
based model reduction. Computer Methods in Applied Mechanics and Engineering, 306:196–
215, 2016.
[109] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica,
8:143–195, 1999.
[110] M. Prasthofer, T. De Ryck, and S. Mishra. Variable-input deep operator networks. arXiv
preprint arXiv:2205.11404, 2022.
[111] E. Qian, I.-G. Farcas, and K. Willcox. Reduced operator inference for nonlinear partial
differential equations. SIAM Journal on Scientific Computing, to appear, 2022.
[112] E. Qian, B. Kramer, B. Peherstorfer, and K. Willcox. Lift & Learn: Physics-informed ma-
chine learning for large-scale nonlinear dynamical systems. Physica D: Nonlinear Phenom-
ena, 406:132401, 2020.
[113] M. Raghu and E. Schmidt. A survey of deep learning for scientific discovery. arXiv preprint
arXiv:2003.11755, 2020.
[114] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Twenty-First
Annual Conference on Neural Information Processing Systems, 2007.
[115] M. A. Rahman, Z. E. Ross, and K. Azizzadenesheli. U-NO: U-shaped neural operators.
Transactions on Machine Learning Research, 2023.
[116] J. O. Ramsay and C. Dalzell. Some tools for functional data analysis. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 53(3):539–561, 1991.
[117] B. Raonic, R. Molinaro, T. Rohner, S. Mishra, and E. de Bezenac. Convolutional neural
operators. In ICLR 2023 Workshop on Physics for Machine Learning, 2023.
[118] B. Raonic, R. Molinaro, T. De Ryck, T. Rohner, F. Bartolucci, R. Alaifari, S. Mishra, and
E. de Bezenac. Convolutional neural operators for robust and accurate learning of PDEs. In
Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[119] L. Rosasco, M. Belkin, and E. De Vito. On learning with integral operators. The Journal of
Machine Learning Research, 11:905–934, 2010.
[120] T. De Ryck and S. Mishra. Generic bounds on the approximation error for physics-informed
(and) operator learning. In Thirty-sixth Annual Conference on Neural Information Process-
ing Systems, 2022.
[121] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer
experiments. Statistical Science, 4(4):409–423, 1989.
[122] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU acti-
vation function. The Annals of Statistics, 48(4):1875–1897, 2020.
[123] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for Lipschitz
operators. arXiv preprint arXiv:2307.09835, 2023.
[124] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for
generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55,
2019.
[125] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates
for analytic functions in L2(Rd, γd). SIAM/ASA Journal on Uncertainty Quantification,
11(1):199–234, 2023.
[126] J. Seidman, G. Kissas, P. Perdikaris, and G. J. Pappas. NOMAD: Nonlinear manifold de-
coders for operator learning. In Thirty-sixth Annual Conference on Neural Information
Processing Systems, 2022.
[127] N. Sharp, S. Attaiki, K. Crane, and M. Ovsjanikov. DiffusionNet: Discretization agnostic
learning on surfaces. ACM Transactions on Graphics, 2022.
[128] Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions. Neural Net-
works, 119:74–84, 2019.
[129] Z. Shen, H. Yang, and S. Zhang. Neural network approximation: Three hidden layers are
enough. Neural Networks, 141:160–173, 2021.
[130] J. W. Siegel. Optimal approximation rates for deep ReLU neural networks on Sobolev and
Besov spaces. Journal of Machine Learning Research, 24(357):1–52, 2023.
[131] A. Stanziola, S. R. Arridge, B. T. Cox, and B. E. Treeby. A Helmholtz equation solver using
unsupervised learning: Application to transcranial ultrasound. Journal of Computational
Physics, 441:110430, 2021.
[132] G. Stepaniants. Learning partial differential equations in reproducing kernel Hilbert spaces.
Journal of Machine Learning Research, 24(86):1–72, 2023.
[133] A. M. Stuart. Inverse problems: A Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[134] R. Swischuk, L. Mainini, B. Peherstorfer, and K. Willcox. Projection-based model reduction:
Formulations for physics-based machine learning. Computers & Fluids, 179:704–717, 2019.
[135] T. Tripura and S. Chakraborty. Wavelet neural operator for solving parametric partial dif-
ferential equations in computational mechanics problems. Computer Methods in Applied
Mechanics and Engineering, 404:115783, 2023.
[136] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and
I. Polosukhin. Attention is all you need. In Thirty-first Annual Conference on Neural In-
formation Processing Systems, 2017.
[137] C. Wang and A. Townsend. Operator learning for hyperbolic partial differential equations.
arXiv preprint arXiv:2312.17489, 2023.
[138] D. Wang, Z. Zhao, Y. Yu, and R. Willett. Functional linear regression with mixed predictors.
The Journal of Machine Learning Research, 23(1):12181–12274, 2022.
[139] J.-X. Wang, J.-L. Wu, and H. Xiao. Physics-informed machine learning approach for recon-
structing Reynolds stress modeling discrepancies based on DNS data. Phys. Rev. Fluids,
2:034603, 2017.
[140] J.-L. Wu, H. Xiao, and E. Paterson. Physics-informed machine learning approach for aug-
menting turbulence models: A comprehensive framework. Phys. Rev. Fluids, 3:074602, 2018.
[141] K. Xu, D. Z. Huang, and E. Darve. Learning constitutive relations using symmetric positive
definite neural networks. Journal of Computational Physics, 428:110072, 2021.
[142] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks,
94:103–114, 2017.
[143] D. Yarotsky. Elementary superexpressive activations. In Thirty-eighth International Con-
ference on Machine Learning, 2021.
[144] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural
networks. In Thirty-fourth Annual Conference on Neural Information Processing Systems,
2020.
[145] E. Yeung, S. Kundu, and N. Hodas. Learning deep neural network representations for
Koopman operators of nonlinear dynamical systems. In 2019 American Control Confer-
ence (ACC), 2019.
[146] L. Yi, H. Su, X. Guo, and L. J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D
shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017.
[147] O. Zahm, P. G. Constantine, C. Prieur, and Y. M. Marzouk. Gradient-based dimension
reduction of multivariate vector-valued functions. SIAM Journal on Scientific Computing,
42(1):A534–A558, 2020.
[148] S. Zhang, Z. Shen, and H. Yang. Neural network architecture beyond width and depth. In
Thirty-sixth Annual Conference on Neural Information Processing Systems, 2022.
[149] Z. Zhang, L. Tat, and H. Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural
operator. Proceedings of the Royal Society A, 479, 2023.