Survey AI
Gitta Kutyniok
Abstract
We currently witness the spectacular success of artificial intelligence in both science and public life. However, the development of a rigorous mathematical foundation is still at an early stage. In this survey article,
which is based on an invited lecture at the International Congress of Mathematicians 2022, we will in particular
focus on the current “workhorse” of artificial intelligence, namely deep neural networks. We will present the
main theoretical directions along with several exemplary results and discuss key open problems.
1 Introduction
Artificial intelligence is currently leading to one breakthrough after the other, both in public with, for instance,
autonomous driving and speech recognition, and in the sciences in areas such as medical diagnostics or molecular
dynamics. In addition, research on artificial intelligence and, in particular, on its theoretical foundations, is
progressing at an unprecedented rate. One can envision that such methodologies will in the future drastically
change the way we live in numerous respects.
The area of partial differential equations was much slower to embrace these new techniques, the reason being
that it was not per se evident what the advantage of methods from artificial intelligence for this field would
be. Indeed, there seems to be no need to utilize learning-type methods, since a partial differential equation is a
rigorous mathematical model. But lately, the observation that deep neural networks are able to beat the curse of dimensionality in high-dimensional settings has led to a paradigm shift in this area as well. Research at the intersection of numerical analysis of partial differential equations and artificial intelligence has therefore accelerated since about 2017. We will delve further into this topic in Subsection 4.2.
Artificial Intelligence for Mathematical Problems. This direction focuses on mathematical problem settings
such as inverse problems and partial differential equations, with the goal of employing methodologies from artificial intelligence to develop superior solvers.
1.5 Outline
Both research directions will be discussed in this survey paper, showcasing some novel results and pointing out
key future challenges for mathematics. We start with an introduction to the mathematical setting, stating the
main definitions and notations (see Section 2). Next, in Section 3, we delve into the first main direction, namely
mathematical foundations for artificial intelligence, and discuss the research threads of expressivity, optimization,
generalization, and explainability. Section 4 is then devoted to the second main direction, which is artificial
intelligence for mathematical problems, and we highlight some exemplary results. Finally, Section 5 states the
seven main mathematical problems and concludes this article.
with $W^{(\ell)} \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ being the weight matrices and $b^{(\ell)} \in \mathbb{R}^{N_\ell}$ the bias vectors of the $\ell$-th layer. Then $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$, given by
$$\Phi(x) = T_L\,\rho(T_{L-1}\,\rho(\ldots \rho(T_1(x)))), \qquad x \in \mathbb{R}^d,$$
is called a (deep) neural network of depth $L$.
Let us already mention at this point that the weights and biases are the free parameters which will be learned
during the training process. An illustration of the multilayered structure of a deep neural network can be found
in Figure 1.
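To make this definition concrete, the following minimal NumPy sketch evaluates a network of the above form $\Phi(x) = T_L\,\rho(T_{L-1}\,\rho(\ldots \rho(T_1(x))))$ with the ReLU activation; the architecture, the random initialization, and all variable names are illustrative choices and not part of the definition.

```python
import numpy as np

def relu(z):
    # ReLU activation rho(z) = max{0, z}, applied componentwise.
    return np.maximum(0.0, z)

def neural_network(x, weights, biases, rho=relu):
    """Evaluate Phi(x) = T_L rho(T_{L-1} rho(... rho(T_1(x)))) with T_l(z) = W_l z + b_l."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = rho(W @ z + b)                   # hidden layers: affine map followed by the activation
    return weights[-1] @ z + biases[-1]      # output layer: affine map only

# Illustrative architecture: d = 3 inputs, two hidden layers with 16 neurons, N_L = 2 outputs.
rng = np.random.default_rng(0)
layer_sizes = [3, 16, 16, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(m) for m in layer_sizes[1:]]

x = rng.standard_normal(3)
print(neural_network(x, weights, biases))    # a vector in R^{N_L} = R^2
```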
outputs. The task of the deep neural network is then to identify the relation between those. For instance, in a classification problem, each output $y^{(i)}$ is considered to be the label of the respective class to which the input $x^{(i)}$ belongs. One can also take the viewpoint that $(x^{(i)}, y^{(i)})_{i=1}^{\tilde m}$ arises as samples from a function such as $g : \mathcal{M} \to \{1, 2, \ldots, K\}$, where $\mathcal{M}$ might be a lower-dimensional manifold of $\mathbb{R}^d$, in the sense of $y^{(i)} = g(x^{(i)})$ for all $i = 1, \ldots, \tilde m$.
The set $(x^{(i)}, y^{(i)})_{i=1}^{\tilde m}$ is then split into a training data set $(x^{(i)}, y^{(i)})_{i=1}^{m}$ and a test data set $(x^{(i)}, y^{(i)})_{i=m+1}^{\tilde m}$. The training data set is, as the name indicates, used for training, whereas the test data set will later on solely be exploited for testing the performance of the trained network. We emphasize that the neural network is not exposed to the test data set during the entire training process.
Step 2 (Choice of architecture): In preparation for the learning algorithm, the architecture of the neural network needs to be decided upon, which means the number of layers $L$, the number of neurons in each layer $(N_\ell)_{\ell=1}^{L}$, and the activation function $\rho$ have to be selected. It is known that a fully connected neural network is often difficult to train; hence, in addition, one typically preselects certain entries of the weight matrices $(W^{(\ell)})_{\ell=1}^{L}$ to be set to zero already at this point.
For later purposes, we denote the selected class of deep neural networks by $\mathcal{NN}_\theta$, with $\theta$ encoding the chosen architecture.
Step 3 (Training): The next step is the actual training process, which consists of learning the affine-linear functions $(T_\ell)_{\ell=1}^{L} = (W^{(\ell)} \cdot{} + b^{(\ell)})_{\ell=1}^{L}$. This is accomplished by minimizing the empirical risk
$$\widehat{\mathcal{R}}\bigl(\Phi_{(W^{(\ell)},b^{(\ell)})_\ell}\bigr) := \frac{1}{m} \sum_{i=1}^{m} \bigl(\Phi_{(W^{(\ell)},b^{(\ell)})_\ell}(x^{(i)}) - y^{(i)}\bigr)^2. \qquad (2.1)$$
A more general form of the optimization problem is
$$\min_{(W^{(\ell)},b^{(\ell)})_\ell} \; \sum_{i=1}^{m} \mathcal{L}\bigl(\Phi_{(W^{(\ell)},b^{(\ell)})_\ell}(x^{(i)}), y^{(i)}\bigr) + \lambda\, \mathcal{P}\bigl((W^{(\ell)}, b^{(\ell)})_\ell\bigr), \qquad (2.2)$$
where $\mathcal{L}$ is a loss function to determine a measure of closeness between the network evaluated in the training samples and the (known) values $y^{(i)}$, and where $\mathcal{P}$ is a penalty/regularization term to impose additional constraints on the weight matrices and bias vectors.
One common algorithmic approach is gradient descent. Since, however, $m$ is typically very large, this is computationally infeasible. This problem is circumvented by randomly selecting only a few gradients in each iteration, assuming that they constitute a reasonable average, which is the approach of stochastic gradient descent.
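As an illustration of Step 3, the sketch below minimizes the empirical risk (2.1) for a one-hidden-layer ReLU network by mini-batch stochastic gradient descent. The gradients are written out by hand for this small special case (in practice they are obtained via backpropagation, see Subsection 3.2), and the data, network width, batch size, and learning rate are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data (placeholder): m samples with a scalar target.
m, d = 1000, 5
X = rng.standard_normal((m, d))
y = np.sin(X @ rng.standard_normal(d))        # an unknown input-output relation to be learned

# One-hidden-layer ReLU network Phi(x) = w2 . rho(W1 x + b1) + b2.
width = 32
W1 = rng.standard_normal((width, d)) / np.sqrt(d)
b1 = np.zeros(width)
w2 = rng.standard_normal(width) / np.sqrt(width)
b2 = 0.0

def forward(Xb):
    A = Xb @ W1.T + b1                        # pre-activations, shape (batch, width)
    H = np.maximum(A, 0.0)                    # ReLU
    return A, H, H @ w2 + b2                  # predictions, shape (batch,)

lr, batch_size = 1e-2, 32
for step in range(2000):
    idx = rng.choice(m, size=batch_size, replace=False)   # random mini-batch
    Xb, yb = X[idx], y[idx]
    A, H, pred = forward(Xb)
    err = pred - yb
    # Gradients of the mini-batch mean squared error, written out by hand.
    grad_w2 = 2 * H.T @ err / batch_size
    grad_b2 = 2 * err.mean()
    G = 2 * np.outer(err, w2) * (A > 0) / batch_size      # backpropagation through the ReLU
    grad_W1 = G.T @ Xb
    grad_b1 = G.sum(axis=0)
    # Stochastic gradient descent update.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

_, _, pred_all = forward(X)
print("final empirical risk:", np.mean((pred_all - y) ** 2))
```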
Solving the optimization problem then yields a network $\Phi_{(W^{(\ell)},b^{(\ell)})_\ell} : \mathbb{R}^d \to \mathbb{R}^{N_L}$ with the learned weights and biases $(W^{(\ell)}, b^{(\ell)})_\ell$.
Step 4 (Testing): Finally, the performance (often also called generalization ability) of the trained neural network is tested using the test data set $(x^{(i)}, y^{(i)})_{i=m+1}^{\tilde m}$ by analyzing whether $\Phi_{(W^{(\ell)},b^{(\ell)})_\ell}(x^{(i)}) \approx y^{(i)}$ for $i = m+1, \ldots, \tilde m$.
where we used the $L^2$-norm to measure the distance between $f$ and $g$. The error between the trained deep neural network $\Phi^0 := \Phi_{(W^{(\ell)},b^{(\ell)})_\ell} \in \mathcal{NN}_\theta$ and the optimal function $g$ can then be estimated by
$$\mathcal{R}(\Phi^0) \le \underbrace{\widehat{\mathcal{R}}(\Phi^0) - \inf_{\Phi \in \mathcal{NN}_\theta} \widehat{\mathcal{R}}(\Phi)}_{\text{Optimization error}} \;+\; \underbrace{2 \sup_{\Phi \in \mathcal{NN}_\theta} \bigl|\mathcal{R}(\Phi) - \widehat{\mathcal{R}}(\Phi)\bigr|}_{\text{Generalization error}} \;+\; \underbrace{\inf_{\Phi \in \mathcal{NN}_\theta} \mathcal{R}(\Phi)}_{\text{Approximation error}}. \qquad (2.4)$$
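The decomposition (2.4) follows from a standard add-and-subtract argument: for an arbitrary $\Phi^* \in \mathcal{NN}_\theta$,
$$\mathcal{R}(\Phi^0) = \bigl[\mathcal{R}(\Phi^0) - \widehat{\mathcal{R}}(\Phi^0)\bigr] + \bigl[\widehat{\mathcal{R}}(\Phi^0) - \widehat{\mathcal{R}}(\Phi^*)\bigr] + \bigl[\widehat{\mathcal{R}}(\Phi^*) - \mathcal{R}(\Phi^*)\bigr] + \mathcal{R}(\Phi^*),$$
and bounding the first and third bracket by $\sup_{\Phi \in \mathcal{NN}_\theta} |\mathcal{R}(\Phi) - \widehat{\mathcal{R}}(\Phi)|$ each, the second bracket by $\widehat{\mathcal{R}}(\Phi^0) - \inf_{\Phi \in \mathcal{NN}_\theta} \widehat{\mathcal{R}}(\Phi)$, and finally taking the infimum over $\Phi^* \in \mathcal{NN}_\theta$ in the last term yields (2.4).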
These considerations lead to the main research threads described in the following subsection.
Expressivity. Here, the goal is to analyze the approximation error $\inf_{\Phi \in \mathcal{NN}_\theta} \mathcal{R}(\Phi)$ from (2.4), which estimates the approximation accuracy when approximating $g$ by the hypothesis class $\mathcal{NN}_\theta$ of deep neural networks of a particular architecture. Typical methods for approaching this problem come from applied harmonic analysis and approximation theory.
Learning/Optimization. The main goal of this direction is the analysis of the training algorithm, such as stochastic gradient descent, in particular asking why it usually converges to suitable local minima even though the problem itself is highly non-convex. This requires the analysis of the optimization error, which is $\widehat{\mathcal{R}}(\Phi^0) - \inf_{\Phi \in \mathcal{NN}_\theta} \widehat{\mathcal{R}}(\Phi)$ (cf. (2.4)) and which measures the accuracy with which the learnt neural network $\Phi^0$ minimizes the empirical risk (2.1), (2.2). Key methodologies for attacking such problems come from the areas of algebraic/differential geometry, optimal control, and optimization.
Generalization. This direction aims to derive an understanding of the out-of-sample error, namely $\sup_{\Phi \in \mathcal{NN}_\theta} |\mathcal{R}(\Phi) - \widehat{\mathcal{R}}(\Phi)|$ from (2.4), which measures the distance between the empirical risk (2.1), (2.2) and the actual risk (2.3). Predominantly, learning theory, probability theory, and statistics provide the required methods for this research thread.
A very exciting and highly relevant new research direction has recently emerged, coined explainability. At present, it is still a wide open field from the standpoint of mathematical foundations.
Explainability. This direction considers deep neural networks that are already trained but for which no knowledge about the training is available, a situation one encounters frequently in practice. The goal is then to
derive a deep understanding of how a given trained deep neural network reaches decisions in the sense of
which features of the input data are crucial for a decision. The range of required approaches is quite broad,
including areas such as information theory or uncertainty quantification.
Inverse Problems. Research in this direction aims to improve classical model-based approaches to solve
inverse problems by exploiting methods of artificial intelligence. In order to not neglect domain knowledge
such as the physics of the problem, current approaches aim to take the best out of both worlds in the sense
of optimally combining model- and data-driven approaches. This research direction requires a variety of
techniques, foremost from areas such as imaging science, inverse problems, and microlocal analysis, to name
a few.
Partial Differential Equations. Similar to the area of inverse problems, here the goal is to improve classical
solvers of partial differential equations by using ideas from artificial intelligence. A particular focus is on high
dimensional problems in the sense of aiming to beat the curse of dimensionality. This direction obviously
requires methods from areas such as numerical mathematics and partial differential equations.
3.1 Expressivity
Expressivity is maybe the richest area at present in terms of mathematical results. The general question can
be phrased as follows: Given a function class/space $\mathcal{C}$ and a class of deep neural networks $\mathcal{NN}_\theta$, how does the approximation accuracy when approximating elements of $\mathcal{C}$ by networks $\Phi \in \mathcal{NN}_\theta$ relate to the complexity of such $\Phi$? Making this precise thus requires the introduction of a complexity measure for deep neural networks. In the sequel, we will choose the canonical one, which is the complexity in terms of memory requirements. Notice though that various other complexity measures certainly exist. Further, recall that the $\|\cdot\|_0$-“norm” counts the number of non-zero components.
Definition 3.1. Retaining the same notation for deep neural networks as in Definition 2.2, the complexity $C(\Phi)$ of a deep neural network $\Phi$ is defined by
$$C(\Phi) := \sum_{\ell=1}^{L} \bigl( \|W^{(\ell)}\|_0 + \|b^{(\ell)}\|_0 \bigr).$$
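In code, this complexity is simply the total number of non-zero weight and bias entries; a minimal sketch (which could be applied directly to the weights and biases lists of the illustrative network sketch following Definition 2.2):

```python
import numpy as np

def complexity(weights, biases):
    """C(Phi): total number of non-zero entries of all weight matrices W^(l) and bias vectors b^(l)."""
    return sum(int(np.count_nonzero(W)) for W in weights) \
         + sum(int(np.count_nonzero(b)) for b in biases)

# Example: a sparsified layer contributes only its non-zero entries.
W = np.array([[0.0, 1.5], [0.0, 0.0]])
b = np.array([0.0, -2.0])
print(complexity([W], [b]))   # -> 2
```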
The most well-known—and maybe even the first—result on expressivity is the universal approximation theorem
[8, 13]. It states that each continuous function on a compact domain can be approximated up to an arbitrary
accuracy by a shallow neural network.
Theorem 3.2. Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, $f : K \to \mathbb{R}$ continuous, and $\rho : \mathbb{R} \to \mathbb{R}$ continuous and not a polynomial. Then, for each $\varepsilon > 0$, there exist $N \in \mathbb{N}$ and $a_k, b_k \in \mathbb{R}$, $w_k \in \mathbb{R}^d$, $1 \le k \le N$, such that
$$\Bigl\| f - \sum_{k=1}^{N} a_k\, \rho(\langle w_k, \cdot\rangle - b_k) \Bigr\|_\infty \le \varepsilon.$$
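The form of the approximant in Theorem 3.2, $\sum_{k=1}^{N} a_k \rho(\langle w_k, \cdot\rangle - b_k)$, can be explored numerically. The sketch below fits only the outer coefficients $a_k$ by least squares over randomly drawn $w_k, b_k$; this random-features shortcut is chosen purely for simplicity and is not the constructive argument behind the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.exp(-x**2) * np.cos(3 * x)          # a continuous target on K = [-1, 1]

# Shallow approximant sum_k a_k * rho(w_k * x - b_k) with the ReLU as rho.
N = 50
w = rng.uniform(-5, 5, size=N)                       # inner weights, drawn at random
b = rng.uniform(-5, 5, size=N)                       # biases, drawn at random

x_train = np.linspace(-1, 1, 400)
features = np.maximum(np.outer(x_train, w) - b, 0.0)         # rho(<w_k, x> - b_k), shape (400, N)
a, *_ = np.linalg.lstsq(features, f(x_train), rcond=None)    # fit the outer coefficients a_k

x_test = np.linspace(-1, 1, 1000)
approx = np.maximum(np.outer(x_test, w) - b, 0.0) @ a
print("sup-norm error on [-1, 1]:", np.max(np.abs(approx - f(x_test))))
```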
While Theorem 3.2 is certainly an interesting result, it is not satisfactory in several regards: it does not give bounds on the complexity of the approximating neural network and also does not explain why depth is so important. A particularly intriguing example of a result that considers complexity and also targets a more sophisticated function space was derived in [31].
Theorem 3.3. For all $f \in C^s([0,1]^d)$ and $\rho(x) = \max\{0, x\}$, i.e., the ReLU, there exist neural networks $(\Phi_n)_{n \in \mathbb{N}}$ with the number of layers of $\Phi_n$ being approximately of the order of $\log(n)$ such that
$$\|f - \Phi_n\|_\infty \lesssim C(\Phi_n)^{-s/d} \to 0 \quad \text{as } n \to \infty.$$
This result provides a beautiful connection between approximation accuracy and complexity of the approximating neural network, and also to some extent takes the depth of the network into account. However, to derive a result on optimal approximations, we first require a lower bound. The so-called VC-dimension (Vapnik-Chervonenkis dimension) (see also (3.2)) was for a long time the main method for achieving such lower bounds. We will recall here a newer result from [7] in terms of the optimal exponent $\gamma^*(\mathcal{C})$ from information theory to measure the complexity of $\mathcal{C} \subset L^2(\mathbb{R}^d)$. Notice that we will only state the essence of this result without all technicalities.
Theorem 3.4. Let $d \in \mathbb{N}$, $\rho : \mathbb{R} \to \mathbb{R}$, and let $\mathcal{C} \subset L^2(\mathbb{R}^d)$. Further, let
$$\mathrm{Learn} : (0,1) \times \mathcal{C} \to \mathcal{NN}_\theta$$
This conceptual lower bound, which is independent of any learning algorithm, now allows one to derive results on approximation by neural networks that have optimally small complexity in the sense of being memory-optimal. We will next provide an example of such a result, which at the same time answers another question as well. The universal approximation theorem already indicates that deep neural networks seem to have a universality property in the sense of performing at least as well as polynomial approximation. One can now ask whether neural networks also perform as well as other existing approximation schemes such as wavelets, or the more sophisticated system of shearlets [16].
For this, let us briefly recall this system and its approximation properties. Shearlets are based on parabolic scaling, i.e.,
$$A_{2^j} = \begin{pmatrix} 2^j & 0 \\ 0 & 2^{j/2} \end{pmatrix}, \quad j \in \mathbb{Z},$$
and $\tilde{A}_{2^j} = \mathrm{diag}(2^{j/2}, 2^j)$, as well as changing the orientation via shearing, defined by
$$S_k = \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}, \quad k \in \mathbb{Z}.$$
(Cone-adapted) discrete shearlet systems can then be defined as follows, cf. [17]. A faithful implementation of
the shearlet transform as a 2D and 3D (parallelized) fast shearlet transform can be found at www.ShearLab.org.
Definition 3.5. The (cone-adapted) discrete shearlet system $\mathcal{SH}(\phi, \psi, \tilde{\psi})$ generated by $\phi \in L^2(\mathbb{R}^2)$ and $\psi, \tilde{\psi} \in L^2(\mathbb{R}^2)$ is the union of
$$\{\phi(\cdot - m) : m \in \mathbb{Z}^2\},$$
$$\{2^{3j/4}\, \psi(S_k A_{2^j} \cdot{} - m) : j \ge 0,\ |k| \le \lceil 2^{j/2} \rceil,\ m \in \mathbb{Z}^2\},$$
$$\{2^{3j/4}\, \tilde{\psi}(S_k^T \tilde{A}_{2^j} \cdot{} - m) : j \ge 0,\ |k| \le \lceil 2^{j/2} \rceil,\ m \in \mathbb{Z}^2\}.$$
Since multivariate problems are typically governed by anisotropic features such as edges in images or shock
fronts in the solution of transport-dominated equations, the following suitable model class of functions was
introduced in [9].
Definition 3.6. The set of cartoon-like functions $\mathcal{E}^2(\mathbb{R}^2)$ is defined by
$$\mathcal{E}^2(\mathbb{R}^2) = \{f \in L^2(\mathbb{R}^2) : f = f_0 + f_1 \cdot \chi_B\},$$
where $\emptyset \neq B \subset [0,1]^2$ is simply connected with a $C^2$-curve with bounded curvature as its boundary, and $f_i \in C^2(\mathbb{R}^2)$ with $\mathrm{supp}\, f_i \subseteq [0,1]^2$ and $\|f_i\|_{C^2} \le 1$, $i = 0, 1$.
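For intuition, a discretized cartoon-like function can be generated as a smooth part plus a second smooth part restricted to a set $B$ with $C^2$ boundary; in the sketch below $B$ is a disk, a purely illustrative choice, and the normalization conditions of Definition 3.6 are not enforced. The jump along $\partial B$ is precisely the anisotropic feature that shearlets are designed to capture.

```python
import numpy as np

n = 256
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))

f0 = np.exp(-((xx - 0.3)**2 + (yy - 0.7)**2))           # smooth background part f_0
f1 = np.sin(4 * np.pi * xx) * np.cos(2 * np.pi * yy)    # smooth part f_1 inside B
chi_B = ((xx - 0.55)**2 + (yy - 0.45)**2 <= 0.3**2)     # B: a disk, whose boundary is a C^2 curve

f = f0 + f1 * chi_B     # cartoon-like: smooth away from a jump discontinuity along the boundary of B
print(f.shape)          # (256, 256) grid discretization
```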
While wavelets are deficient in optimally approximating cartoon-like functions due to their isotropic structure,
shearlets provide an optimal (sparse) approximation rate up to a log-factor. The following statement is taken from
[17], where also the precise hypotheses can be found. Notice that the justification for optimality is a benchmark
result from [9].
Theorem 3.7. Let $\phi, \psi, \tilde{\psi} \in L^2(\mathbb{R}^2)$ be compactly supported, and let $\hat{\psi}, \hat{\tilde{\psi}}$ satisfy certain decay conditions. Then $\mathcal{SH}(\phi, \psi, \tilde{\psi})$ provides an optimally sparse approximation of $f \in \mathcal{E}^2(\mathbb{R}^2)$, i.e.,
$$\|f - f_N\|_2 \lesssim N^{-1} (\log N)^{3/2} \quad \text{as } N \to \infty.$$
One can now use Theorem 3.4 to show that indeed deep neural networks are as good approximators as shearlets
and in fact as all affine systems. Even more, the construction in the proof of suitable neural networks, which
mimics best N -term approximations, also leads to memory-optimal neural networks. The resulting statement
from [7] in addition proves that the bound in Theorem 3.4 is sharp.
Theorem 3.8. Let $\rho$ be a suitably chosen activation function, and let $\varepsilon > 0$. Then, for all $f \in \mathcal{E}^2(\mathbb{R}^2)$ and $N \in \mathbb{N}$, there exists a neural network $\Phi$ with complexity $O(N)$ and activation function $\rho$ with
$$\|f - \Phi\|_2 \lesssim N^{-1+\varepsilon} \to 0 \quad \text{as } N \to \infty.$$
Summarizing, one can conclude that deep neural networks achieve optimal approximation properties of all
affine systems combined.
Let us finally mention that lately a very different viewpoint on expressivity was introduced in [21], based on so-called trajectory lengths. The standpoint taken in this work is to measure expressivity in terms of changes
of the expected length of a (non-constant) curve in the input space as it propagates through layers of a neural
network.
3.2 Optimization
This area aims to analyze optimization algorithms, which solve the (learning) problem in (2.1), or, more generally,
(2.2). A common approach is gradient descent, since the gradient of the loss function (or optimized functional)
with respect to the weight matrices and biases, i.e., the parameters of the network, can be computed exactly.
This is done via backpropagation [27], which is in a certain sense merely an efficient application of the chain
rule. However, since the number of training samples is typically in the millions, it is computationally infeasible
to compute the gradient on each sample. Therefore, in each iteration only one or several (a batch) randomly
selected gradients are computed, leading to the algorithm of stochastic gradient descent [25].
In convex settings, guarantees for convergence of stochastic gradient descent do exist. However, in the neural
network setting, the optimization problem is non-convex, which makes it—even when using a non-random version
of gradient descent—very hard to analyze. Including randomness adds another level of difficulty as is depicted in
Figure 2, where the two algorithms reach different (local) minima.
This area is by far less explored than expressivity. Most current results focus on shallow neural networks, and
for a survey, we refer to [6].
3.3 Generalization
This research direction is perhaps the least explored and maybe also the most difficult one, sometimes called the
“holy grail” of understanding deep neural networks. It targets the out-of-sample error
$$\sup_{\Phi \in \mathcal{NN}_\theta} \bigl|\mathcal{R}(\Phi) - \widehat{\mathcal{R}}(\Phi)\bigr|, \qquad (3.1)$$
also called the generalization error.
Figure 3: Phenomenon of overfitting for the task of classification with two classes (panels: underfitting vs. overfitting)
Let us now analyze the generalization error in (3.1) in a bit more depth. For a large number $m$ of training samples, the law of large numbers tells us that with high probability $\widehat{\mathcal{R}}(\Phi) \approx \mathcal{R}(\Phi)$ for each neural network $\Phi \in \mathcal{NN}_\theta$. Bounding the complexity of the hypothesis class $\mathcal{NN}_\theta$ by the VC-dimension, the generalization error can be bounded with probability $1 - \delta$ by
$$\sqrt{\frac{\mathrm{VCdim}(\mathcal{NN}_\theta) + \log(1/\delta)}{m}}. \qquad (3.2)$$
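To get a feeling for the scale of this bound, consider purely illustrative numbers: for a hypothesized $\mathrm{VCdim}(\mathcal{NN}_\theta) = 10^6$ and $\delta = 0.01$, the bound (3.2) evaluates to
$$\sqrt{\frac{10^6 + \log(100)}{m}} \approx \begin{cases} 1 & \text{for } m = 10^6, \\ 0.1 & \text{for } m = 10^8, \end{cases}$$
so on the order of a hundred million training samples would be required before the guarantee becomes informative.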
For classes of highly over-parametrized neural networks, i.e., where $\mathrm{VCdim}(\mathcal{NN}_\theta)$ is very large, we need an enormous amount of training data to keep the generalization error under control. It is therefore more than surprising that numerical experiments show the phenomenon of a so-called double descent curve [5]. More precisely, after the initial descent and ascent of the test error predicted by statistical learning theory, the test error was found to decrease again once the number of parameters passes the interpolation point (see Figure 4).
Figure 4: Double descent curve (training and test error)
3.4 Explainability
The area of explainability aims to “open the black box” of deep neural networks in the sense of explaining the decisions
of trained neural networks. These explanations typically consist of providing relevance scores for features of the
input data. Most approaches focus on the task of image classification and provide relevance scores for each pixel
of the input image. One can roughly categorize the different types of approaches into gradient-based methods
[28], propagation of activations in neurons [4], surrogate models [24], and game-theoretic approaches [19].
We would now like to describe in more detail an approach which is based on information theory and also
allows an extension to different modalities such as audio data as well as analyzing the relevance of higher-level
features; for a survey paper, we refer to [15]. This rate-distortion explanation (RDE) framework was introduced
in 2019 and later extended by applying RDE to non-canonical input representations.
Let now $\Phi : \mathbb{R}^d \to \mathbb{R}^n$ be a trained neural network, and $x \in \mathbb{R}^d$. The goal of RDE is to provide an explanation for the decision $\Phi(x)$ in terms of a sparse mask $s \in \{0,1\}^d$ which highlights the crucial input features of $x$. This mask is determined by the following optimization problem:
$$\min_{s \in \{0,1\}^d} \ \mathbb{E}_{v \sim \mathcal{V}}\, d\bigl(\Phi(x), \Phi(x \odot s + (1 - s) \odot v)\bigr) \quad \text{subject to } \|s\|_0 \le \ell,$$
where $\odot$ denotes the Hadamard product, $d$ is a measure of distortion such as the $\ell^2$-distance, $\mathcal{V}$ is a distribution over input perturbations $v \in \mathbb{R}^d$, and $\ell \in \{1, \ldots, d\}$ is a given sparsity level for the explanation mask $s$. The key idea is that a solution $s^*$ is a mask marking few components of the input $x$ which are sufficient to approximately retain the decision $\Phi(x)$. This viewpoint reveals the relation to rate-distortion theory, which normally focuses on lossy compression of data.
Since it is computationally infeasible to compute such a minimizer (see [30]), a relaxed optimization problem providing continuous masks $s \in [0,1]^d$ is used in practice:
$$\min_{s \in [0,1]^d} \ \mathbb{E}_{v \sim \mathcal{V}}\, d\bigl(\Phi(x), \Phi(x \odot s + (1 - s) \odot v)\bigr) + \lambda \|s\|_1,$$
where $\lambda > 0$ determines the sparsity level of the mask. The minimizer now assigns each component $x_i$ of the input, in the case of images each pixel, a relevance score $s_i \in [0,1]$. This is typically referred to as Pixel RDE.
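A minimal sketch of this relaxed problem for a generic (here randomly chosen) model $\Phi$ is given below. It approximates the expectation by Monte Carlo samples from a Gaussian perturbation distribution and uses projected gradient descent with a finite-difference gradient; all of these are illustrative simplifications rather than the original RDE implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

d = 10                                  # input dimension (illustrative)
x = rng.standard_normal(d)              # input whose decision Phi(x) is to be explained

# Toy stand-in for a trained network Phi : R^d -> R^3.
W1 = rng.standard_normal((16, d)) / np.sqrt(d)
W2 = rng.standard_normal((3, 16)) / 4.0
Phi = lambda z: W2 @ np.maximum(W1 @ z, 0.0)

def objective(s, v, lam=0.1):
    """Monte Carlo estimate of E_v d(Phi(x), Phi(x*s + (1-s)*v)) + lam * ||s||_1."""
    obfuscated = x * s + (1.0 - s) * v                          # shape (n_mc, d)
    distortion = np.mean([np.sum((Phi(x) - Phi(z)) ** 2) for z in obfuscated])
    return distortion + lam * np.sum(s)                         # s >= 0, hence ||s||_1 = sum(s)

# Projected gradient descent on the relaxed mask s in [0, 1]^d.
s, lr, eps = np.full(d, 0.5), 0.05, 1e-3
for step in range(300):
    v = rng.standard_normal((32, d))    # perturbations v ~ V, shared within the step
    grad = np.zeros(d)
    for j in range(d):                  # finite-difference gradient (illustrative shortcut)
        e_j = np.zeros(d); e_j[j] = eps
        grad[j] = (objective(s + e_j, v) - objective(s - e_j, v)) / (2 * eps)
    s = np.clip(s - lr * grad, 0.0, 1.0)    # project back onto [0, 1]^d

print("relevance scores:", np.round(s, 2))
```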
Extensions of the RDE framework allow the incorporation of different distributions $\mathcal{V}$ better adapted to the data distribution. Another recent improvement was the assignment of relevance scores to higher-level features, such as those arising from a wavelet decomposition, which ultimately led to the approach CartoonX. An example of Pixel
RDE versus CartoonX, which also shows the ability of the higher-level explanations of CartoonX to give insights
into what the neural network saw when misclassifying an image, is depicted in Figure 5.
Figure 5: Pixel RDE versus CartoonX for analyzing misclassifications of a deep neural network
with deep learning in the sense of taking the best out of the model- and data-world.
To introduce such results, we start by recalling some basics about solvers of inverse problems. For this, assume
that we are given an (ill-posed) inverse problem
Kf = g, (4.1)
where K : X → Y is an operator and X and Y are, for instance, Hilbert spaces. Drawing from the area of
imaging science, examples include denoising, deblurring, or inpainting (recovery of missing parts of an image).
Most classical solvers are of the form (which includes Tikhonov regularization)
$$f^\alpha := \operatorname*{argmin}_{f} \Bigl[\underbrace{\|Kf - g\|^2}_{\text{Data fidelity term}} + \alpha \cdot \underbrace{\mathcal{P}(f)}_{\text{Penalty/Regularization term}}\Bigr],$$
where $\mathcal{P} : X \to \mathbb{R}$ and $f^\alpha \in X$, $\alpha > 0$, is an approximate solution of the inverse problem (4.1). One very popular and widely applicable special case is sparse regularization, where $\mathcal{P}$ is chosen as the $\ell^1$-norm of the coefficients of $f$ with respect to a suitably selected sparsifying representation system such as a wavelet or shearlet system.
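For concreteness, the following is a minimal sketch of sparse regularization via the iterative soft-thresholding algorithm (ISTA), using a random underdetermined operator $K$ and the canonical basis as the sparsifying system, so that $\mathcal{P}(f) = \|f\|_1$; the operator, problem sizes, and parameters are illustrative stand-ins for, e.g., a tomography operator combined with a shearlet system.

```python
import numpy as np

rng = np.random.default_rng(4)

# Ill-posed toy problem: K has fewer measurements than unknowns.
n, p = 60, 128
K = rng.standard_normal((n, p)) / np.sqrt(n)
f_true = np.zeros(p)
f_true[rng.choice(p, 8, replace=False)] = rng.standard_normal(8)   # sparse ground truth
g = K @ f_true + 0.01 * rng.standard_normal(n)                     # noisy data g = K f + noise

def soft_threshold(z, t):
    # Proximal map of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# ISTA for  argmin_f ||K f - g||^2 + alpha * ||f||_1.
alpha = 0.02
step = 1.0 / (2 * np.linalg.norm(K, 2) ** 2)    # step size 1/L, with L = 2 ||K||^2 the Lipschitz constant of the gradient
f = np.zeros(p)
for _ in range(500):
    f = soft_threshold(f - step * 2 * K.T @ (K @ f - g), step * alpha)

print("relative reconstruction error:", np.linalg.norm(f - f_true) / np.linalg.norm(f_true))
```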
Figure 6: Original image, sparse regularization with shearlets, and deep microlocal reconstruction [3]
hybrid type and takes the best out of both worlds in the sense of combining model- and artificial intelligence-
based approaches.
Finally, the deep learning-based wavefront set extraction itself is yet another example of the improvements to the state of the art now made possible by artificial intelligence. Figure 7 shows a classical result from [23], whereas [2] uses the shearlet transform as a coarse edge detector, which is subsequently combined with a deep neural network.
Let $\mathcal{L}(u_y, y) = f_y$ denote a parametric partial differential equation with $y$ being a parameter from a high-dimensional parameter space $\mathcal{Y} \subseteq \mathbb{R}^p$ and $u_y$ the associated solution in a Hilbert space $\mathcal{H}$. After a high-fidelity discretization, let $b_y(u_y^h, v) = f_y(v)$ be the associated variational form with $u_y^h$ now belonging to the associated high-dimensional space $\mathcal{U}^h$, where we set $D := \dim(\mathcal{U}^h)$. We moreover denote the coefficient vector of $u_y^h$ with respect to a suitable basis of $\mathcal{U}^h$ by $\mathbf{u}_y^h$. Of key importance in this area is the parametric map
$$\mathcal{Y} \ni y \mapsto \mathbf{u}_y^h \in \mathbb{R}^D,$$
which one aims to approximate efficiently, for instance by a deep neural network.
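To illustrate the objects involved, the sketch below sets up a one-dimensional parametric diffusion equation $-(a_y(t)\,u'(t))' = f(t)$ on $(0,1)$ with homogeneous Dirichlet boundary conditions, discretizes it by finite differences, and generates samples $(y, \mathbf{u}_y^h)$ of the parametric map; such pairs would then serve as training data for a deep neural network approximating this map. The equation, discretization, and parametrization are illustrative stand-ins for the high-fidelity discretizations considered in the literature.

```python
import numpy as np

rng = np.random.default_rng(5)

D = 100                                     # dimension of the discretized solution space U^h
p = 4                                       # dimension of the parameter space Y, a subset of [-1, 1]^p
h = 1.0 / (D + 1)
t_mid = (np.arange(D + 1) + 0.5) * h        # midpoints, where the diffusion coefficient is evaluated
f_rhs = np.ones(D)                          # right-hand side f = 1 (illustrative)

def diffusion_coefficient(y):
    # Affine-in-parameter coefficient a_y(t) = 2 + sum_j y_j sin(j pi t) / j^2, strictly positive on [-1, 1]^p.
    return 2.0 + sum(y[j] * np.sin((j + 1) * np.pi * t_mid) / (j + 1) ** 2 for j in range(p))

def parametric_map(y):
    """The parametric map y -> u_y^h: assemble and solve the finite-difference system for the parameter y."""
    a = diffusion_coefficient(y)
    A = (np.diag(a[:-1] + a[1:]) - np.diag(a[1:-1], 1) - np.diag(a[1:-1], -1)) / h**2
    return np.linalg.solve(A, f_rhs)        # coefficient vector u_y^h in R^D

# Sampled pairs (y^(i), u_{y^(i)}^h): training data for a network approximating the parametric map.
samples = [(y, parametric_map(y)) for y in rng.uniform(-1, 1, size=(200, p))]
print(len(samples), samples[0][1].shape)    # 200 pairs, each solution a vector in R^100
```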
Acknowledgments
This research was partly supported by the Bavarian High-Tech Agenda, DFG-SFB/TR 109 Grant C09, DFG-SPP
1798 Grant KU 1446/21-2, DFG-SPP 2298 Grant KU 1446/32-1, and NSF-Simons Research Collaboration on
the Mathematical and Scientific Foundations of Deep Learning (MoDL) (NSF DMS 2031985).
The author would like to thank Hector Andrade Loarca, Adalbert Fono, Stefan Kolek, Yunseok Lee, Philipp
Scholl, Mariia Seleznova, and Laura Thesing for their helpful feedback on an early version of this article.
References
[1] J. Adler and O. Öktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse
Probl. 33 (2017), 124007.
[2] H. Andrade-Loarca, G. Kutyniok, O. Öktem, and P. Petersen. Extraction of digital wavefront sets using
applied harmonic analysis and deep neural networks. SIAM J. Imaging Sci. 12 (2019), 1936–1966.
[3] H. Andrade-Loarca, G. Kutyniok, O. Öktem, and P. Petersen. Deep Microlocal Reconstruction for Limited-
Angle Tomography. (2021), arXiv:2108.05732.
[4] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations
for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (2015), e0130140.
[5] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proc. Natl. Acad. Sci. USA 116 (2019), 15849–15854.
[6] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The Modern Mathematics of Deep Learning. In: Mathematical Aspects of Deep Learning, Cambridge University Press, to appear.
[7] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal Approximation with Sparsely Connected Deep
Neural Networks. SIAM J. Math. Data Sci. 1 (2019), 8–45.
[8] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2 (1989), 303–314.
[9] D. Donoho. Sparse components of images and optimal atomic decompositions. Constr. Approx. 17 (2001),
353–382.
[10] W. E and B. Yu. The Deep Ritz Method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat. 6 (2018), 1–12.
[11] M. Geist, P. Petersen, M. Raslan, R. Schneider, and G. Kutyniok. Numerical Solution of the Parametric
Diffusion Equation by Deep Neural Networks. J. Sci. Comput. 88 (2021), Article number: 22.
[12] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning.
Proc. Natl. Acad. Sci. USA 115 (2018), 8505–8510.
[13] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Netw. 2 (1989), 359–366.
[14] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser. Deep convolutional neural network for inverse problems
in imaging, IEEE Trans. Image Process. 26 (2017), 4509–4522.
[15] S. Kolek, D. A. Nguyen, R. Levie, J. Bruna, and G. Kutyniok. A rate-distortion framework for explaining black-box model decisions. In: Springer LNAI Volume: xxAI - Beyond Explainable AI, to appear.
[16] G. Kutyniok and D. Labate. Introduction to Shearlets. In: Shearlets: Multiscale Analysis for Multivariate
Data, 1–38, Birkhäuser Boston, 2012.
[17] G. Kutyniok and W.-Q Lim. Compactly supported shearlets are optimally sparse. J. Approx. Theory 163
(2011), 1564–1589.
[18] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A Theoretical Analysis of Deep Neural Networks and Parametric PDEs. Constr. Approx. 55 (2022), 73–125.
[19] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In: NeurIPS (2017), 4768–4777.
[20] S. Lunz, O. Öktem, and C.-B. Schönlieb. Adversarial regularizers in inverse problems. In: NIPS (2018),
8507–8516.
[21] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural
networks, In: ICML (2017), 2847–2854.
[22] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378 (2019), 686–707.
[23] R. Reisenhofer, J. Kiefer, and E. J. King. Shearlet-based detection of flame fronts. Exp. Fluids 57 (2015),
11.
[24] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any
classifier. In: ACM SIGKDD (2016), 1135–1144.
[25] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist. 22 (1951), 400–407.
[26] Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising. SIAM J.
Imaging Sci. 10 (2017), 1804–1844.
[30] S. Wäldchen, J. Macdonald, S. Hauch, and G. Kutyniok. The computational complexity of understanding network decisions. J. Artif. Intell. Res. 70 (2021), 351–387.
[31] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw. 94 (2017), 103–114.