Most of our Monte Carlo problems involve estimating expectations and these
can often be written as integrals. Sometimes it pays to treat the problems
as integrals and apply non-random numerical integration methods. In other
settings we may be able to combine Monte Carlo and other methods into hybrid
estimators. For instance, a nearly exact numerical integral of a problem related
to our own may be used as a control variate, as described in §8.9. This chapter
is specialized and may be skipped on first reading.
There is a large literature on numerical integration. Here, we look at a few
of the main ideas. Of special importance are the midpoint rule and Simpson’s
rule, for integrating over a finite interval [a, b]. They are simple to use and bring
enormous improvements for smooth functions, and extend well to small dimen-
sions d. We will also see how the advantage of classical quadrature methods
decays rapidly with increasing dimension. This phenomenon is a manifestation
of Bellman’s ‘curse of dimensionality’, with Monte Carlo versions in two classic
theorems of Bakhvalov. This curse is about worst case results and sometimes
we are presented with problems that are more favorable than the worst case.
We consider some results on ‘weighted spaces’ from which the worst functions
are excluded. We also study ‘sparse grids’ in which the number of evaluation
points required grows very slowly with dimension.
The connection between Monte Carlo and integration problems is as follows.
Suppose that our goal is to estimate $E(g(y))$ where $y \in \mathbb{R}^s$ has probability
density function p. We find a transformation function ψ(·), using methods like
those in Chapter 5, such that $y = \psi(x) \sim p$ when $x \sim U(0,1)^d$. Then
$$E(g(y)) = \int_{\mathbb{R}^s} g(y)\,p(y)\,dy = \int_{(0,1)^d} g(\psi(x))\,dx = \int_{(0,1)^d} f(x)\,dx,$$
where f (·) = g(ψ(·)). As a result our Monte Carlo problem can be transformed
into a d-dimensional quadrature. We don’t always have d = s. This method does
not work when acceptance-rejection sampling is included in the way we generate
y, because there is no a priori bound on the number of uniform random variables
that we would need.
Since we're computing integrals and not necessarily expectations, we use the
symbol I for the quantity of interest. For instance, with $I = \int_a^b f(x)\,dx$ we have
$I = (b-a)\mu$ where $\mu = E(f(x))$ for $x \sim U[a,b]$.
When f has a simple closed form, there is always the possibility that I can
be found symbolically. Tools such as Mathematica™, Maple™, and sage can
solve many integration problems. When symbolic computation cannot solve the
problem then we might turn to numerical methods instead.
Numerical integration is variously called quadrature or cubature. Some au-
thors reserve quadrature for the case where y ∈ R because the integral is the
limit of a sum of quadrilateral areas (rectangles or trapezoids). They then use
cubature for more general input dimensions. Hypercubature might be even more
appropriate, especially for d > 3, but that term is seldom used.
Theorem 7.1. Let f(x) be a real-valued function on [a, b] for $-\infty < a < b < \infty$.
Assume that the second derivative $f''(x)$ is continuous on [a, b]. Let
$t_i = a + (b-a)(i - 1/2)/n$ for $i = 1, \dots, n$. Then
$$\Bigl|\int_a^b f(x)\,dx - \frac{b-a}{n}\sum_{i=1}^n f(t_i)\Bigr| \le \frac{(b-a)^3}{24n^2}\max_{a\le z\le b}|f''(z)|.$$
$$\begin{aligned}
|\hat I - I| &= \Bigl|\int_a^b f(x)\,dx - \frac{b-a}{n}\sum_{i=1}^n f(t_i)\Bigr|
= \Bigl|\sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x) - f(t_i)\,dx\Bigr| \\
&= \Bigl|\sum_{i=1}^n \int_{x_{i-1}}^{x_i} f'(t_i)(x - t_i) + \tfrac12 f''(z(x))(x - t_i)^2\,dx\Bigr|
= \Bigl|\frac12 \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f''(z(x))(x - t_i)^2\,dx\Bigr|,
\end{aligned}$$
where the last equality uses the fact that $\int_{x_{i-1}}^{x_i}(x - t_i)\,dx = 0$ over each panel, so the
first order terms vanish from an error cancellation. This kind of cancellation plays a big role in the
development of classical quadrature methods.
Another fact about these two methods is worth mentioning: the bracketing
inequality. Suppose that we know $f''(x) \ge 0$ for all $x \in [a,b]$. Then
$I - \hat I = f''(\hat z)(b-a)^3/(24n^2) \ge 0$ and $I - \tilde I = -f''(\tilde z)(b-a)^3/(12n^2) \le 0$, and therefore
$$\hat I \le I \le \tilde I. \tag{7.4}$$
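The bracketing inequality (7.4) is easy to check numerically. Below is a minimal Python sketch (the integrand exp and the panel counts are illustrative choices, not taken from the text) that computes the compound midpoint and trapezoid estimates and shows both the bracket and the $O(n^{-2})$ error decay.

import numpy as np

def midpoint(f, a, b, n):
    # Compound midpoint rule: evaluate f at the panel centers t_i = a + (b-a)(i-1/2)/n.
    t = a + (b - a) * (np.arange(1, n + 1) - 0.5) / n
    return (b - a) / n * np.sum(f(t))

def trapezoid(f, a, b, n):
    # Compound trapezoid rule on the same n panels (n+1 equispaced points).
    x = np.linspace(a, b, n + 1)
    return (b - a) / n * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

f = np.exp                # f'' = exp > 0 on [0, 1], so the bracket (7.4) applies
a, b = 0.0, 1.0
I = np.e - 1.0            # exact value of the integral of exp over [0, 1]
for n in (4, 8, 16, 32):
    lo, hi = midpoint(f, a, b, n), trapezoid(f, a, b, n)
    print(n, lo <= I <= hi, I - lo, hi - I)   # bracket holds; both gaps shrink like 1/n^2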
$$-\frac{(b-a)^4}{180\,n^4}\,f^{(4)}(z), \quad\text{for some } z \in (a,b).$$
For closed rules we do evaluate f at the end-points of the panel. The trapezoid
rule and Simpson’s rule are both closed. Closed rules have the advantage that
some function evaluations get reused when we increase n. Open rules have
a perhaps greater advantage that they avoid the ends of the interval where
singularities often appear.
Newton-Cotes
The trapezoid rule and Simpson’s rule use m = 2 and m = 3 points respectively
within each panel. In general, one can use m points to integrate polynomials
of degree m − 1, to yield the Newton-Cotes formulas, of which the trapezoid
rule and Simpson’s rule are special cases. The Newton-Cotes rule for m = 4 is
another of Simpson’s rules, called Simpson’s 3/8 rule. Newton-Cotes rules of
odd order have the advantage that, by symmetry, they also correctly integrate
polynomials of degree m, as we saw already in the case of Simpson’s rule.
The next two odd order rules are
$$\int_0^{4h} f(x)\,dx \doteq \frac{h}{45}\bigl(14f_0 + 64f_1 + 24f_2 + 64f_3 + 14f_4\bigr), \quad\text{and} \tag{7.7}$$
$$\int_0^{6h} f(x)\,dx \doteq \frac{h}{140}\bigl(41f_0 + 216f_1 + 27f_2 + 272f_3 + 27f_4 + 216f_5 + 41f_6\bigr).$$
These are the 5 and 7 point closed Newton-Cotes formulas. Equation (7.7) is
known as Boole’s rule.
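As a quick numerical check on (7.7), the sketch below (our own, with an arbitrary choice of h) applies Boole's rule on a single panel [0, 4h] and confirms that it integrates monomials exactly up to degree 5, as expected for an odd order five point rule, but not degree 6.

import numpy as np

def boole_panel(f, h):
    # Five point closed Newton-Cotes (Boole's) rule on [0, 4h], equation (7.7).
    x = h * np.arange(5)
    w = np.array([14.0, 64.0, 24.0, 64.0, 14.0]) / 45.0
    return h * np.dot(w, f(x))

h = 0.3
for p in range(7):
    exact = (4 * h) ** (p + 1) / (p + 1)       # integral of x^p over [0, 4h]
    approx = boole_panel(lambda x: x ** p, h)
    print(p, abs(approx - exact))              # essentially zero for p <= 5, nonzero for p = 6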
High order rules should be used with caution. They exploit high order
smoothness in the integrand, but can give poor outcomes when the integrand is
not as smooth as they require. In particular, if a genuinely smooth quantity has
some mild nonsmoothness in its numerical implementation f , then high order
integration rules can behave very badly, magnifying this numerical noise.
As a further caution, Davis and Rabinowitz (1984) note that taking f fixed
and letting the order m in a Newton-Cotes formula increase does not always
converge to the right answer even for f with infinitely many derivatives. Lower
order rules applied in panels are more robust.
The Newton-Cotes rules can be made into compound rules similarly to the
way Simpson’s rule was compounded. When the basic method integrates poly-
nomials of degree r exactly within panels, then the compound method has error
O(n−r ), assuming that f (r) is continuous on [a, b].
As noted above, open rules are valuable because they avoid the endpoints
where the function may be singular. Here are a few open rules:
$$\int_0^{2h} f(x)\,dx = 2hf(h) + \frac{h^3}{3}f''(z),$$
$$\int_0^{4h} f(x)\,dx = \frac{4h}{3}\bigl(2f(h) - f(2h) + 2f(3h)\bigr) + \frac{14h^5}{45}f^{(4)}(z), \quad\text{and}$$
$$\int_0^{5h} f(x)\,dx = \frac{5h}{24}\bigl(11f(h) + f(2h) + f(3h) + 11f(4h)\bigr) + \frac{95h^5}{144}f^{(4)}(z).$$
w(x)                                     Rule
$1_{|x|\le 1}$                           Gauss-Legendre
$\exp(-x^2)$                             Gauss-Hermite
$1_{0\le x<\infty}\,\exp(-x)$            Gauss-Laguerre
$1_{|x|<1}\,(1-x^2)^{-1/2}$              Gauss-Chebyshev (1st kind)
$1_{|x|\le 1}\,(1-x^2)^{1/2}$            Gauss-Chebyshev (2nd kind)

Table 7.1: This table lists some weight functions and the corresponding families
of Gauss quadrature rules.
In each case the point z is inside the interval of integration, and the error term
assumes that the indicated derivative is continuous. The first one is simply the
midpoint rule after a change of variable to integrate over [0, 2h]. The next two
are from Davis and Rabinowitz (1984, Chapter 2.6). They both have the same
order. The last one avoids negative weights but requires an extra point.
Gauss rules
The rules considered above evaluate f at equispaced points. A Gauss rule takes
the more general form
$$\hat I_G = \sum_{i=1}^m w_i f(x_i)$$
where both the $x_i$ and the $w_i$ can be chosen to attain high accuracy.
The basic panel for a Gauss rule is conventionally [−1, 1] or sometimes R,
and not [0, h] as we used for Newton-Cotes rules. Also the target integration
problem is generally weighted. That is, we seek to approximate
$$\int_{-\infty}^{\infty} f(x)\,w(x)\,dx$$
for a weight function w(x) > 0. The widely used weight functions are multiples
of standard probability density functions, such as the uniform, gamma, Gaussian
and beta distributions; see Table 7.1. The idea is that having f be nearly a
polynomial can be much more appropriate than requiring the whole integrand
f (x)w(x) to be nearly a polynomial.
Choosing wi and xi together yields 2m parameters and it is then possible
to integrate polynomials of degree up to 2m − 1 without error. It follows that
if f (x) = p(x) where p is a polynomial with p(xi ) = f (xi ) for i = 1, . . . , m,
then IˆG (f ) = I(f ). Because of this, the Gauss rule is called an interpolatory
quadrature formula.
The error in a Gauss rule is
$$\frac{(b-a)^{2m+1}(m!)^4}{(2m+1)[(2m)!]^3}\, f^{(2m)}(z)$$
for some point z in the interval of integration.
m    $x_i$                                         $w_i$
1    0                                             2
2    $\pm\frac{1}{3}\sqrt{3}$                      1
3    0                                             8/9
     $\pm\frac{1}{5}\sqrt{15}$                     5/9
4    $\pm\frac{1}{35}\sqrt{525 - 70\sqrt{30}}$     $\frac{18 + \sqrt{30}}{36}$
     $\pm\frac{1}{35}\sqrt{525 + 70\sqrt{30}}$     $\frac{18 - \sqrt{30}}{36}$
5    0                                             128/225
     $\pm\frac{1}{21}\sqrt{245 - 14\sqrt{70}}$     $\frac{322 + 13\sqrt{70}}{900}$
     $\pm\frac{1}{21}\sqrt{245 + 14\sqrt{70}}$     $\frac{322 - 13\sqrt{70}}{900}$

Table 7.2: Gauss quadrature rules $\sum_{i=1}^m w_i f(x_i)$ to approximate
$\int_{-1}^{1} f(x)\,dx$, for m = 1, . . . , 5. From Weisstein, Eric W., "Legendre-
Gauss Quadrature." MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/Legendre-GaussQuadrature.html
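The nodes and weights of Table 7.2 are built into standard libraries, so there is rarely a need to type them in. The sketch below uses NumPy's numpy.polynomial.legendre.leggauss routine to reproduce the m = 3 row and to integrate two illustrative functions; the mapping to a general interval [a, b] is the usual affine change of variables.

import numpy as np

m = 3
x, w = np.polynomial.legendre.leggauss(m)    # nodes and weights for the integral over [-1, 1]
print(x)    # about [-0.7745967, 0, 0.7745967], i.e. 0 and +/- sqrt(15)/5 as in Table 7.2
print(w)    # about [0.5556, 0.8889, 0.5556], i.e. 5/9, 8/9, 5/9

# An m point Gauss rule integrates polynomials of degree 2m - 1 = 5 without error.
print(np.dot(w, x ** 4), 2.0 / 5.0)          # both equal 0.4

# For [a, b], map t = (a + b)/2 + (b - a) x / 2 and scale the weights by (b - a)/2.
a, b = 0.0, np.pi
t = 0.5 * (a + b) + 0.5 * (b - a) * x
print(0.5 * (b - a) * np.dot(w, np.sin(t)))  # close to 2, the integral of sin over [0, pi]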
Clenshaw-Curtis rules
Gauss rules for large n are expensive to construct. Also, the Gauss rules are
not generally nested, meaning that the points xi used at one value of m are not
necessarily used again for a larger value of m. Clenshaw-Curtis rules address
both of these difficulties. This second point makes them very well suited to
sparse grid methods for multidimensional integration, that we consider in §7.8.
As for Gauss rules, we consider integration over [−1, 1] and approximate the
integral I by
$$\hat I_{CC} = \sum_{i=1}^m w_i f(x_i).$$
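The Clenshaw-Curtis nodes are the Chebyshev extreme points $x_i = \cos(i\pi/(m-1))$, and the weights can be found by requiring exactness for the Chebyshev polynomials, which satisfy $T_j(\cos\theta) = \cos(j\theta)$. The sketch below is our own small-m construction via a linear solve; production code would use the usual FFT-based formulas instead.

import numpy as np

def clenshaw_curtis(m):
    # m point Clenshaw-Curtis rule on [-1, 1] (m >= 2).
    theta = np.pi * np.arange(m) / (m - 1)
    x = np.cos(theta)                          # Chebyshev extreme points
    # Exactness on T_0, ..., T_{m-1}: sum_i w_i cos(j theta_i) = integral of T_j over [-1, 1].
    A = np.cos(np.outer(np.arange(m), theta))
    moments = np.array([0.0 if j % 2 else 2.0 / (1.0 - j * j) for j in range(m)])
    w = np.linalg.solve(A, moments)
    return x, w

x9, w9 = clenshaw_curtis(9)
print(np.dot(w9, np.exp(x9)), np.e - 1.0 / np.e)        # both about 2.3504 for this smooth integrand
x17, _ = clenshaw_curtis(17)
print(set(np.round(x9, 12)) <= set(np.round(x17, 12)))  # True: the 9 nodes reappear among the 17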
To see informally what goes wrong, let f(x, y) be a function on $[0,1]^2$. Now
write
$$I = \int_0^1\!\!\int_0^1 f(x,y)\,dx\,dy = \int_0^1 I(y)\,dy$$
where $I(y) = \int_0^1 f(x,y)\,dx$. Suppose that we use the same m point integration
rule with weight $u_i$ on point $x_i$, for any value of y, getting
$$\hat I(y) = \sum_{i=1}^m u_i f(x_i, y) = I(y) + E_1(y),$$
where $|E_1(y)| \le Am^{-r}$ holds for some $A < \infty$ and all $0 \le y \le 1$. The value of
r depends on the method we use and how smooth f is. Next we use an n point
rule to average over y, getting
$$\hat I = \sum_{j=1}^n v_j \hat I(y_j) = \sum_{j=1}^n v_j\bigl(I(y_j) + E_1(y_j)\bigr) = \int_0^1 I(y)\,dy + E_2 + \sum_{j=1}^n v_j E_1(y_j),$$
where $|E_2| \le Bn^{-s}$ for some $B < \infty$ and s depending on the outer integration
rule and on how smooth I(y) is.
The total error is a weighted sum of the errors $E_1$ at different points $y_j$ plus the
error $E_2$. We suppose that the weighted sum of the $E_1(y_j)$ is $O(m^{-r})$. This happens
if $\sum_{j=1}^n |v_j| < C$ holds for all n, because we assumed that $|E_1(y)| \le Am^{-r}$ holds
with the same $A < \infty$ for all $y \in [0,1]$.
The result is that $|\hat I - I| = O(m^{-r} + n^{-s})$. In other words, the convergence
rate that we get is like the worst of the two one-dimensional rules that we
combine.
More generally, if we are using a d-dimensional grid of points and a product
rule
$$\hat I = \sum_{i_1=1}^{n_1}\sum_{i_2=1}^{n_2}\cdots\sum_{i_d=1}^{n_d}\,\prod_{j=1}^{d} w_{j i_j}\, f(x_{i_1}, x_{i_2}, \dots, x_{i_d}),$$
we cannot expect the result to be better than what we would get from the worst
of the rules we have used. Suppose that rule j is the least accurate. Then $\hat I$
could hardly be better than
$$\sum_{i=1}^{n_j} w_{ji}\, I_j(x_{ji})$$
where $I_j(x_j)$ is the (exact) integral of f over all variables other than $x_j$.
Getting the worst one-dimensional rate leads to a curse of dimensionality.
Suppose that we use the same n point one-dimensional quadrature rule on each
of d dimensions. As a result we use $N = n^d$ function evaluations. If the one-
dimensional rule has error $O(n^{-r})$, then the combined rule has error
$$|\hat I - I| = O(n^{-r}) = O(N^{-r/d}).$$
Even modestly large d will give a bad result. The value r is the smaller of the
number of continuous derivatives f has and the number that the quadrature
rule is able to exploit. Taking $r \gg d$ won't help in practice, because high order
rules get very cumbersome and many of them are prone to roundoff errors.
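A small sketch (our own toy example) of a product rule built from the one dimensional n point midpoint rule. The cost $N = n^d$ grows geometrically with d even though the per-axis accuracy is unchanged, which is the curse described above.

import numpy as np
from itertools import product

def product_midpoint(f, n, d):
    # Tensor product of the one dimensional n point midpoint rule over [0, 1]^d.
    t = (np.arange(1, n + 1) - 0.5) / n            # one dimensional midpoint nodes
    total = 0.0
    for idx in product(range(n), repeat=d):        # N = n**d evaluation points
        total += f(t[list(idx)])
    return total / n ** d

f = lambda x: np.prod(np.cos(x))                   # exact integral over [0, 1]^d is sin(1)^d
n = 8
for d in (1, 2, 3, 4):
    err = abs(product_midpoint(f, n, d) - np.sin(1.0) ** d)
    print(d, n ** d, err)                          # N explodes with d; error stays O(n^-2) per axis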
This curse of dimensionality is not confined to sampling on grids formed as
products of one-dimensional rules. Any quadrature rule in high dimensions will
suffer from the same problem. Two important theorems of Bakhvalov, below,
make the point.
Theorem 7.2 (Bakhvalov I). For $0 < M < \infty$ and integer $r \ge 1$, let
$$C^r_M = \Bigl\{ f : [0,1]^d \to \mathbb{R} \;\Bigm|\; \Bigl|\frac{\partial^r f(x)}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\Bigr| \le M,\ \forall\,\alpha_j \ge 0 \text{ with } \sum_{j=1}^d \alpha_j = r\Bigr\}.$$
Then there exists $k > 0$ such that for any $x_1,\dots,x_n \in [0,1]^d$ and any $w_1,\dots,w_n \in \mathbb{R}$,
there is an $f \in C^r_M$ with
$$\Bigl|\sum_{i=1}^n w_i f(x_i) - \int_{[0,1]^d} f(x)\,dx\Bigr| \ge kn^{-r/d}.$$
let $w_1,\dots,w_n \in \mathbb{R}$. Then there exists $k > 0$ such that some function $f \in C^r_M$
satisfies
$$\biggl(\mathbb{E}\Bigl(\sum_{i=1}^n w_i f(x_i) - \int_{[0,1]^d} f(x)\,dx\Bigr)^2\biggr)^{1/2} \ge kn^{-1/2-r/d}.$$
In Theorem 7.3, the worst case function f is chosen knowing how we will
sample $x_i$, but not knowing the resulting values $x_i$ that we will actually use.
Here we see an RMSE of at least $O(n^{-1/2-r/d})$, which does not contradict the
MC rate. There is a curse of dimension only in the extent to which we can
improve on MC, namely a factor proportional to $n^{-r/d}$.
Our hybrid has a curse of dimension driven by the size of s and it has variance
$\sigma^2/n$ where
$$\sigma^2 = \mathrm{Var}\Bigl(\sum_{j=1}^m w_j f(x_j, y)\Bigr) = \sum_{j=1}^m\sum_{j'=1}^m w_j w_{j'}\,\mathrm{Cov}\bigl(f(x_j,y), f(x_{j'},y)\bigr).$$
We might expect this hybrid to be useful when s is not too large and f (x, y) is
very smooth in x. Then the inner sum in (7.8) is well handled by quadrature.
If additionally, f has large variance in x at any fixed value for y, then our
quadrature may be much better than using Monte Carlo on both x and y.
If we could integrate out x in closed form then we could use the estimate
$\hat I = (1/n)\sum_{i=1}^n h(y_i)$ where $h(y) = E(f(x,y)\mid y)$ for $x \sim U(0,1)^d$. This is
the method called conditioning in §8.7. The hybrid (7.8) is conditioning with a
numerical approximation to h. The term preintegration is also used.
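Here is a minimal sketch of such a hybrid with one smooth coordinate handled by quadrature: an m point Gauss rule (mapped to [0, 1]) approximates h(y) for each Monte Carlo draw of y. The integrand, m and n are illustrative choices only.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x, y: np.exp(x * y)               # smooth in x for every fixed y

m = 8                                        # Gauss-Legendre rule mapped from [-1, 1] to [0, 1]
z, u = np.polynomial.legendre.leggauss(m)
xq, wq = 0.5 * (z + 1.0), 0.5 * u

n = 10_000
y = rng.random(n)                            # Monte Carlo samples y ~ U(0, 1)
h = np.array([np.dot(wq, f(xq, yi)) for yi in y])   # quadrature approximation to h(y) = E(f(x, y) | y)
print(h.mean(), h.std(ddof=1) / np.sqrt(n))  # estimate and its Monte Carlo standard error
# For reference, int_0^1 int_0^1 exp(xy) dx dy = int_0^1 (exp(y) - 1)/y dy, about 1.3179.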
Hybrids of Monte Carlo and quasi-Monte Carlo methods are often used. See
Chapter 17. They take the form $\hat I = (1/n)\sum_{i=1}^n f(x_i, y_i)$ for a quadrature rule
with $x_i \in [0,1]^s$ and Monte Carlo samples $y_i \sim U(0,1)^d$.
where $|\Sigma|$ is the determinant of Σ, using the known normalization of the $N(x^*, \Sigma)$
distribution. Note that $|\Sigma| = |{-H^{-1}}|$ which is the absolute value of $1/|H|$.
The Laplace approximation is very accurate when log(f (x)) is smooth and
the quadratic approximation is good where f is not negligible. Such a phe-
nomenon often happens when x is a statistical parameter subject to the central
limit theorem, f (x) is its posterior distribution, and the sample size is large
enough for the CLT to apply.
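A minimal numerical sketch of the Laplace approximation $\int f(x)\,dx \approx f(x^*)(2\pi)^{d/2}|\Sigma|^{1/2}$ with $\Sigma = -H^{-1}$, where $x^*$ maximizes $\log f$ and H is the Hessian of $\log f$ there. The toy integrand, the finite difference Hessian and the optimizer settings are all our own choices; a real application would supply its own log density and, ideally, analytic derivatives.

import numpy as np
from scipy.optimize import minimize

def log_f(x):
    # Toy example: a smooth, unimodal, non-Gaussian integrand on R^2.
    return -np.sum(x ** 2) - 0.1 * np.sum(x ** 4)

def hessian_fd(g, x, eps=1e-4):
    # Central finite difference Hessian of the scalar function g at x.
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                       - g(x - ei + ej) + g(x - ei - ej)) / (4.0 * eps ** 2)
    return H

d = 2
x_star = minimize(lambda x: -log_f(x), np.zeros(d)).x   # mode of f
H = hessian_fd(log_f, x_star)                           # Hessian of log f at the mode
Sigma = -np.linalg.inv(H)
approx = np.exp(log_f(x_star)) * (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
print(approx)   # compare with a brute force grid or Monte Carlo estimate of the same integral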
If we want the integral of g(x)f (x) for smooth g, then we may multiply (7.9)
by g(x∗ ). The following theorem gives the details:
holds as λ → ∞, if
i) $A \subseteq \mathbb{R}^d$ is open,
ii) $\int_A |g(x)|e^{-\lambda h(x)}\,dx < \infty$ for all $\lambda \ge \lambda_0$, for some $\lambda_0 < \infty$,
Proof. This is Theorem 4.14 of Evans and Swartz (2000) which is based on a
result in Wong (1989).
The value $\hat\theta = \arg\max_\theta \ell(\theta)$ is the maximum likelihood estimate of θ. When
$\hat\theta$ is interior to Θ and ℓ is smooth, then we have the Taylor approximation
$\ell(\theta) \doteq \ell(\hat\theta) - \tfrac12(\theta-\hat\theta)^{\mathsf T}H(\hat\theta)(\theta-\hat\theta)$ where
$$H(\theta) = -\sum_{i=1}^n \frac{\partial^2\log(p(x_i;\theta))}{\partial\theta\,\partial\theta^{\mathsf T}} \equiv n\hat I(\theta).$$
The quantity $\hat I(\hat\theta)$ is known as the empirical Fisher information for θ. Because
H is a sum of n independent and identically distributed terms, we find by the
law of large numbers that $H(\theta) \approx -n\,E\bigl(\partial^2\log(p(x;\theta))/\partial\theta\,\partial\theta^{\mathsf T}\bigr) \equiv nI(\theta)$. This
I(θ) is the ordinary (non-empirical) Fisher information for θ. In asymptotic
expansions, it is most convenient to work with $nI(\hat\theta)$, but to present the method
we work with $H(\hat\theta)$.
The best predictor of g(θ), with respect to squared error, is the posterior mean
$$E(g(\theta)\mid\mathcal{X}) = \frac{\int g(\theta)\pi_0(\theta)e^{\ell(\theta)}\,d\theta}{\int \pi_0(\theta)e^{\ell(\theta)}\,d\theta},$$
a ratio of integrals that we can write in the generic form
$$\frac{\int g_N(\theta)e^{-H_N(\theta)}\,d\theta}{\int g_D(\theta)e^{-H_D(\theta)}\,d\theta} \tag{7.10}$$
where, for example, $g_N(\theta)$ is either 1 or g(θ) or $g(\theta)\pi_0(\theta)$ with $H_N(\theta)$ adjusted
accordingly, and similar choices are available for the denominator. We investi-
gate several specific formulations of (7.10) below.
If we take $\hat\theta_N$ and $\hat\theta_D$ to be the minimizers of $H_N$ and $H_D$ respectively, then
a straightforward use of the Laplace approximation yields
$$\widehat E(g(\theta)\mid\mathcal{X}) = \frac{g_N(\hat\theta_N)\,|\bar H_N(\hat\theta_N)|^{-1/2}\,e^{-H_N(\hat\theta_N)}}{g_D(\hat\theta_D)\,|\bar H_D(\hat\theta_D)|^{-1/2}\,e^{-H_D(\hat\theta_D)}}, \tag{7.11}$$
where $\bar H_N$ and $\bar H_D$ are the Hessian matrices of $H_N$ and $H_D$ respectively.
In the standard form, where $H_N = H_D$, the Hessian and exponential factors cancel and (7.11)
reduces to
$$\widehat E(g(\theta)\mid\mathcal{X}) = \frac{g_N(\hat\theta_N)}{g_D(\hat\theta_D)} = \frac{g_N(\hat\theta_N)}{g_D(\hat\theta_N)} \tag{7.12}$$
because $\hat\theta_D = \hat\theta_N$ in the standard form. If we take $H_N(\theta) = H_D(\theta) = -\ell(\theta)$, so
that $g_N(\theta) = g(\theta)\pi_0(\theta)$ and $g_D(\theta) = \pi_0(\theta)$, we obtain $\widehat E(g(\theta)\mid\mathcal{X}) = g(\hat\theta)$ where
$\hat\theta$ is the maximum likelihood estimator of θ. If instead, we take $H_N(\theta) = H_D(\theta) = -\ell(\theta) -
\log(\pi_0(\theta))$, with $g_N(\theta) = g(\theta)$ and $g_D(\theta) = 1$, then we obtain $\widehat E(g(\theta)\mid\mathcal{X}) = g(\tilde\theta)$,
where $\tilde\theta$ is the maximum a posteriori (MAP) estimate of θ.
The Laplace approximation is in fully exponential form when $g_N(\theta) =
g_D(\theta)$. For example, we might have $g_N(\theta) = g_D(\theta) = 1$ with $H_N(\theta) = -\ell(\theta) -
\log(\pi_0(\theta)) - \log(g(\theta))$ and $H_D(\theta) = -\ell(\theta) - \log(\pi_0(\theta))$. The fully exponential form requires
$g(\theta) > 0$. In the fully exponential form, the estimate (7.11) becomes
MCMC, by contrast, is designed to find and sample from multiple modes, although on some
problems it will have difficulty doing so. The Laplace approximation also re-
quires finding the optimum of a d-dimensional function and working with the
Hessian at the mode. In some settings that optimization may be difficult, and
when d is extremely large, then finding the determinant of the Hessian can be
a challenge. Finally, posterior distributions that are discrete or are mixtures of
continuous and discrete parts can be handled by MCMC but are not suitable
for the Laplace approximation.
The Laplace approximation is not completely superseded by MCMC. In par-
ticular, the fully exponential version is very accurate for problems with modest
dimension d and large n. When the optimization problem is readily solved,
it may provide a much more automatic answer than MCMC does, and it is
much faster.
The integrated nested Laplace approximation of Rue et al. (2009) is an al-
ternative to MCMC for large scale problems. It uses a weighted sum of Laplace
approximations. If the variables θ have components (η, φ) then it uses a Laplace
approximation to integrate over φ given η summed over a weighted set of val-
ues ηk for k = 1, . . . , K. It is thus a numerical integral over η values of a
Laplace approximation over φ given η. The statistical context behind η and
φ can be complex. See Palacios and Minin (2012) for an example in phylodynamics,
estimating historical sizes of populations from present day genetic data.
See Martino and Riebler (2019) for an introductory account.
where the function fu (x) only depends on those xj for j ∈ u. The ANOVA
decomposition of Appendix A represents f this way, and there are other decom-
positions.
Using (7.14), we may bound the integration error over $[0,1]^d$ by
$$|\hat I - I| \le \sum_u \Bigl|\frac1n\sum_{i=1}^n f_u(x_i) - \int_{[0,1]^d} f_u(x)\,dx\Bigr|.$$
If some of the functions fu are always close to zero or are otherwise easy to
integrate, then they might not contribute much to Iˆ − I. In a weighted space
analysis, presented below, we introduce weights γu > 0 for each subset u of
inputs to f . Larger weights γu will allow fu to be a more complicated and
harder to integrate function, while smaller weights γu will force fu to be closer
to zero. Other things being equal, $\gamma_u$ is usually chosen to be a decreasing
function of the number of variables included in u (so $\gamma_{\{1,2,3\}} < \gamma_{\{1,2\}}$) and also
of the indices of those variables (so $\gamma_{\{3,4,5\}} < \gamma_{\{1,2,3\}}$). All $2^d$ values $\gamma_u$ together
are represented by $\gamma \in (0,\infty)^{2^d}$.
To shorten some expressions, we introduce the notation 1:d for {1, 2, . . . , d}. For
$x \in \mathbb{R}^d$ and $u \subseteq 1{:}d$ we let $x_u$ be made up of the entries $x_j$ with $j \in u$. We also use
$$\partial^u f(x) \equiv \frac{\partial^{|u|}}{\prod_{j\in u}\partial x_j}\, f(x)$$
for the mixed partial derivative taken once with respect to each $x_j$ with $j \in u$. The weighted
space of integrands is the ball
$$F_\gamma = \{f \mid \|f\|_\gamma \le 1\}, \quad\text{where}\quad \|f\|^2_\gamma = \sum_{u\subseteq 1:d}\frac{1}{\gamma_u}\int_{[0,1]^{|u|}}\biggl(\int_{[0,1]^{d-|u|}}\partial^u f(x)\,dx_{-u}\biggr)^2 dx_u. \tag{7.15}$$
See Exercise 7.6. For example, the contributions of u = {j} and u = {j, k} are
$$\frac{1}{\gamma_{\{j\}}}\int_0^1\Bigl(\frac{\partial}{\partial x_j} f_{\{j\}}(x)\Bigr)^2 dx_j \quad\text{and}\quad \frac{1}{\gamma_{\{j,k\}}}\int_0^1\!\!\int_0^1\Bigl(\frac{\partial^2}{\partial x_j\,\partial x_k} f_{\{j,k\}}(x)\Bigr)^2 dx_j\,dx_k,$$
respectively. The first one takes a mean squared slope of the main effect $f_{\{j\}}$
for variable j and penalizes it by $1/\gamma_{\{j\}}$. The larger $\gamma_{\{j\}}$ is, the less that term
contributes to $\|f\|_\gamma$ and by that measure, the more important $x_j$ can be, without
putting f over the complexity budget $\|f\|_\gamma \le 1$. For u = {j, k}, it is the mean
squared second order mixed partial that is penalized by $1/\gamma_{\{j,k\}}$.
Recall that Bakhvalov's theorems 7.2 and 7.3 measure difficulty through
mixed partial derivatives. It is not just that f changes with $x_j$ that complicates
a problem, but also that the nature of this change varies with $x_k$ for $k \ne j$ and so
on for other indices. The criterion in (7.16) sharply constrains how complicated
$f_u$ can be when $\gamma_u$ is small. The smaller $\gamma_u$ is, the smaller $|\partial^u f_u|^2$ must be on
average, for f to fit into $F_\gamma$.
It is very common to study product weights with $\gamma_u = \prod_{j\in u}\gamma_j$ where $1 \ge
\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_d > 0$. Then sets u with larger cardinality are modeled as no
more important, and so are sets u containing larger indices j, as described above.
Popular choices include $\gamma_j = j^{-\eta}$ for a parameter η > 0. Some additional weight
choices are described in the end notes.
Tractability is a way to describe the integration problem not getting too
much harder as d → ∞. For product weights, we can consider d → ∞ by using
the first d values of γj . Letting X1:n denote all the points x1 , . . . , xn ∈ [0, 1]d ,
the worst case error in $F_\gamma$ is
$$e_{n,d}(X_{1:n}, F_\gamma) = \sup_{f\in F_\gamma}\Bigl|\frac1n\sum_{i=1}^n f(x_i) - \int_{[0,1]^d} f(x)\,dx\Bigr|.$$
If we had no points at all to use, then our best estimate is $\hat I = 0$ and we would
incur the initial error
$$e_{0,d}(F_\gamma) = \sup_{f\in F_\gamma}\int_{[0,1]^d} f(x)\,dx = \sup_{f\in F_\gamma}\Bigl|\int_{[0,1]^d} f(x)\,dx\Bigr|.$$
For some weighted spaces this minimal number grows rapidly with dimension.
Let $\Gamma_d = \sum_{j=1}^d \gamma_j$. They show that
This function has integral 0 over [0, 1]d . It has variance 1 for x ∼ U[0, 1]d , and
in that sense it is equally hard to integrate by plain Monte Carlo for any d.
From (7.15),
$$\|f\|_\gamma = \frac{12^d}{\gamma_{1:d}}. \tag{7.18}$$
See Exercise 7.8. With $\gamma_j = j^{-\eta}$, we get $\|f\|_\gamma = (d!)^\eta 12^d$. Now $f/c \in F_\gamma$ for
$c = (d!)^\eta 12^d$, and so the error in integrating f/c can be brought below $\varepsilon$ at a
cost of $n = C\varepsilon^{-p}$ function evaluations. The error in integrating f is then below
$$c\times\varepsilon = 12^d(d!)^\eta\varepsilon$$
at that n. Now suppose that we want to ensure that $|\hat I - I| = |\hat I| \le \epsilon$. Then we
require
$$\varepsilon \le \frac{\epsilon}{12^d(d!)^\eta} \quad\text{and}\quad n \ge C\epsilon^{-p}\,12^{dp}(d!)^{\eta p}.$$
In this example, larger d requires greater rescaling of $f_d$ to fit into $F_\gamma$, leading
to a larger initial error, a greater reduction $\varepsilon$ in the initial error to attain error
$\epsilon$, and finally a required sample size that grows rapidly with d. For functions
like $f_d$ that are not really dominated by low dimensional terms, the curse of
dimension can reappear within the initial error.
For instance, for integration over [0, 1], we might take $Q_1(f) = f(1/2)$ and then
for $k \ge 2$, use a trapezoid rule
$$Q_k(f) = \frac{1}{2^{k-1}}\Bigl(\frac12 f(0) + \sum_{i=1}^{2^{k-1}-1} f\bigl(i/2^{k-1}\bigr) + \frac12 f(1)\Bigr).$$
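A sketch of this nested sequence in code, together with the difference rules $\Delta_k = Q_k - Q_{k-1}$ (taking $\Delta_1 = Q_1$) that the sparse grid construction combines. The test integrand is illustrative; the check is that the levels telescope, so that $\sum_{k\le K}\Delta_k(f) = Q_K(f)$.

import numpy as np

def Q(f, k):
    # Level k rule on [0, 1]: midpoint for k = 1, trapezoid with 2^(k-1) + 1 points for k >= 2.
    if k == 1:
        return float(f(np.array([0.5]))[0])
    n = 2 ** (k - 1)
    x = np.arange(n + 1) / n
    w = np.full(n + 1, 1.0 / n)
    w[0] = w[-1] = 0.5 / n
    return float(np.dot(w, f(x)))

def Delta(f, k):
    # Difference rule combined in the sparse grid (Smolyak) construction.
    return Q(f, k) if k == 1 else Q(f, k) - Q(f, k - 1)

f = np.exp
K = 6
print(sum(Delta(f, k) for k in range(1, K + 1)), Q(f, K), np.e - 1.0)  # first two agree up to roundoff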
Figure 7.1: Each panel shows a sparse grid in [0, 1]2 , based on univariate
Clenshaw-Curtis rules. The corresponding quadrature rules are unequally
weighted sums of function values over these sparse grids of points.
Equation (7.20) approximates $\int_{[0,1]^d} f(x)\,dx$ using a weighted sum of function
evaluations on a grid of $\prod_{j=1}^d m_{k_j}$ points. We similarly define $\Delta_{k_1}\otimes\cdots\otimes\Delta_{k_d}$, for
example with d = 2. The coarsest sparse grid rule is
$$Q_{1,d}(f) = \Delta_{(1,\dots,1)}(f) = Q_1\otimes Q_1\otimes\cdots\otimes Q_1(f).$$
The number of points in the sparse grid satisfies
$$N_{\ell,d} \le \sum_{k:\,\|k\|_1\le\ell+d-1}\ \prod_{j=1}^d m_{k_j}$$
Table 7.3: Example sparse grid sizes for d = 8. The third column is from Table
3 of Gerstner and Griebel (1998). The fourth column is the approximate size of
a non-sparse grid. The final column divides an asymptotic estimate for $N_{\ell,8}$ by
its value.
and is generally less than that because the formula above counts some of the
input points multiple times. The widely used sparse grid points, such as those
based on Clenshaw-Curtis rules, have $m_k = O(2^k)$. Then $N_{\ell,d} = O(2^\ell\,\ell^{d-1})$,
compared to $m_\ell^d$ for the non-sparse grid (7.22) (Gerstner and Griebel, 1998).
Table 7.3 shows an example for d = 8. Lemma 1 of Müller-Gronbach (1998)
gives a sharper result, that
$$\lim_{\ell\to\infty}\frac{N_{\ell,d}}{2^{\ell-d}(\ell-1)^{d-1}/(d-1)!} = 1,$$
for fixed d, though this asymptotic equivalence may be slow to take hold.
The sparse grid estimate can be written
$$Q_{\ell,d}(f) = \sum_{i=1}^{N_{\ell,d}} W_i f(x_i)$$
where the sum is over all $N_{\ell,d}$ points $x_i$ (ordered somehow) in the sparse grid
and for each of them $W_i$ is the total of all the weights applied to $x_i$ from all of the
$\Delta_k$ that include it. The bookkeeping underlying the $W_i$ is somewhat complicated.
See Gerstner and Griebel (1998). Some of these $W_i$ can be negative, even
when all of the $Q_k$ have only non-negative $w_{i,k}$, owing to the differences $Q_k -
Q_{k-1}$ appearing in the formula for $Q_{\ell,d}$. While $\sum_i W_i = 1$, the appearance of
negative weights can potentially make $\sum_i |W_i|$ much larger than one, affecting
the numerical stability of the sparse grid estimate. Novak and Ritter (1997,
Lemma 2) show that $\sum_i |W_i| = O(\log(N_{\ell,d})^{d-1})$, even for non-nested sparse
grid rules. The implied coefficient of $\log(N_{\ell,d})^{d-1}$ depends on d. Gerstner and
Griebel (1998) describe a recursive computation of $Q_{\ell,d}$ that reduces roundoff
error.
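A small Smolyak-type sketch for d = 2, built from nested one dimensional trapezoid levels and summing the tensor products $\Delta_{k_1}\otimes\Delta_{k_2}$ over $k_1 + k_2 \le \ell + 1$. It is our own minimal implementation, storing signed node and weight lists for each $\Delta_k$ rather than using the more careful recursions of Gerstner and Griebel (1998).

import numpy as np
from itertools import product

def rule(k):
    # Level k one dimensional rule on [0, 1]: nodes and weights.
    if k == 1:
        return np.array([0.5]), np.array([1.0])
    n = 2 ** (k - 1)
    x = np.arange(n + 1) / n
    w = np.full(n + 1, 1.0 / n)
    w[0] = w[-1] = 0.5 / n
    return x, w

def delta(k):
    # Signed nodes and weights representing Delta_k = Q_k - Q_{k-1}.
    x, w = rule(k)
    if k == 1:
        return x, w
    x0, w0 = rule(k - 1)
    return np.concatenate([x, x0]), np.concatenate([w, -w0])

def smolyak_2d(f, level):
    # Sum of Delta_{k1} (tensor) Delta_{k2} over k1 + k2 <= level + 1.
    total = 0.0
    for k1 in range(1, level + 1):
        for k2 in range(1, level + 2 - k1):
            x1, w1 = delta(k1)
            x2, w2 = delta(k2)
            total += sum(a * b * f(u, v)
                         for (u, a), (v, b) in product(zip(x1, w1), zip(x2, w2)))
    return total

f = lambda u, v: np.exp(u + v)                       # exact integral over [0,1]^2 is (e - 1)^2
for level in (1, 2, 3, 4, 5):
    print(level, abs(smolyak_2d(f, level) - (np.e - 1.0) ** 2))   # error falls quickly for smooth f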
To describe the accuracy of sparse grids with the Clenshaw-Curtis quadra-
ture rules, we present results from Novak and Ritter (1997). For $\alpha = (\alpha_1,\dots,\alpha_d)$, let
$$f^{(\alpha)}(x) = \frac{\partial^{\|\alpha\|_1} f(x)}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}.$$
Now, consider these two sets of functions, for r > 1 and d > 1,
Novak and Ritter (1997) defined functions on $[-1,1]^d$ but that difference will
not affect the results we see here. The sets above are the unit balls in the
function classes that they define. For d > 1, the unit ball $F_d^r$ constrains more
partial derivatives of f than $C_d^r$ does, and so $F_d^r \subsetneq C_d^r$.
Let Q`,d be a sparse grid quadrature using the Clenshaw-Curtis rules and
let N = N`,d be the number of points in it. Novak and Ritter (1997) show that
there are finite constants cd,r such that
The results for $C_d^r$ are within logarithmic factors of the bounds in Bakhvalov's
theorem. Under the stricter conditions described by $F_d^r$, we see that a much
better rate can be attained. When using sparse grids with Clenshaw-Curtis, we
do not have to know the largest r for which f can be scaled to belong to $C_d^r$
or $F_d^r$. The error satisfies (7.26) for all of those r even though the method is
defined without regard to r.
The rate (7.26) does not set in until N is large enough to make use of r’th
derivative information in the sample. As N increases, ever larger values of r
become relevant. For fixed r the bound in (7.26) decreases linearly versus N on
a log-log plot. With the effective value of r increasing with N , we can get an
ever steepening concave plot of error versus N in a log-log plot. Holtz (2010,
Chapter 4.2) shows some examples of this phenomenon for very smooth f with
d < 10.
Gerstner and Griebel (1998) include several numerical examples of sparse
grid quadrature. In some of their examples, using the Clenshaw-Curtis rule
is much better than the trapezoid rule, while in others the results are close.
Similarly, in some of their examples a great improvement arises from using a
Gauss rule instead of Clenshaw-Curtis, despite the larger sample sizes required
from a non-nested base rule. In other examples the method based on Clenshaw-
Curtis is almost equally effective as using a Gauss rule.
The cost of sparse grids eventually grows rapidly with dimension. Suppose
that we want to include points that differ from 1/2 in s or more of their d
components. Then we must have $\ell \ge s + 1$. Estimating $N_{\ell,d}$ by $2^\ell\ell^{d-1}$ we see
that it will become very expensive to have large d with even modest ℓ. The
estimate $2^{\ell-d}(\ell-1)^{d-1}/(d-1)!$ from Müller-Gronbach (1998) is more favorable
but we must recall that it applies to ℓ → ∞ for fixed d, not d → ∞ for fixed ℓ.
Delayed basis sequences, introduced by Petras (2003), are a method to cope
with the high cost of sparse grids in high dimensions. Given a sequence of sample
sizes m1 , m2 , m3 , . . . , a delayed sequence might use sizes m̃k from a sequence
like
m1 , m2 , m2 , m3 , m3 , m3 , m4 · · · .
The resulting sizes m̃k increase slowly with k. When two consecutive rules Qk−1
and Qk are identical, then ∆k = 0 and need not be computed. Holtz (2010) has
several examples.
Adaptive integration
Press et al. (2007) present some adaptive quadrature methods. The idea is to
take a rule like the midpoint rule and use it with a higher density of points in
some subintervals than others. For example, subintervals where the function
varies more, such as having a larger |f 00 |, get more points while other intervals
get fewer points. The information on where the function varies most is gathered
along with the function values. They also present some multivariate adaptive
quadrature methods.
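A classic instance of the idea is recursive adaptive Simpson quadrature, which subdivides an interval only where a local error estimate is large. The sketch below is a generic textbook version with the usual factor-of-15 error heuristic, not the specific algorithms of Press et al. (2007); the square root example concentrates its points near 0, where f'' is singular.

import math

def simpson(f, a, b):
    # Basic three point Simpson estimate on [a, b].
    c = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(c) + f(b))

def adaptive_simpson(f, a, b, tol=1e-8):
    # Recurse only where splitting [a, b] changes the estimate by more than the tolerance.
    c = 0.5 * (a + b)
    whole = simpson(f, a, b)
    left, right = simpson(f, a, c), simpson(f, c, b)
    if abs(left + right - whole) < 15.0 * tol:
        return left + right + (left + right - whole) / 15.0   # Richardson-style correction
    return (adaptive_simpson(f, a, c, 0.5 * tol) +
            adaptive_simpson(f, c, b, 0.5 * tol))

print(adaptive_simpson(math.sqrt, 0.0, 1.0), 2.0 / 3.0)   # singular f'' at 0: points pile up there
print(adaptive_simpson(math.cos, 0.0, 1.0), math.sin(1.0))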
There is some mild controversy about the use of adaptive methods. There are
theoretical results showing that adaptive methods cannot improve significantly
over non-adaptive ones. There are also theoretical and empirical results showing
that adaptive methods may do much better than non-adaptive ones. These
results are not contradictory. Instead, they make different assumptions about
the problem. For a survey of conditions when adaptation helps, see Novak
(1996).
Hickernell et al. (2013) present a two stage adaptive Monte Carlo method to
estimate a mean to within a guaranteed error size with a guaranteed confidence,
under a condition (7.28) where $\sigma^2(f) = \int_{\mathbb{R}^d}(f(x) - I(f))^2 p(x)\,dx$, for some
reasonable upper bound B. Here p(x) is a probability density function from
which they sample.
tion (7.28) defines a ‘cone’ of functions in the sense that if f satisfies it then so
does cf for c > 0. It is not necessarily convex because (f (x) + g(x))/2 might
not satisfy (7.28) even if both f and g do. Convexity of this type is a com-
mon assumption in theorems that show adaptive methods cannot improve on
non-adaptive ones. As a result, those theorems do not apply to that two stage
algorithm.
Hickernell and Rugama (2016) present an adaptive quasi-Monte Carlo al-
gorithm with guaranteed accuracy under an assumption that the integrand’s
coefficients in a Walsh basis expansion decay in a reasonably, though not nec-
essarily perfectly, monotone way.
Weighted spaces
Hickernell (1998) used weighted spaces to describe problems where multivariable
interactions have decreasing importance as the number of variables involved
increases. His weights took the form $\gamma_u = \gamma^{|u|}$ for 0 < γ < 1, and they serve
to make higher order dependence in f less important. Sloan and Woźniakowski
(1998) added a dependence on the index of the variable and introduced the
product weights described in §7.7. If $f \in F_\gamma$ for product weights with $\gamma_j = j^{-\eta}$
for η > 0, then from (7.16) we find that $\int(\partial^u f_u(x))^2\,dx$ cannot be very large
for large |u|. It follows that $\int f_u(x)^2\,dx$ cannot be very large either (Owen,
2019), providing a sense in which those spaces have small effective dimension.
Product and order dependent weights (POD weights) take the form $\gamma_u = G_{|u|}\prod_{j\in u}\gamma_j$,
for some positive constants $G_{|u|}$. These POD weights were developed to solve
integration problems arising from partial differential equations with random
coefficients. See Kuo et al. (2012). Some additional choices for weights are
included in the survey of quasi-Monte Carlo methods given by Dick et al. (2013).
There have been a few papers about choosing the weighted space for a given
application. Wang and Sloan (2006) say that lattice rules, as usually tuned, consider
higher order interactions to be more important than lower order ones and don't nec-
essarily do better than digital nets (of Chapter 15) on finance problems. Their
idea for tuning weights is to choose them so that typical functions have
sensitivity indices similar to those of the given function. L'Ecuyer and Munger (2012) estimate
weights via $\gamma_u = \sigma^2_u$ from an ANOVA decomposition.
Wang and Sloan (2007) note two difficulties with weighted spaces: it can be
hard to choose the weights, and sometimes the given function is not smooth
enough to be in the weighted space. They propose a method of picking weights
proportional to $\gamma_u = 6^{|u|/2}\,w_u(f)$ where
$$w_u(f) = \int_{[0,1]^{|u|}}\biggl(\int_{[0,1]^{d-|u|}}\partial^u f(x)\,dx_{-u}\biggr)^2 dx_u$$
Sparse grids
Sparse grids were introduced by Smolyak (1963). They also go by the names 'hyperbolic
cross points', 'Boolean method' and 'discrete blending method'. Bungartz
and Griebel (2004) is a comprehensive reference on sparse grid methods.
Gerstner and Griebel (1998) focus on their use in numerical integration and
Holtz (2010) considers applications to finance.
The sparse grids presented in §7.8 retained terms $\Delta_k(f)$ for $\sum_{j=1}^d k_j \le
\ell + d - 1$. This treats all d input dimensions equally. One can instead work
adaptively using k with $\sum_{j=1}^d v_j k_j$ below some bound, where $v_j > 0$ is larger
for more important variables j and is smaller for less important ones. It is also
possible to use different quadrature rules on each variable, as one might when
estimating $\int_0^1\!\int_0^\infty f(x_1,x_2)\exp(-x_2)\,dx_1\,dx_2$. See Gerstner and Griebel (1998).
Holtz (2010) finds that sparse grids are extremely effective on integrands that
are very smooth and are dominated by low dimensional terms in an ANOVA.
He remarks that they are not very robust to non-smoothness of the integrands.
It is possible to use the sparse grid construction on s variables at a time. If
we want to integrate $f : [0,1]^{ds} \to \mathbb{R}$ then we can replace the basic integration
rules $\sum_{i=1}^{m_k} f(x_{i,k})$ for $x_{i,k} \in [0,1]$ by rules $\sum_{i=1}^{m_k} f(x_{i,k})$ with $x_{i,k} \in [0,1]^s$.
See Dick et al. (2007) who base a sparse grid rule on quasi-Monte Carlo and
randomized quasi-Monte Carlo rules.
Exercises
7.1. A test for Churg-Strauss syndrome (Masi et al., 1990) correctly detects
it in 99.7% of affected patients. The test falsely reports Churg-Strauss in 15%
of patients without the disease. Suppose that we sample m people at a clinic
and x of them test positive for Churg-Strauss. We are interested in the fraction
p ∈ (0, 1) of visitors to the clinic that really have Churg-Strauss. For a uniform
prior distribution on p the posterior distribution of p is proportional to
b) Use the midpoint rule with n = 1000 to estimate the posterior variance
$$\mathrm{Var}(p\mid x) = \frac{\int_0^1 \pi_u(p\mid x)(p - E(p\mid x))^2\,dp}{\int_0^1 \pi_u(p\mid x)\,dp}.$$
c) One third of the patients tested positive. Use the midpoint rule with
n = 1000 to estimate the posterior probability that $p \le 1/3$,
$$P(p \le 1/3\mid x) = \frac{\int_0^{1/3}\pi_u(p\mid x)\,dp}{\int_0^1\pi_u(p\mid x)\,dp}.$$
7.2. Our theoretical understanding of the midpoint rule suggests that the error
in the first two parts of Exercise 7.1 should decrease as $O(1/n^2)$. The third part
should not attain this rate because $f(p) = 1_{p\le 1/3}$ is not twice differentiable.
Use the midpoint rule with $n = 10^6$ as if it were the exact answer. Then plot
the absolute error versus n of the midpoint rule for $n = 10^j$ for j = 1, 2, 3, 4, 5,
for all three of the parts of Exercise 7.1. Does the predicted $n^{-2}$ rate hold for
the first two parts? What rate appears to hold in the third part?
7.3. Solve the system of equations
$$\int_0^n x^p\,dx = \sum_{i=0}^n w_{in}\, i^p, \quad p = 0,\dots,n$$
for the $w_{in}$, for n = 1, 2, 3, 4, 5, 6. Use your results to give the next symmetric rule
after Boole's rule.
7.4. Mike uses the midpoint rule with n points to approximate $\int_0^1 f(x)\,dx$ by $\hat I_n$,
and Trish uses the trapezoid rule with the same intervals on the same problem
to get $\tilde I_n$.
a) How many function values does Trish use?
b) How many distinct function values did the two people need?
c) Describe how they could combine their data to fit a larger midpoint rule.
7.5. Verify that equation (7.5) is correct for f (x) = x3 . Show that it fails for
f (x) = x4 .
7.6. Prove equation (7.16).
7.7. The function $f_d$ in (7.17) has mean 0 and variance 1, for any dimension
$d \ge 1$. It does however cost O(nd) computation to evaluate it n times, and in
that sense it becomes harder to integrate by Monte Carlo as d increases.
a) Construct a sequence of functions $g_d(x) = \prod_{j=1}^d h_j(x_j)$ on $[0,1]^d$ so that
$E(g_d(x)) = 0$ and $E(g_d(x)^2) = 1$ for $x \sim U[0,1]^d$.
b) Find $\|g_d\|_\gamma$ for your functions $g_d$, when $\gamma_j = j^{-\eta}$ for η > 0.
7.8. Prove equation (7.18).
Bibliography
Petras, K. (2003). Smolyak cubature of given polynomial degree with few nodes
for increasing dimension. Numerische Mathematik, 93(4):729–753.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference
for latent Gaussian models by using integrated nested Laplace approximations.
Journal of the Royal Statistical Society, Series B, 71(2):319–392.