Most of our Monte Carlo problems involve estimating expectations and these
can often be written as integrals. Sometimes it pays to treat the problems
as integrals and apply non-random numerical integration methods. In other
settings we may be able to combine Monte Carlo and other methods into hybrid
estimators. For instance, a nearly exact numerical integral of a problem related
to our own may be used as a control variate, as described in §8.9. This chapter
is specialized and may be skipped on first reading.
There is a large literature on numerical integration. Here, we look at a few
of the main ideas. Of special importance are the midpoint rule and Simpson’s
rule, for integrating over a finite interval [a, b]. They are simple to use and bring
enormous improvements for smooth functions, and extend well to small dimen-
sions d. We will also see how the advantage of classical quadrature methods
decays rapidly with increasing dimension. This phenomenon is a manifestation
of Bellman’s ‘curse of dimensionality’, with Monte Carlo versions in two classic
theorems of Bakhvalov. This curse is about worst case results and sometimes
we are presented with problems that are more favorable than the worst case.
We consider some results on ‘weighted spaces’ from which the worst functions
are excluded. We also study ‘sparse grids’ in which the number of evaluation
points required grows very slowly with dimension.
The connection between Monte Carlo and integration problems is as follows.
Suppose that our goal is to estimate $E(g(y))$ where $y \in \mathbb{R}^s$ has probability
density function p. We find a transformation function ψ(·), using methods like
those in Chapter 5, such that $y = \psi(x) \sim p$ when $x \sim U(0,1)^d$. Then
$$E(g(y)) = \int_{\mathbb{R}^s} g(y)\,p(y)\,dy = \int_{(0,1)^d} g(\psi(x))\,dx = \int_{(0,1)^d} f(x)\,dx,$$
where f (·) = g(ψ(·)). As a result our Monte Carlo problem can be transformed
into a d-dimensional quadrature. We don’t always have d = s. This method does
not work when acceptance-rejection sampling is included in the way we generate
y, because there is no a priori bound on the number of uniform random variables
that we would need.
Since we're computing integrals and not necessarily expectations, we use the
symbol I for the quantity of interest. For instance, with $I = \int_a^b f(x)\,dx$ we have
$I = (b-a)\mu$ where $\mu = E(f(x))$ for $x \sim U[a,b]$.
When f has a simple closed form, there is always the possibility that I can
be found symbolically. Tools such as Mathematica™, Maple™, and sage can
solve many integration problems. When symbolic computation cannot solve the
problem then we might turn to numerical methods instead.
Numerical integration is variously called quadrature or cubature. Some au-
thors reserve quadrature for the case where y ∈ R because the integral is the
limit of a sum of quadrilateral areas (rectangles or trapezoids). They then use
cubature for more general input dimensions. Hypercubature might be even more
appropriate, especially for d > 3, but that term is seldom used.
Theorem 7.1. Let f(x) be a real-valued function on [a, b] for $-\infty < a < b < \infty$.
Assume that the second derivative $f''(x)$ is continuous on [a, b]. Let
$t_i = a + (b-a)(i - 1/2)/n$ for $i = 1, \dots, n$. Then
$$\Bigl|\int_a^b f(x)\,dx - \frac{b-a}{n}\sum_{i=1}^n f(t_i)\Bigr| \le \frac{(b-a)^3}{24n^2}\max_{a\le z\le b}|f''(z)|.$$
$$\begin{aligned}
|\hat I - I| &= \Bigl|\int_a^b f(x)\,dx - \frac{b-a}{n}\sum_{i=1}^n f(t_i)\Bigr|
= \Bigl|\sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x) - f(t_i)\,dx\Bigr| \\
&= \Bigl|\sum_{i=1}^n \int_{x_{i-1}}^{x_i} f'(t_i)(x - t_i) + \tfrac12 f''(z(x))(x - t_i)^2\,dx\Bigr|
= \Bigl|\frac12 \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f''(z(x))(x - t_i)^2\,dx\Bigr|,
\end{aligned}$$
where the last equality uses the fact that $\int_{x_{i-1}}^{x_i}(x - t_i)\,dx = 0$ over each panel, so the
first order terms vanish from an error cancellation. This kind of cancellation plays a big role in the
development of classical quadrature methods.
Another fact about these two methods is worth mentioning: the bracketing
inequality. Suppose that we know $f''(x) \ge 0$ for all $x \in [a,b]$. Then
$I - \hat I = f''(\hat z)(b-a)^3/(24n^2) \ge 0$ and $I - \tilde I = -f''(\tilde z)(b-a)^3/(12n^2) \le 0$, and therefore
$$\hat I \le I \le \tilde I. \tag{7.4}$$
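The bracketing inequality (7.4) is easy to check numerically. Below is a minimal Python sketch (the integrand exp and the panel counts are illustrative choices, not taken from the text) that computes the compound midpoint and trapezoid estimates and shows both the bracket and the $O(n^{-2})$ error decay.

import numpy as np

def midpoint(f, a, b, n):
    # Compound midpoint rule: evaluate f at the panel centers t_i = a + (b-a)(i-1/2)/n.
    t = a + (b - a) * (np.arange(1, n + 1) - 0.5) / n
    return (b - a) / n * np.sum(f(t))

def trapezoid(f, a, b, n):
    # Compound trapezoid rule on the same n panels (n+1 equispaced points).
    x = np.linspace(a, b, n + 1)
    return (b - a) / n * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

f = np.exp                # f'' = exp > 0 on [0, 1], so the bracket (7.4) applies
a, b = 0.0, 1.0
I = np.e - 1.0            # exact value of the integral of exp over [0, 1]
for n in (4, 8, 16, 32):
    lo, hi = midpoint(f, a, b, n), trapezoid(f, a, b, n)
    print(n, lo <= I <= hi, I - lo, hi - I)   # bracket holds; both gaps shrink like 1/n^2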
$$-\frac{(b-a)^4}{180\,n^4}\,f^{(4)}(z), \quad\text{for some } z \in (a,b).$$
For closed rules we do evaluate f at the end-points of the panel. The trapezoid
rule and Simpson’s rule are both closed. Closed rules have the advantage that
some function evaluations get reused when we increase n. Open rules have
a perhaps greater advantage that they avoid the ends of the interval where
singularities often appear.
Newton-Cotes
The trapezoid rule and Simpson’s rule use m = 2 and m = 3 points respectively
within each panel. In general, one can use m points to integrate polynomials
of degree m − 1, to yield the Newton-Cotes formulas, of which the trapezoid
rule and Simpson’s rule are special cases. The Newton-Cotes rule for m = 4 is
another of Simpson’s rules, called Simpson’s 3/8 rule. Newton-Cotes rules of
odd order have the advantage that, by symmetry, they also correctly integrate
polynomials of degree m, as we saw already in the case of Simpson’s rule.
The next two odd order rules are
$$\int_0^{4h} f(x)\,dx \doteq \frac{h}{45}\bigl(14f_0 + 64f_1 + 24f_2 + 64f_3 + 14f_4\bigr), \quad\text{and} \tag{7.7}$$
$$\int_0^{6h} f(x)\,dx \doteq \frac{h}{140}\bigl(41f_0 + 216f_1 + 27f_2 + 272f_3 + 27f_4 + 216f_5 + 41f_6\bigr).$$
These are the 5 and 7 point closed Newton-Cotes formulas. Equation (7.7) is
known as Boole’s rule.
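As a quick numerical check on (7.7), the sketch below (our own, with an arbitrary choice of h) applies Boole's rule on a single panel [0, 4h] and confirms that it integrates monomials exactly up to degree 5, as expected for an odd order five point rule, but not degree 6.

import numpy as np

def boole_panel(f, h):
    # Five point closed Newton-Cotes (Boole's) rule on [0, 4h], equation (7.7).
    x = h * np.arange(5)
    w = np.array([14.0, 64.0, 24.0, 64.0, 14.0]) / 45.0
    return h * np.dot(w, f(x))

h = 0.3
for p in range(7):
    exact = (4 * h) ** (p + 1) / (p + 1)       # integral of x^p over [0, 4h]
    approx = boole_panel(lambda x: x ** p, h)
    print(p, abs(approx - exact))              # essentially zero for p <= 5, nonzero for p = 6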
High order rules should be used with caution. They exploit high order
smoothness in the integrand, but can give poor outcomes when the integrand is
not as smooth as they require. In particular, if a genuinely smooth quantity has
some mild nonsmoothness in its numerical implementation f , then high order
integration rules can behave very badly, magnifying this numerical noise.
As a further caution, Davis and Rabinowitz (1984) note that taking f fixed
and letting the order m in a Newton-Cotes formula increase does not always
converge to the right answer even for f with infinitely many derivatives. Lower
order rules applied in panels are more robust.
The Newton-Cotes rules can be made into compound rules similarly to the
way Simpson’s rule was compounded. When the basic method integrates poly-
nomials of degree r exactly within panels, then the compound method has error
O(n−r ), assuming that f (r) is continuous on [a, b].
As noted above, open rules are valuable because they avoid the endpoints
where the function may be singular. Here are a few open rules:
$$\int_0^{2h} f(x)\,dx = 2hf(h) + \frac{h^3}{3}f''(z),$$
$$\int_0^{4h} f(x)\,dx = \frac{4h}{3}\bigl(2f(h) - f(2h) + 2f(3h)\bigr) + \frac{14h^5}{45}f^{(4)}(z), \quad\text{and}$$
$$\int_0^{5h} f(x)\,dx = \frac{5h}{24}\bigl(11f(h) + f(2h) + f(3h) + 11f(4h)\bigr) + \frac{95h^5}{144}f^{(4)}(z).$$
w(x)                                     Rule
$1_{|x|\le 1}$                           Gauss-Legendre
$\exp(-x^2)$                             Gauss-Hermite
$1_{0\le x<\infty}\,\exp(-x)$            Gauss-Laguerre
$1_{|x|<1}\,(1-x^2)^{-1/2}$              Gauss-Chebyshev (1st kind)
$1_{|x|\le 1}\,(1-x^2)^{1/2}$            Gauss-Chebyshev (2nd kind)

Table 7.1: This table lists some weight functions and the corresponding families
of Gauss quadrature rules.
In each case the point z is inside the interval of integration, and the error term
assumes that the indicated derivative is continuous. The first one is simply the
midpoint rule after a change of variable to integrate over [0, 2h]. The next two
are from Davis and Rabinowitz (1984, Chapter 2.6). They both have the same
order. The last one avoids negative weights but requires an extra point.
Gauss rules
The rules considered above evaluate f at equispaced points. A Gauss rule takes
the more general form
$$\hat I_G = \sum_{i=1}^m w_i f(x_i)$$
where both the $x_i$ and the $w_i$ can be chosen to attain high accuracy.
The basic panel for a Gauss rule is conventionally [−1, 1] or sometimes R,
and not [0, h] as we used for Newton-Cotes rules. Also the target integration
problem is generally weighted. That is, we seek to approximate
$$\int_{-\infty}^{\infty} f(x)\,w(x)\,dx$$
for a weight function w(x) > 0. The widely used weight functions are multiples
of standard probability density functions, such as the uniform, gamma, Gaussian
and beta distributions; see Table 7.1. The idea is that having f be nearly a
polynomial can be much more appropriate than requiring the whole integrand
f (x)w(x) to be nearly a polynomial.
Choosing wi and xi together yields 2m parameters and it is then possible
to integrate polynomials of degree up to 2m − 1 without error. It follows that
if f (x) = p(x) where p is a polynomial with p(xi ) = f (xi ) for i = 1, . . . , m,
then IˆG (f ) = I(f ). Because of this, the Gauss rule is called an interpolatory
quadrature formula.
The error in a Gauss rule is
$$\frac{(b-a)^{2m+1}(m!)^4}{(2m+1)[(2m)!]^3}\, f^{(2m)}(z)$$
for some point z in the interval of integration.
m    $x_i$                                         $w_i$
1    0                                             2
2    $\pm\frac{1}{3}\sqrt{3}$                      1
3    0                                             8/9
     $\pm\frac{1}{5}\sqrt{15}$                     5/9
4    $\pm\frac{1}{35}\sqrt{525 - 70\sqrt{30}}$     $\frac{18 + \sqrt{30}}{36}$
     $\pm\frac{1}{35}\sqrt{525 + 70\sqrt{30}}$     $\frac{18 - \sqrt{30}}{36}$
5    0                                             128/225
     $\pm\frac{1}{21}\sqrt{245 - 14\sqrt{70}}$     $\frac{322 + 13\sqrt{70}}{900}$
     $\pm\frac{1}{21}\sqrt{245 + 14\sqrt{70}}$     $\frac{322 - 13\sqrt{70}}{900}$

Table 7.2: Gauss quadrature rules $\sum_{i=1}^m w_i f(x_i)$ to approximate
$\int_{-1}^{1} f(x)\,dx$, for m = 1, . . . , 5. From Weisstein, Eric W., "Legendre-
Gauss Quadrature." MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/Legendre-GaussQuadrature.html
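The nodes and weights of Table 7.2 are built into standard libraries, so there is rarely a need to type them in. The sketch below uses NumPy's numpy.polynomial.legendre.leggauss routine to reproduce the m = 3 row and to integrate two illustrative functions; the mapping to a general interval [a, b] is the usual affine change of variables.

import numpy as np

m = 3
x, w = np.polynomial.legendre.leggauss(m)    # nodes and weights for the integral over [-1, 1]
print(x)    # about [-0.7745967, 0, 0.7745967], i.e. 0 and +/- sqrt(15)/5 as in Table 7.2
print(w)    # about [0.5556, 0.8889, 0.5556], i.e. 5/9, 8/9, 5/9

# An m point Gauss rule integrates polynomials of degree 2m - 1 = 5 without error.
print(np.dot(w, x ** 4), 2.0 / 5.0)          # both equal 0.4

# For [a, b], map t = (a + b)/2 + (b - a) x / 2 and scale the weights by (b - a)/2.
a, b = 0.0, np.pi
t = 0.5 * (a + b) + 0.5 * (b - a) * x
print(0.5 * (b - a) * np.dot(w, np.sin(t)))  # close to 2, the integral of sin over [0, pi]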
Clenshaw-Curtis rules
Gauss rules for large n are expensive to construct. Also, the Gauss rules are
not generally nested, meaning that the points xi used at one value of m are not
necessarily used again for a larger value of m. Clenshaw-Curtis rules address
both of these difficulties. This second point makes them very well suited to
sparse grid methods for multidimensional integration, that we consider in §7.8.
As for Gauss rules, we consider integration over [−1, 1] and approximate the
integral I by
$$\hat I_{CC} = \sum_{i=1}^m w_i f(x_i).$$
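The Clenshaw-Curtis nodes are the Chebyshev extreme points $x_i = \cos(i\pi/(m-1))$, and the weights can be found by requiring exactness for the Chebyshev polynomials, which satisfy $T_j(\cos\theta) = \cos(j\theta)$. The sketch below is our own small-m construction via a linear solve; production code would use the usual FFT-based formulas instead.

import numpy as np

def clenshaw_curtis(m):
    # m point Clenshaw-Curtis rule on [-1, 1] (m >= 2).
    theta = np.pi * np.arange(m) / (m - 1)
    x = np.cos(theta)                          # Chebyshev extreme points
    # Exactness on T_0, ..., T_{m-1}: sum_i w_i cos(j theta_i) = integral of T_j over [-1, 1].
    A = np.cos(np.outer(np.arange(m), theta))
    moments = np.array([0.0 if j % 2 else 2.0 / (1.0 - j * j) for j in range(m)])
    w = np.linalg.solve(A, moments)
    return x, w

x9, w9 = clenshaw_curtis(9)
print(np.dot(w9, np.exp(x9)), np.e - 1.0 / np.e)        # both about 2.3504 for this smooth integrand
x17, _ = clenshaw_curtis(17)
print(set(np.round(x9, 12)) <= set(np.round(x17, 12)))  # True: the 9 nodes reappear among the 17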
To see informally what goes wrong, let f(x, y) be a function on $[0,1]^2$. Now
write
$$I = \int_0^1\!\!\int_0^1 f(x,y)\,dx\,dy = \int_0^1 I(y)\,dy$$
where $I(y) = \int_0^1 f(x,y)\,dx$. Suppose that we use the same m point integration
rule with weight $u_i$ on point $x_i$, for any value of y, getting
$$\hat I(y) = \sum_{i=1}^m u_i f(x_i, y) = I(y) + E_1(y),$$
where $|E_1(y)| \le Am^{-r}$ holds for some $A < \infty$ and all $0 \le y \le 1$. The value of
r depends on the method we use and how smooth f is. Next we use an n point
rule to average over y, getting
$$\hat I = \sum_{j=1}^n v_j \hat I(y_j) = \sum_{j=1}^n v_j\bigl(I(y_j) + E_1(y_j)\bigr) = \int_0^1 I(y)\,dy + E_2 + \sum_{j=1}^n v_j E_1(y_j),$$
where $|E_2| \le Bn^{-s}$ for some $B < \infty$ and s depending on the outer integration
rule and on how smooth I(y) is.
The total error is a weighted sum of the errors $E_1$ at different points $y_j$ plus the
error $E_2$. We suppose that the weighted sum of the $E_1(y_j)$ is $O(m^{-r})$. This happens
if $\sum_{j=1}^n |v_j| < C$ holds for all n, because we assumed that $|E_1(y)| \le Am^{-r}$ holds
with the same $A < \infty$ for all $y \in [0,1]$.
The result is that $|\hat I - I| = O(m^{-r} + n^{-s})$. In other words, the convergence
rate that we get is like the worst of the two one-dimensional rules that we
combine.
More generally, if we are using a d-dimensional grid of points and a product
rule
$$\hat I = \sum_{i_1=1}^{n_1}\sum_{i_2=1}^{n_2}\cdots\sum_{i_d=1}^{n_d}\,\prod_{j=1}^{d} w_{j i_j}\, f(x_{i_1}, x_{i_2}, \dots, x_{i_d}),$$
we cannot expect the result to be better than what we would get from the worst
of the rules we have used. Suppose that rule j is the least accurate. Then $\hat I$
could hardly be better than
$$\sum_{i=1}^{n_j} w_{ji}\, I_j(x_{ji})$$
where $I_j(x_j)$ is the (exact) integral of f over all variables other than $x_j$.
Getting the worst one-dimensional rate leads to a curse of dimensionality.
Suppose that we use the same n point one-dimensional quadrature rule on each
of d dimensions. As a result we use $N = n^d$ function evaluations. If the one-
dimensional rule has error $O(n^{-r})$, then the combined rule has error
$$|\hat I - I| = O(n^{-r}) = O(N^{-r/d}).$$
Even modestly large d will give a bad result. The value r is the smaller of the
number of continuous derivatives f has and the number that the quadrature
rule is able to exploit. Taking $r \gg d$ won't help in practice, because high order
rules get very cumbersome and many of them are prone to roundoff errors.
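A small sketch (our own toy example) of a product rule built from the one dimensional n point midpoint rule. The cost $N = n^d$ grows geometrically with d even though the per-axis accuracy is unchanged, which is the curse described above.

import numpy as np
from itertools import product

def product_midpoint(f, n, d):
    # Tensor product of the one dimensional n point midpoint rule over [0, 1]^d.
    t = (np.arange(1, n + 1) - 0.5) / n            # one dimensional midpoint nodes
    total = 0.0
    for idx in product(range(n), repeat=d):        # N = n**d evaluation points
        total += f(t[list(idx)])
    return total / n ** d

f = lambda x: np.prod(np.cos(x))                   # exact integral over [0, 1]^d is sin(1)^d
n = 8
for d in (1, 2, 3, 4):
    err = abs(product_midpoint(f, n, d) - np.sin(1.0) ** d)
    print(d, n ** d, err)                          # N explodes with d; error stays O(n^-2) per axis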
This curse of dimensionality is not confined to sampling on grids formed as
products of one-dimensional rules. Any quadrature rule in high dimensions will
suffer from the same problem. Two important theorems of Bakhvalov, below,
make the point.
Theorem 7.2 (Bakhvalov I). For $0 < M < \infty$ and integer $r \ge 1$, let
$$C^r_M = \Bigl\{ f : [0,1]^d \to \mathbb{R} \;\Bigm|\; \Bigl|\frac{\partial^r f(x)}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\Bigr| \le M,\ \forall\,\alpha_j \ge 0 \text{ with } \sum_{j=1}^d \alpha_j = r\Bigr\}.$$
Then there exists $k > 0$ such that for any $x_1,\dots,x_n \in [0,1]^d$ and any $w_1,\dots,w_n \in \mathbb{R}$,
there is an $f \in C^r_M$ with
$$\Bigl|\sum_{i=1}^n w_i f(x_i) - \int_{[0,1]^d} f(x)\,dx\Bigr| \ge kn^{-r/d}.$$
let $w_1,\dots,w_n \in \mathbb{R}$. Then there exists $k > 0$ such that some function $f \in C^r_M$
satisfies
$$\biggl(\mathbb{E}\Bigl(\sum_{i=1}^n w_i f(x_i) - \int_{[0,1]^d} f(x)\,dx\Bigr)^2\biggr)^{1/2} \ge kn^{-1/2-r/d}.$$
In Theorem 7.3, the worst case function f is chosen knowing how we will
sample $x_i$, but not knowing the resulting values $x_i$ that we will actually use.
Here we see an RMSE of at least $O(n^{-1/2-r/d})$, which does not contradict the
MC rate. There is a curse of dimension only in the extent to which we can
improve on MC, namely a factor proportional to $n^{-r/d}$.
Our hybrid has a curse of dimension driven by the size of s and it has variance
$\sigma^2/n$ where
$$\sigma^2 = \mathrm{Var}\Bigl(\sum_{j=1}^m w_j f(x_j, y)\Bigr) = \sum_{j=1}^m\sum_{j'=1}^m w_j w_{j'}\,\mathrm{Cov}\bigl(f(x_j,y), f(x_{j'},y)\bigr).$$
We might expect this hybrid to be useful when s is not too large and f (x, y) is
very smooth in x. Then the inner sum in (7.8) is well handled by quadrature.
If additionally, f has large variance in x at any fixed value for y, then our
quadrature may be much better than using Monte Carlo on both x and y.
If we could integrate out x in closed form then we could use the estimate
$\hat I = (1/n)\sum_{i=1}^n h(y_i)$ where $h(y) = E(f(x,y)\mid y)$ for $x \sim U(0,1)^d$. This is
the method called conditioning in §8.7. The hybrid (7.8) is conditioning with a
numerical approximation to h. The term preintegration is also used.
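Here is a minimal sketch of such a hybrid with one smooth coordinate handled by quadrature: an m point Gauss rule (mapped to [0, 1]) approximates h(y) for each Monte Carlo draw of y. The integrand, m and n are illustrative choices only.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x, y: np.exp(x * y)               # smooth in x for every fixed y

m = 8                                        # Gauss-Legendre rule mapped from [-1, 1] to [0, 1]
z, u = np.polynomial.legendre.leggauss(m)
xq, wq = 0.5 * (z + 1.0), 0.5 * u

n = 10_000
y = rng.random(n)                            # Monte Carlo samples y ~ U(0, 1)
h = np.array([np.dot(wq, f(xq, yi)) for yi in y])   # quadrature approximation to h(y) = E(f(x, y) | y)
print(h.mean(), h.std(ddof=1) / np.sqrt(n))  # estimate and its Monte Carlo standard error
# For reference, int_0^1 int_0^1 exp(xy) dx dy = int_0^1 (exp(y) - 1)/y dy, about 1.3179.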
Hybrids of Monte Carlo and quasi-Monte Carlo methods are often used. See
Chapter 17. They take the form $\hat I = (1/n)\sum_{i=1}^n f(x_i, y_i)$ for a quadrature rule
with $x_i \in [0,1]^s$ and Monte Carlo samples $y_i \sim U(0,1)^d$.
where $|\Sigma|$ is the determinant of Σ, using the known normalization of the $N(x^*, \Sigma)$
distribution. Note that $|\Sigma| = |{-H^{-1}}|$ which is the absolute value of $1/|H|$.
The Laplace approximation is very accurate when log(f (x)) is smooth and
the quadratic approximation is good where f is not negligible. Such a phe-
nomenon often happens when x is a statistical parameter subject to the central
limit theorem, f (x) is its posterior distribution, and the sample size is large
enough for the CLT to apply.
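A minimal numerical sketch of the Laplace approximation $\int f(x)\,dx \approx f(x^*)(2\pi)^{d/2}|\Sigma|^{1/2}$ with $\Sigma = -H^{-1}$, where $x^*$ maximizes $\log f$ and H is the Hessian of $\log f$ there. The toy integrand, the finite difference Hessian and the optimizer settings are all our own choices; a real application would supply its own log density and, ideally, analytic derivatives.

import numpy as np
from scipy.optimize import minimize

def log_f(x):
    # Toy example: a smooth, unimodal, non-Gaussian integrand on R^2.
    return -np.sum(x ** 2) - 0.1 * np.sum(x ** 4)

def hessian_fd(g, x, eps=1e-4):
    # Central finite difference Hessian of the scalar function g at x.
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                       - g(x - ei + ej) + g(x - ei - ej)) / (4.0 * eps ** 2)
    return H

d = 2
x_star = minimize(lambda x: -log_f(x), np.zeros(d)).x   # mode of f
H = hessian_fd(log_f, x_star)                           # Hessian of log f at the mode
Sigma = -np.linalg.inv(H)
approx = np.exp(log_f(x_star)) * (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
print(approx)   # compare with a brute force grid or Monte Carlo estimate of the same integral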
If we want the integral of g(x)f (x) for smooth g, then we may multiply (7.9)
by g(x∗ ). The following theorem gives the details:
holds as λ → ∞, if
i) $A \subseteq \mathbb{R}^d$ is open,
ii) $\int_A |g(x)|e^{-\lambda h(x)}\,dx < \infty$ for all $\lambda \ge \lambda_0$, for some $\lambda_0 < \infty$,
Proof. This is Theorem 4.14 of Evans and Swartz (2000) which is based on a
result in Wong (1989).
The value $\hat\theta = \arg\max_\theta \ell(\theta)$ is the maximum likelihood estimate of θ. When
$\hat\theta$ is interior to Θ and ℓ is smooth, then we have the Taylor approximation
$\ell(\theta) \doteq \ell(\hat\theta) - \tfrac12(\theta-\hat\theta)^{\mathsf T}H(\hat\theta)(\theta-\hat\theta)$ where
$$H(\theta) = -\sum_{i=1}^n \frac{\partial^2\log(p(x_i;\theta))}{\partial\theta\,\partial\theta^{\mathsf T}} \equiv n\hat I(\theta).$$
The quantity $\hat I(\hat\theta)$ is known as the empirical Fisher information for θ. Because
H is a sum of n independent and identically distributed terms, we find by the
law of large numbers that $H(\theta) \approx -n\,E\bigl(\partial^2\log(p(x;\theta))/\partial\theta\,\partial\theta^{\mathsf T}\bigr) \equiv nI(\theta)$. This
I(θ) is the ordinary (non-empirical) Fisher information for θ. In asymptotic
expansions, it is most convenient to work with $nI(\hat\theta)$, but to present the method
we work with $H(\hat\theta)$.
The best predictor of g(θ), with respect to squared error, is the posterior mean
$$E(g(\theta)\mid\mathcal{X}) = \frac{\int g(\theta)\pi_0(\theta)e^{\ell(\theta)}\,d\theta}{\int \pi_0(\theta)e^{\ell(\theta)}\,d\theta},$$
a ratio of integrals that we can write in the generic form
$$\frac{\int g_N(\theta)e^{-H_N(\theta)}\,d\theta}{\int g_D(\theta)e^{-H_D(\theta)}\,d\theta} \tag{7.10}$$
where, for example, $g_N(\theta)$ is either 1 or g(θ) or $g(\theta)\pi_0(\theta)$ with $H_N(\theta)$ adjusted
accordingly, and similar choices are available for the denominator. We investi-
gate several specific formulations of (7.10) below.
If we take $\hat\theta_N$ and $\hat\theta_D$ to be the minimizers of $H_N$ and $H_D$ respectively, then
a straightforward use of the Laplace approximation yields
$$\widehat E(g(\theta)\mid\mathcal{X}) = \frac{g_N(\hat\theta_N)\,|\bar H_N(\hat\theta_N)|^{-1/2}\,e^{-H_N(\hat\theta_N)}}{g_D(\hat\theta_D)\,|\bar H_D(\hat\theta_D)|^{-1/2}\,e^{-H_D(\hat\theta_D)}}, \tag{7.11}$$
where $\bar H_N$ and $\bar H_D$ are the Hessian matrices of $H_N$ and $H_D$ respectively.
In the standard form, where $H_N = H_D$, the Hessian and exponential factors cancel and (7.11)
reduces to
$$\widehat E(g(\theta)\mid\mathcal{X}) = \frac{g_N(\hat\theta_N)}{g_D(\hat\theta_D)} = \frac{g_N(\hat\theta_N)}{g_D(\hat\theta_N)} \tag{7.12}$$
because $\hat\theta_D = \hat\theta_N$ in the standard form. If we take $H_N(\theta) = H_D(\theta) = -\ell(\theta)$, so
that $g_N(\theta) = g(\theta)\pi_0(\theta)$ and $g_D(\theta) = \pi_0(\theta)$, we obtain $\widehat E(g(\theta)\mid\mathcal{X}) = g(\hat\theta)$ where
$\hat\theta$ is the maximum likelihood estimator of θ. If instead, we take $H_N(\theta) = H_D(\theta) = -\ell(\theta) -
\log(\pi_0(\theta))$, with $g_N(\theta) = g(\theta)$ and $g_D(\theta) = 1$, then we obtain $\widehat E(g(\theta)\mid\mathcal{X}) = g(\tilde\theta)$,
where $\tilde\theta$ is the maximum a posteriori (MAP) estimate of θ.
The Laplace approximation is in fully exponential form when $g_N(\theta) =
g_D(\theta)$. For example, we might have $g_N(\theta) = g_D(\theta) = 1$ with $H_N(\theta) = -\ell(\theta) -
\log(\pi_0(\theta)) - \log(g(\theta))$ and $H_D(\theta) = -\ell(\theta) - \log(\pi_0(\theta))$. The fully exponential form requires
$g(\theta) > 0$. In the fully exponential form, the estimate (7.11) becomes
MCMC, by contrast, is designed to find and sample from multiple modes, although on some
problems it will have difficulty doing so. The Laplace approximation also re-
quires finding the optimum of a d-dimensional function and working with the
Hessian at the mode. In some settings that optimization may be difficult, and
when d is extremely large, then finding the determinant of the Hessian can be
a challenge. Finally, posterior distributions that are discrete or are mixtures of
continuous and discrete parts can be handled by MCMC but are not suitable
for the Laplace approximation.
The Laplace approximation is not completely superseded by MCMC. In par-
ticular, the fully exponential version is very accurate for problems with modest
dimension d and large n. When the optimization problem is readily solved,
it may provide a much more automatic answer than MCMC does, and it is
much faster.
The integrated nested Laplace approximation of Rue et al. (2009) is an al-
ternative to MCMC for large scale problems. It uses a weighted sum of Laplace
approximations. If the variables θ have components (η, φ) then it uses a Laplace
approximation to integrate over φ given η summed over a weighted set of val-
ues ηk for k = 1, . . . , K. It is thus a numerical integral over η values of a
Laplace approximation over φ given η. The statistical context behind η and
φ can be complex. See Palacios and Minin (2012) for an example in phylodynamics,
estimating historical sizes of populations from present day genetic data.
See Martino and Riebler (2019) for an introductory account.
where the function fu (x) only depends on those xj for j ∈ u. The ANOVA
decomposition of Appendix A represents f this way, and there are other decom-
positions.
Using (7.14), we may bound the integration error over $[0,1]^d$ by
$$|\hat I - I| \le \sum_u \Bigl|\frac1n\sum_{i=1}^n f_u(x_i) - \int_{[0,1]^d} f_u(x)\,dx\Bigr|.$$
If some of the functions fu are always close to zero or are otherwise easy to
integrate, then they might not contribute much to Iˆ − I. In a weighted space
analysis, presented below, we introduce weights γu > 0 for each subset u of
inputs to f . Larger weights γu will allow fu to be a more complicated and
harder to integrate function, while smaller weights γu will force fu to be closer
to zero. Other things being equal, $\gamma_u$ is usually chosen to be a decreasing
function of the number of variables included in u (so $\gamma_{\{1,2,3\}} < \gamma_{\{1,2\}}$) and also
of the indices of those variables (so $\gamma_{\{3,4,5\}} < \gamma_{\{1,2,3\}}$). All $2^d$ values $\gamma_u$ together
are represented by $\gamma \in (0,\infty)^{2^d}$.
To shorten some expressions, we introduce the notation 1:d for {1, 2, . . . , d}. For
$x \in \mathbb{R}^d$ and $u \subseteq 1{:}d$ we let $x_u$ be made up of the entries $x_j$ with $j \in u$. We also use
$$\partial^u f(x) \equiv \frac{\partial^{|u|}}{\prod_{j\in u}\partial x_j}\, f(x)$$
for the mixed partial derivative taken once with respect to each $x_j$ with $j \in u$. The weighted
space of integrands is the ball
$$F_\gamma = \{f \mid \|f\|_\gamma \le 1\}, \quad\text{where}\quad \|f\|^2_\gamma = \sum_{u\subseteq 1:d}\frac{1}{\gamma_u}\int_{[0,1]^{|u|}}\biggl(\int_{[0,1]^{d-|u|}}\partial^u f(x)\,dx_{-u}\biggr)^2 dx_u. \tag{7.15}$$
See Exercise 7.6. For example, the contributions of u = {j} and u = {j, k} are
$$\frac{1}{\gamma_{\{j\}}}\int_0^1\Bigl(\frac{\partial}{\partial x_j} f_{\{j\}}(x)\Bigr)^2 dx_j \quad\text{and}\quad \frac{1}{\gamma_{\{j,k\}}}\int_0^1\!\!\int_0^1\Bigl(\frac{\partial^2}{\partial x_j\,\partial x_k} f_{\{j,k\}}(x)\Bigr)^2 dx_j\,dx_k,$$
respectively. The first one takes a mean squared slope of the main effect $f_{\{j\}}$
for variable j and penalizes it by $1/\gamma_{\{j\}}$. The larger $\gamma_{\{j\}}$ is, the less that term
contributes to $\|f\|_\gamma$ and by that measure, the more important $x_j$ can be, without
putting f over the complexity budget $\|f\|_\gamma \le 1$. For u = {j, k}, it is the mean
squared second order mixed partial that is penalized by $1/\gamma_{\{j,k\}}$.
Recall that Bakhvalov's theorems 7.2 and 7.3 measure difficulty through
mixed partial derivatives. It is not just that f changes with $x_j$ that complicates
a problem, but also that the nature of this change varies with $x_k$ for $k \ne j$ and so
on for other indices. The criterion in (7.16) sharply constrains how complicated
$f_u$ can be when $\gamma_u$ is small. The smaller $\gamma_u$ is, the smaller $|\partial^u f_u|^2$ must be on
average, for f to fit into $F_\gamma$.
It is very common to study product weights with $\gamma_u = \prod_{j\in u}\gamma_j$ where $1 \ge
\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_d > 0$. Then sets u with larger cardinality are modeled as no
more important, and so are sets u containing larger indices j, as described above.
Popular choices include $\gamma_j = j^{-\eta}$ for a parameter η > 0. Some additional weight
choices are described in the end notes.
Tractability is a way to describe the integration problem not getting too
much harder as d → ∞. For product weights, we can consider d → ∞ by using
the first d values of γj . Letting X1:n denote all the points x1 , . . . , xn ∈ [0, 1]d ,
the worst case error in $F_\gamma$ is
$$e_{n,d}(X_{1:n}, F_\gamma) = \sup_{f\in F_\gamma}\Bigl|\frac1n\sum_{i=1}^n f(x_i) - \int_{[0,1]^d} f(x)\,dx\Bigr|.$$
If we had no points at all to use, then our best estimate is $\hat I = 0$ and we would
incur the initial error
$$e_{0,d}(F_\gamma) = \sup_{f\in F_\gamma}\int_{[0,1]^d} f(x)\,dx = \sup_{f\in F_\gamma}\Bigl|\int_{[0,1]^d} f(x)\,dx\Bigr|.$$
For some weighted spaces this minimal number grows rapidly with dimension.
Let $\Gamma_d = \sum_{j=1}^d \gamma_j$. They show that
This function has integral 0 over [0, 1]d . It has variance 1 for x ∼ U[0, 1]d , and
in that sense it is equally hard to integrate by plain Monte Carlo for any d.
From (7.15),
$$\|f\|_\gamma = \frac{12^d}{\gamma_{1:d}}. \tag{7.18}$$
See Exercise 7.8. With $\gamma_j = j^{-\eta}$, we get $\|f\|_\gamma = (d!)^\eta 12^d$. Now $f/c \in F_\gamma$ for
$c = (d!)^\eta 12^d$, and so the error in integrating f/c can be brought below $\varepsilon$ at a
cost of $n = C\varepsilon^{-p}$ function evaluations. The error in integrating f is then below
$$c\times\varepsilon = 12^d(d!)^\eta\varepsilon$$
at that n. Now suppose that we want to ensure that $|\hat I - I| = |\hat I| \le \epsilon$. Then we
require
$$\varepsilon \le \frac{\epsilon}{12^d(d!)^\eta} \quad\text{and}\quad n \ge C\epsilon^{-p}\,12^{dp}(d!)^{\eta p}.$$
In this example, larger d requires greater rescaling of $f_d$ to fit into $F_\gamma$, leading
to a larger initial error, a greater reduction $\varepsilon$ in the initial error to attain error
$\epsilon$, and finally a required sample size that grows rapidly with d. For functions
like $f_d$ that are not really dominated by low dimensional terms, the curse of
dimension can reappear within the initial error.
For instance, for integration over [0, 1], we might take $Q_1(f) = f(1/2)$ and then
for $k \ge 2$, use a trapezoid rule
$$Q_k(f) = \frac{1}{2^{k-1}}\Bigl(\frac12 f(0) + \sum_{i=1}^{2^{k-1}-1} f\bigl(i/2^{k-1}\bigr) + \frac12 f(1)\Bigr).$$
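A sketch of this nested sequence in code, together with the difference rules $\Delta_k = Q_k - Q_{k-1}$ (taking $\Delta_1 = Q_1$) that the sparse grid construction combines. The test integrand is illustrative; the check is that the levels telescope, so that $\sum_{k\le K}\Delta_k(f) = Q_K(f)$.

import numpy as np

def Q(f, k):
    # Level k rule on [0, 1]: midpoint for k = 1, trapezoid with 2^(k-1) + 1 points for k >= 2.
    if k == 1:
        return float(f(np.array([0.5]))[0])
    n = 2 ** (k - 1)
    x = np.arange(n + 1) / n
    w = np.full(n + 1, 1.0 / n)
    w[0] = w[-1] = 0.5 / n
    return float(np.dot(w, f(x)))

def Delta(f, k):
    # Difference rule combined in the sparse grid (Smolyak) construction.
    return Q(f, k) if k == 1 else Q(f, k) - Q(f, k - 1)

f = np.exp
K = 6
print(sum(Delta(f, k) for k in range(1, K + 1)), Q(f, K), np.e - 1.0)  # first two agree up to roundoff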
Figure 7.1: Each panel shows a sparse grid in [0, 1]2 , based on univariate
Clenshaw-Curtis rules. The corresponding quadrature rules are unequally
weighted sums of function values over these sparse grids of points.
Equation (7.20) approximates $\int_{[0,1]^d} f(x)\,dx$ using a weighted sum of function
evaluations on a grid of $\prod_{j=1}^d m_{k_j}$ points. We similarly define $\Delta_{k_1}\otimes\cdots\otimes\Delta_{k_d}$, for
example with d = 2. The coarsest sparse grid rule is
$$Q_{1,d}(f) = \Delta_{(1,\dots,1)}(f) = Q_1\otimes Q_1\otimes\cdots\otimes Q_1(f).$$
The number of points in the sparse grid satisfies
$$N_{\ell,d} \le \sum_{k:\,\|k\|_1\le\ell+d-1}\ \prod_{j=1}^d m_{k_j}$$
Table 7.3: Example sparse grid sizes for d = 8. The third column is from Table
3 of Gerstner and Griebel (1998). The fourth column is the approximate size of
a non-sparse grid. The final column divides an asymptotic estimate for $N_{\ell,8}$ by
its value.
and is generally less than that because the formula above counts some of the
input points multiple times. The widely used sparse grid points, such as those
based on Clenshaw-Curtis rules, have $m_k = O(2^k)$. Then $N_{\ell,d} = O(2^\ell\,\ell^{d-1})$,
compared to $m_\ell^d$ for the non-sparse grid (7.22) (Gerstner and Griebel, 1998).
Table 7.3 shows an example for d = 8. Lemma 1 of Müller-Gronbach (1998)
gives a sharper result, that
$$\lim_{\ell\to\infty}\frac{N_{\ell,d}}{2^{\ell-d}(\ell-1)^{d-1}/(d-1)!} = 1,$$
for fixed d, though this asymptotic equivalence may be slow to take hold.
The sparse grid estimate can be written
$$Q_{\ell,d}(f) = \sum_{i=1}^{N_{\ell,d}} W_i f(x_i)$$
where the sum is over all $N_{\ell,d}$ points $x_i$ (ordered somehow) in the sparse grid
and for each of them $W_i$ is the total of all the weights applied to $x_i$ from all of the
$\Delta_k$ that include it. The bookkeeping underlying the $W_i$ is somewhat complicated.
See Gerstner and Griebel (1998). Some of these $W_i$ can be negative, even
when all of the $Q_k$ have only non-negative $w_{i,k}$, owing to the differences $Q_k -
Q_{k-1}$ appearing in the formula for $Q_{\ell,d}$. While $\sum_i W_i = 1$, the appearance of
negative weights can potentially make $\sum_i |W_i|$ much larger than one, affecting
the numerical stability of the sparse grid estimate. Novak and Ritter (1997,
Lemma 2) show that $\sum_i |W_i| = O(\log(N_{\ell,d})^{d-1})$, even for non-nested sparse
grid rules. The implied coefficient of $\log(N_{\ell,d})^{d-1}$ depends on d. Gerstner and
Griebel (1998) describe a recursive computation of $Q_{\ell,d}$ that reduces roundoff
error.
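A small Smolyak-type sketch for d = 2, built from nested one dimensional trapezoid levels and summing the tensor products $\Delta_{k_1}\otimes\Delta_{k_2}$ over $k_1 + k_2 \le \ell + 1$. It is our own minimal implementation, storing signed node and weight lists for each $\Delta_k$ rather than using the more careful recursions of Gerstner and Griebel (1998).

import numpy as np
from itertools import product

def rule(k):
    # Level k one dimensional rule on [0, 1]: nodes and weights.
    if k == 1:
        return np.array([0.5]), np.array([1.0])
    n = 2 ** (k - 1)
    x = np.arange(n + 1) / n
    w = np.full(n + 1, 1.0 / n)
    w[0] = w[-1] = 0.5 / n
    return x, w

def delta(k):
    # Signed nodes and weights representing Delta_k = Q_k - Q_{k-1}.
    x, w = rule(k)
    if k == 1:
        return x, w
    x0, w0 = rule(k - 1)
    return np.concatenate([x, x0]), np.concatenate([w, -w0])

def smolyak_2d(f, level):
    # Sum of Delta_{k1} (tensor) Delta_{k2} over k1 + k2 <= level + 1.
    total = 0.0
    for k1 in range(1, level + 1):
        for k2 in range(1, level + 2 - k1):
            x1, w1 = delta(k1)
            x2, w2 = delta(k2)
            total += sum(a * b * f(u, v)
                         for (u, a), (v, b) in product(zip(x1, w1), zip(x2, w2)))
    return total

f = lambda u, v: np.exp(u + v)                       # exact integral over [0,1]^2 is (e - 1)^2
for level in (1, 2, 3, 4, 5):
    print(level, abs(smolyak_2d(f, level) - (np.e - 1.0) ** 2))   # error falls quickly for smooth f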
To describe the accuracy of sparse grids with the Clenshaw-Curtis quadra-
ture rules, we present results from Novak and Ritter (1997). For $\alpha = (\alpha_1,\dots,\alpha_d)$, let
$$f^{(\alpha)}(x) = \frac{\partial^{\|\alpha\|_1} f(x)}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}.$$
Now, consider these two sets of functions, for r > 1 and d > 1,
Novak and Ritter (1997) defined functions on $[-1,1]^d$ but that difference will
not affect the results we see here. The sets above are the unit balls in the
function classes that they define. For d > 1, the unit ball $F_d^r$ constrains more
partial derivatives of f than $C_d^r$ does, and so $F_d^r \subsetneq C_d^r$.
Let Q`,d be a sparse grid quadrature using the Clenshaw-Curtis rules and
let N = N`,d be the number of points in it. Novak and Ritter (1997) show that
there are finite constants cd,r such that
The results for $C_d^r$ are within logarithmic factors of the bounds in Bakhvalov's
theorem. Under the stricter conditions described by $F_d^r$, we see that a much
better rate can be attained. When using sparse grids with Clenshaw-Curtis, we
do not have to know the largest r for which f can be scaled to belong to $C_d^r$
or $F_d^r$. The error satisfies (7.26) for all of those r even though the method is
defined without regard to r.
The rate (7.26) does not set in until N is large enough to make use of r’th
derivative information in the sample. As N increases, ever larger values of r
become relevant. For fixed r the bound in (7.26) decreases linearly versus N on
a log-log plot. With the effective value of r increasing with N , we can get an
ever steepening concave plot of error versus N in a log-log plot. Holtz (2010,
Chapter 4.2) shows some examples of this phenomenon for very smooth f with
d < 10.
Gerstner and Griebel (1998) include several numerical examples of sparse
grid quadrature. In some of their examples, using the Clenshaw-Curtis rule
is much better than the trapezoid rule, while in others the results are close.
Similarly, in some of their examples a great improvement arises from using a
Gauss rule instead of Clenshaw-Curtis, despite the larger sample sizes required
from a non-nested base rule. In other examples the method based on Clenshaw-
Curtis is almost equally effective as using a Gauss rule.
The cost of sparse grids eventually grows rapidly with dimension. Suppose
that we want to include points that differ from 1/2 in s or more of their d
components. Then we must have $\ell \ge s + 1$. Estimating $N_{\ell,d}$ by $2^\ell\ell^{d-1}$ we see
that it will become very expensive to have large d with even modest ℓ. The
estimate $2^{\ell-d}(\ell-1)^{d-1}/(d-1)!$ from Müller-Gronbach (1998) is more favorable
but we must recall that it applies to ℓ → ∞ for fixed d, not d → ∞ for fixed ℓ.
Delayed basis sequences, introduced by Petras (2003), are a method to cope
with the high cost of sparse grids in high dimensions. Given a sequence of sample
sizes m1 , m2 , m3 , . . . , a delayed sequence might use sizes m̃k from a sequence
like
m1 , m2 , m2 , m3 , m3 , m3 , m4 · · · .
The resulting sizes m̃k increase slowly with k. When two consecutive rules Qk−1
and Qk are identical, then ∆k = 0 and need not be computed. Holtz (2010) has
several examples.
Adaptive integration
Press et al. (2007) present some adaptive quadrature methods. The idea is to
take a rule like the midpoint rule and use it with a higher density of points in
some subintervals than others. For example, subintervals where the function
varies more, such as having a larger |f 00 |, get more points while other intervals
get fewer points. The information on where the function varies most is gathered
along with the function values. They also present some multivariate adaptive
quadrature methods.
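A classic instance of the idea is recursive adaptive Simpson quadrature, which subdivides an interval only where a local error estimate is large. The sketch below is a generic textbook version with the usual factor-of-15 error heuristic, not the specific algorithms of Press et al. (2007); the square root example concentrates its points near 0, where f'' is singular.

import math

def simpson(f, a, b):
    # Basic three point Simpson estimate on [a, b].
    c = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(c) + f(b))

def adaptive_simpson(f, a, b, tol=1e-8):
    # Recurse only where splitting [a, b] changes the estimate by more than the tolerance.
    c = 0.5 * (a + b)
    whole = simpson(f, a, b)
    left, right = simpson(f, a, c), simpson(f, c, b)
    if abs(left + right - whole) < 15.0 * tol:
        return left + right + (left + right - whole) / 15.0   # Richardson-style correction
    return (adaptive_simpson(f, a, c, 0.5 * tol) +
            adaptive_simpson(f, c, b, 0.5 * tol))

print(adaptive_simpson(math.sqrt, 0.0, 1.0), 2.0 / 3.0)   # singular f'' at 0: points pile up there
print(adaptive_simpson(math.cos, 0.0, 1.0), math.sin(1.0))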
There is some mild controversy about the use of adaptive methods. There are
theoretical results showing that adaptive methods cannot improve significantly
over non-adaptive ones. There are also theoretical and empirical results showing
that adaptive methods may do much better than non-adaptive ones. These
results are not contradictory. Instead, they make different assumptions about
the problem. For a survey of conditions when adaptation helps, see Novak
(1996).
Hickernell et al. (2013) present a two stage adaptive Monte Carlo method to
estimate a mean to within a guaranteed error size with a guaranteed confidence,
under a condition (7.28) where $\sigma^2(f) = \int_{\mathbb{R}^d}(f(x) - I(f))^2 p(x)\,dx$, for some
reasonable upper bound B. Here p(x) is a probability density function from
which they sample.
tion (7.28) defines a ‘cone’ of functions in the sense that if f satisfies it then so
does cf for c > 0. It is not necessarily convex because (f (x) + g(x))/2 might
not satisfy (7.28) even if both f and g do. Convexity of this type is a com-
mon assumption in theorems that show adaptive methods cannot improve on
non-adaptive ones. As a result, those theorems do not apply to that two stage
algorithm.
Hickernell and Rugama (2016) present an adaptive quasi-Monte Carlo al-
gorithm with guaranteed accuracy under an assumption that the integrand’s
coefficients in a Walsh basis expansion decay in a reasonably, though not nec-
essarily perfectly, monotone way.
Weighted spaces
Hickernell (1998) used weighted spaces to describe problems where multivariable
interactions have decreasing importance as the number of variables involved
increases. His weights took the form $\gamma_u = \gamma^{|u|}$ for 0 < γ < 1, and they serve
to make higher order dependence in f less important. Sloan and Woźniakowski
(1998) added a dependence on the index of the variable and introduced the
product weights described in §7.7. If $f \in F_\gamma$ for product weights with $\gamma_j = j^{-\eta}$
for η > 0, then from (7.16) we find that $\int(\partial^u f_u(x))^2\,dx$ cannot be very large
for large |u|. It follows that $\int f_u(x)^2\,dx$ cannot be very large either (Owen,
2019), providing a sense in which those spaces have small effective dimension.
Product and order dependent weights (POD weights) take the form $\gamma_u = G_{|u|}\prod_{j\in u}\gamma_j$,
for some positive constants $G_{|u|}$. These POD weights were developed to solve
integration problems arising from partial differential equations with random
coefficients. See Kuo et al. (2012). Some additional choices for weights are
included in the survey of quasi-Monte Carlo methods given by Dick et al. (2013).
There have been a few papers about choosing the weighted space for a given
application. Wang and Sloan (2006) say that lattice rules, as usually tuned, consider
higher order interactions to be more important than lower order ones and don't nec-
essarily do better than digital nets (of Chapter 15) on finance problems. Their
idea for tuning weights is to choose them so that typical functions have
sensitivity indices similar to those of the given function. L'Ecuyer and Munger (2012) estimate
weights via $\gamma_u = \sigma^2_u$ from an ANOVA decomposition.
Wang and Sloan (2007) note two difficulties with weighted spaces: it can be
hard to choose the weights, and sometimes the given function is not smooth
enough to be in the weighted space. They propose a method of picking weights
proportional to $\gamma_u = 6^{|u|/2}\,w_u(f)$ where
$$w_u(f) = \int_{[0,1]^{|u|}}\biggl(\int_{[0,1]^{d-|u|}}\partial^u f(x)\,dx_{-u}\biggr)^2 dx_u$$
Sparse grids
Sparse grids were introduced by Smolyak (1963). They also go by the names 'hyperbolic
cross points', 'Boolean method' and 'discrete blending method'. Bungartz
and Griebel (2004) is a comprehensive reference on sparse grid methods.
Gerstner and Griebel (1998) focus on their use in numerical integration and
Holtz (2010) considers applications to finance.
The sparse grids presented in §7.8 retained terms $\Delta_k(f)$ for $\sum_{j=1}^d k_j \le
\ell + d - 1$. This treats all d input dimensions equally. One can instead work
adaptively using k with $\sum_{j=1}^d v_j k_j$ below some bound, where $v_j > 0$ is larger
for more important variables j and is smaller for less important ones. It is also
possible to use different quadrature rules on each variable, as one might when
estimating $\int_0^1\!\int_0^\infty f(x_1,x_2)\exp(-x_2)\,dx_1\,dx_2$. See Gerstner and Griebel (1998).
Holtz (2010) finds that sparse grids are extremely effective on integrands that
are very smooth and are dominated by low dimensional terms in an ANOVA.
He remarks that they are not very robust to non-smoothness of the integrands.
It is possible to use the sparse grid construction on s variables at a time. If
we want to integrate $f : [0,1]^{ds} \to \mathbb{R}$ then we can replace the basic integration
rules $\sum_{i=1}^{m_k} f(x_{i,k})$ for $x_{i,k} \in [0,1]$ by rules $\sum_{i=1}^{m_k} f(x_{i,k})$ with $x_{i,k} \in [0,1]^s$.
See Dick et al. (2007) who base a sparse grid rule on quasi-Monte Carlo and
randomized quasi-Monte Carlo rules.
Exercises
7.1. A test for Churg-Strauss syndrome (Masi et al., 1990) correctly detects
it in 99.7% of affected patients. The test falsely reports Churg-Strauss in 15%
of patients without the disease. Suppose that we sample m people at a clinic
and x of them test positive for Churg-Strauss. We are interested in the fraction
p ∈ (0, 1) of visitors to the clinic that really have Churg-Strauss. For a uniform
prior distribution on p the posterior distribution of p is proportional to
b) Use the midpoint rule with n = 1000 to estimate the posterior variance
$$\mathrm{Var}(p\mid x) = \frac{\int_0^1 \pi_u(p\mid x)(p - E(p\mid x))^2\,dp}{\int_0^1 \pi_u(p\mid x)\,dp}.$$
c) One third of the patients tested positive. Use the midpoint rule with
n = 1000 to estimate the posterior probability that $p \le 1/3$,
$$P(p \le 1/3\mid x) = \frac{\int_0^{1/3}\pi_u(p\mid x)\,dp}{\int_0^1\pi_u(p\mid x)\,dp}.$$
7.2. Our theoretical understanding of the midpoint rule suggests that the error
in the first two parts of Exercise 7.1 should decrease as $O(1/n^2)$. The third part
should not attain this rate because $f(p) = 1_{p\le 1/3}$ is not twice differentiable.
Use the midpoint rule with $n = 10^6$ as if it were the exact answer. Then plot
the absolute error versus n of the midpoint rule for $n = 10^j$ for j = 1, 2, 3, 4, 5,
for all three of the parts of Exercise 7.1. Does the predicted $n^{-2}$ rate hold for
the first two parts? What rate appears to hold in the third part?
7.3. Solve the system of equations
$$\int_0^n x^p\,dx = \sum_{i=0}^n w_{in}\, i^p, \quad p = 0,\dots,n$$
for the $w_{in}$, for n = 1, 2, 3, 4, 5, 6. Use your results to give the next symmetric rule
after Boole's rule.
7.4. Mike uses the midpoint rule with n points to approximate $\int_0^1 f(x)\,dx$ by $\hat I_n$,
and Trish uses the trapezoid rule with the same intervals on the same problem
to get $\tilde I_n$.
a) How many function values does Trish use?
b) How many distinct function values did the two people need?
c) Describe how they could combine their data to fit a larger midpoint rule.
7.5. Verify that equation (7.5) is correct for f (x) = x3 . Show that it fails for
f (x) = x4 .
7.6. Prove equation (7.16).
7.7. The function $f_d$ in (7.17) has mean 0 and variance 1, for any dimension
$d \ge 1$. It does however cost O(nd) computation to evaluate it n times, and in
that sense it becomes harder to integrate by Monte Carlo as d increases.
a) Construct a sequence of functions $g_d(x) = \prod_{j=1}^d h_j(x_j)$ on $[0,1]^d$ so that
$E(g_d(x)) = 0$ and $E(g_d(x)^2) = 1$ for $x \sim U[0,1]^d$.
b) Find $\|g_d\|_\gamma$ for your functions $g_d$, when $\gamma_j = j^{-\eta}$ for η > 0.
7.8. Prove equation (7.18).
Bibliography
Petras, K. (2003). Smolyak cubature of given polynomial degree with few nodes
for increasing dimension. Numerische Mathematik, 93(4):729–753.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference
for latent Gaussian models by using integrated nested Laplace approximations.
Journal of the Royal Statistical Society, Series B, 71(2):319–392.