Approximate Maximum Likelihood Estimation
Johanna Bertl∗
Aarhus University, Denmark

Gregory Ewing
École polytechnique fédérale de Lausanne, Switzerland

Carolin Kosiol
Vetmeduni Vienna, Austria

Andreas Futschik
Johannes Kepler University Linz, Austria
July 17, 2015
Abstract
In recent years, methods of approximate parameter estimation have attracted considerable interest in complex problems where exact likelihoods
are hard to obtain. In their most basic form, Bayesian methods such as
Approximate Bayesian Computation (ABC) involve sampling from the
parameter space and keeping those parameters that produce data that
fit sufficiently well to the actually observed data. Exploring the whole
parameter space, however, makes this approach inefficient in high dimensional problems. This led to the proposal of more sophisticated iterative
methods of inference such as particle filters.
Here, we propose an alternative approach that is based on stochastic
gradient methods and applicable both in a frequentist and a Bayesian
setting. By moving along a simulated gradient, the algorithm produces a
sequence of estimates that will eventually converge either to the maximum
likelihood estimate or to the maximum of the posterior distribution, in
each case under a set of observed summary statistics. To avoid reaching
only a local maximum, we propose to run the algorithm from a set of
random starting values.
As good tuning of the algorithm is important, we explored several
tuning strategies, and propose a set of guidelines that worked best in our
simulations. We investigate the performance of our approach in simulation studies, and also apply the algorithm to two models with intractable
likelihood functions. First, we present an application to inference in the
context of queuing systems. We also re-analyze population genetic data
and estimate parameters describing the demographic history of Sumatran
and Bornean orang-utan populations.
∗ [email protected]
1 Introduction
Both in the Bayesian as well as in the frequentist framework, statistical inference commonly uses the likelihood function. However, under a wide range of
complex models, no explicit formula is available. This occurs for example when
modelling population genetic and evolutionary processes (Marjoram and Tavaré,
2006), in spatial statistics (Soubeyrand et al., 2009), and in queuing systems
(Heggland and Frigessi, 2004). Furthermore, such situations also occur with
dynamical systems as used for example in systems biology (Toni et al., 2009),
and epidemiology (McKinley et al., 2009). Here, we consider statistical models
with an intractable distribution theory, but a known data generating process
under which simulations can be obtained.
A recent approach to overcome this problem is Approximate Bayesian Computation (ABC) (Beaumont et al., 2002). Its basis is a rejection algorithm:
Parameter values are randomly drawn from the prior distribution, data sets are
then simulated under these values. To reduce complexity, informative but low
dimensional summaries are derived from the data sets. All parameter values
that gave rise to summary statistics similar to those computed for the observed
data are then accepted as a sample from the posterior distribution. Hence, ABC
does not involve any kind of direct approximation of the likelihood, but with a
uniform prior (with support on a compact subset of the parameter space taken
large enough to contain the maximum likelihood estimator) it can be used to
obtain a simulation-based approximation of the likelihood surface and, subsequently, the maximum likelihood estimate for the observed summary statistic
values (Creel and Kristensen, 2013; Rubio and Johansen, 2013).
ABC has successfully been used in population genetic applications and also in
other fields (Beaumont, 2010). However, the sampling scheme can be inefficient
in high-dimensional problems: The higher the dimension of the parameter space
and the summary statistics, the lower the acceptance probability of a simulated data set. Consequently, more data sets need to be simulated to achieve
a good estimate of the posterior distribution. Alternatively, the acceptance
threshold needs to be relaxed, but this increases the bias in the approximation to the posterior distribution. To reduce the sampling space, combinations
of ABC and iterative Monte Carlo methods such as particle filters and Markov
Chain Monte Carlo have been proposed (Beaumont et al., 2009; Wegmann et al.,
2009).
Here, we follow an alternative approach to obtain an approximate maximum
likelihood estimate. Instead of using random samples from the whole parameter space, we propose a stochastic gradient algorithm that approximates the
maximum likelihood estimate. Similar to ABC, it relies on lower-dimensional
summary statistics of the data. In the simultaneous perturbation algorithm,
in each iteration two noisy evaluations of the likelihood are sufficient to obtain
an ascent direction (Spall, 1992). To this end the likelihood is approximated
by kernel density estimation on summary statistics of data simulated under the
parameter value of interest.
Our algorithm is related to a method suggested in Diggle and Gratton (1984)
(see also Fermanian and Salanié, 2004). There, an approximate maximum likelihood estimate is obtained using a stochastic version of the Nelder-Mead algorithm. However, the authors explore only applications to 1-dimensional i.i.d.
data. In principle, our approach can also be applied in the context of indirect
inference, where summary statistics are derived from a tractable auxiliary model
(Drovandi et al., 2011).
2 Method
We will start by describing the so-called simultaneous perturbation algorithm
for optimizing the expected value of a random function using simulations. Next,
we explain how this algorithm can be adapted to obtain maximum likelihood
estimates. Details on the tuning can be found in the following section.
2.1 General simultaneous perturbation algorithm
Let Y ∈ R be a random variable depending on Θ ∈ R^p. The function L(Θ) =
E(Y | Θ) shall be maximized in Θ, but L(Θ) as well as the gradient ∇L(Θ)
are unknown. If a realization y(Θ) of Y | Θ can be observed for any value of
Θ, the maximizer arg max_{Θ ∈ R^p} L(Θ) can be approximated by the simultaneous
perturbation algorithm (Spall, 1992).
Similar to a deterministic gradient algorithm, it is based on the recursion

$$\Theta_k = \Theta_{k-1} + a_k \nabla L(\Theta_{k-1}),$$
where ak ∈ R+ is a decreasing sequence. However, as ∇L(Θk−1) is unknown, it
is substituted by a rescaled putative ascent direction that is randomly chosen
among the vertices of the unit hypercube. Thus, in iteration k, a random vector
δk with elements

$$\delta_k^{(i)} = \begin{cases} -1 & \text{with probability } 1/2,\\ +1 & \text{with probability } 1/2, \end{cases} \qquad (1)$$

for i = 1, . . . , p is generated, and the ascent direction is approximated by

$$\delta_k \, \frac{y(\Theta_k + c_k \delta_k) - y(\Theta_k - c_k \delta_k)}{2 c_k},$$

with ck ∈ R+ being a decreasing sequence.
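As an illustration, the following is a minimal sketch of this recursion in Python. The noisy objective, the gain constants, and the toy usage are illustrative placeholders rather than part of the original specification; the gain sequences anticipate the tuning choices discussed in section 3.1.

```python
import numpy as np

def spsa_maximize(noisy_objective, theta0, K=1000, a=0.1, c=0.1,
                  alpha=1.0, gamma=1/6, A=100, rng=None):
    """Simultaneous perturbation stochastic approximation (maximization).

    noisy_objective(theta) returns a noisy realization y(theta) of E(Y | theta).
    """
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, K + 1):
        a_k = a / (k + A) ** alpha            # decreasing step size sequence
        c_k = c / k ** gamma                  # decreasing perturbation size
        # Random direction: a vertex of the hypercube {-1, +1}^p.
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        # Two noisy evaluations suffice for the approximate ascent direction.
        y_plus = noisy_objective(theta + c_k * delta)
        y_minus = noisy_objective(theta - c_k * delta)
        theta = theta + a_k * delta * (y_plus - y_minus) / (2 * c_k)
    return theta

# Toy usage: maximize E(Y | theta) for Y = -||theta - 3||^2 + noise.
estimate = spsa_maximize(
    lambda t: -np.sum((t - 3.0) ** 2) + np.random.normal(), theta0=np.zeros(2))
```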
2.2 Approximate maximum likelihood algorithm
Suppose data Dobs are observed under model M with unknown parameter
vector Θ ∈ R^p. Let L(Θ; Dobs) = p(Dobs | Θ) denote the likelihood of Θ. For
complex models, often there is no closed form expression for the likelihood, and
for high dimensional data sets, the likelihood can be difficult to estimate. As in
ABC, we therefore consider L(Θ; Sobs) = p(Sobs | Θ), an approximation to the
original likelihood that uses a d-dimensional vector of summary statistics Sobs
instead of the original data Dobs.

An estimate L̂(Θ; Sobs) of L(Θ; Sobs) can be obtained from simulations under M(Θ) using kernel density estimation. By setting y(Θ) = L̂(Θ; Sobs), we
adapt the simultaneous perturbation algorithm to approximate the maximum
likelihood estimate Θ̂ML of Θ. In more detail, the algorithm is as follows:
Algorithm (AML). Let ak, ck ∈ R+ be two decreasing sequences. Let Hk be
a sequence of symmetric positive definite d × d matrices and κ a d-dimensional
kernel function satisfying $\int_{\mathbb{R}^d} \kappa(x)\,dx = 1$. Choose a starting value Θ0 ∈ R^p.

For k = 1, 2, . . . , K:

1. Choice of a likely ascent direction in Θk−1:

   (a) Generate a random vector δk as defined in equation (1).

   (b) Simulate datasets D1−, . . . , Dn− from M(Θ−) and D1+, . . . , Dn+ from
       M(Θ+) with Θ± = Θk−1 ± ck δk.

   (c) Compute summary statistics Sj− on dataset Dj− and Sj+ on Dj+ for
       j = 1, . . . , n.

   (d) Estimate the likelihoods L̂(Θ−; Sobs) = p̂(Sobs | Θ−) and L̂(Θ+; Sobs) =
       p̂(Sobs | Θ+) from the summary statistics S1−, . . . , Sn− and S1+, . . . , Sn+,
       respectively, with multivariate kernel density estimation (Wand and
       Jones, 1995):

       $$\hat{L}(\Theta^-; S_{obs}) = \frac{1}{n \sqrt{\det(H_k)}} \sum_{j=1}^{n} \kappa\!\left(H_k^{-1/2}\left(S_{obs} - S_j^-\right)\right)$$

       and analogously for L̂(Θ+; Sobs).

   (e) Estimate the ascent direction ∇l(Θk−1; Sobs) by

       $$\hat{\nabla}_{c_k} \hat{l}(\Theta_{k-1}; S_{obs}) = \delta_k \, \frac{\log \hat{L}(\Theta^+; S_{obs}) - \log \hat{L}(\Theta^-; S_{obs})}{2 c_k}.$$

2. Updating Θk:

   $$\Theta_k = \Theta_{k-1} + a_k \hat{\nabla}_{c_k} \hat{l}(\Theta_{k-1}; S_{obs}).$$

Then, the approximate maximum likelihood (AML) estimate is Θ̂AML := ΘK.
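To make step 1(d) concrete, the sketch below estimates the log-likelihood from n simulated summary statistic vectors with a standard Gaussian kernel and a diagonal bandwidth matrix. The simulator interface `simulate_summaries` and the rule-of-thumb bandwidth are illustrative assumptions; the kernel modification and bandwidth choice we actually recommend are described in section 3.2.

```python
import numpy as np

def estimate_log_likelihood(simulate_summaries, theta, s_obs, n=100, H_diag=None):
    """Estimate log L(theta; s_obs) = log p(s_obs | theta) by kernel density
    estimation on n simulated summary statistic vectors."""
    # S has shape (n, d): one simulated summary vector per row.
    S = np.asarray([simulate_summaries(theta) for _ in range(n)])
    d = S.shape[1]
    if H_diag is None:
        # Rule-of-thumb diagonal bandwidth matrix (cf. section 3.2).
        H_diag = (4 / (d + 2)) ** (2 / (d + 4)) * n ** (-2 / (d + 4)) * S.var(axis=0)
    # Gaussian kernel evaluated at H^{-1/2} (s_obs - S_j), in log scale.
    q = (((s_obs - S) ** 2) / H_diag).sum(axis=1)   # x' H^{-1} x per sample
    log_terms = -0.5 * q - 0.5 * (d * np.log(2 * np.pi) + np.log(H_diag).sum())
    # log( (1/n) * sum_j kernel_j ), computed stably (log-sum-exp).
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum()) - np.log(n)
```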
If, in a Bayesian setting, a prior distribution π(Θ) has been specified, the
algorithm can be modified to approximate the maximum of the posterior distribution, Θ̂MAP, by multiplying L̂(Θ−; Sobs) and L̂(Θ+; Sobs) by π(Θ−) and
π(Θ+), respectively.
2.3 Parametric bootstrap
Confidence intervals and estimates of the bias and standard error of Θ̂AML can
be obtained by parametric bootstrap: B bootstrap datasets are simulated from
the model M(Θ̂AML ) and the AML algorithm is run on each dataset to obtain
the bootstrap estimates Θ̂∗AML,1 , . . . , Θ̂∗AML,B . This sample reflects both the
error of the maximum likelihood estimator as well as the approximation error.
We compute simple bootstrap confidence intervals that are based on the assumption that the distribution of Θ̂AML − Θ can be approximated sufficiently
well by the distribution of Θ̂∗AML − Θ̂AML, where Θ̂∗AML is the bootstrap estimator. Then, a two-sided (1 − α)-confidence interval is defined as

$$\left[\, 2\hat{\Theta}_{AML} - q_{(1-\alpha/2)}(\hat{\Theta}^{*}_{AML}),\ \ 2\hat{\Theta}_{AML} - q_{(\alpha/2)}(\hat{\Theta}^{*}_{AML}) \,\right],$$

where q(β)(Θ̂∗AML) denotes the β quantile of Θ̂∗AML,1, . . . , Θ̂∗AML,B (Davison and
Hinkley, 1997).
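In code, this interval together with the bootstrap bias and standard error estimates can be obtained from the B replicates as follows; a small sketch in which `theta_boot` holds the bootstrap estimates row-wise.

```python
import numpy as np

def basic_bootstrap_ci(theta_aml, theta_boot, alpha=0.05):
    """Two-sided (1 - alpha) basic bootstrap interval per dimension:
    [2*theta_aml - q_{1-alpha/2}, 2*theta_aml - q_{alpha/2}]."""
    theta_boot = np.asarray(theta_boot)            # shape (B, p)
    q_lo = np.quantile(theta_boot, alpha / 2, axis=0)
    q_hi = np.quantile(theta_boot, 1 - alpha / 2, axis=0)
    lower = 2 * theta_aml - q_hi
    upper = 2 * theta_aml - q_lo
    bias = theta_boot.mean(axis=0) - theta_aml     # bootstrap bias estimate
    se = theta_boot.std(axis=0, ddof=1)            # bootstrap standard error
    return lower, upper, bias, se
```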
3 Tuning guidelines
The performance of the algorithm strongly depends on the choice of the sequences ak and ck. However, their optimal values depend on the unknown likelihood function. Another challenge in practical applications is the stochasticity: large steps can by chance lead the algorithm very far away from the maximum.
3.1 Choice of ak and ck and convergence diagnostics
Here, we consider sequences of the form

$$a_k = \frac{a}{(k + A)^{\alpha}} \qquad \text{and} \qquad c_k = \frac{c}{k^{\gamma}},$$

as proposed in Spall (2003). To reflect the varying slope of L in different directions of the parameter space, we choose a ∈ R^p (this is equivalent to scaling the
space accordingly). Choosing α = 1 and γ = 1/6 ensures optimal convergence
speed (Spall, 2003). The optimal choice of a and c depends on the unknown
shape of L. Therefore, we propose the following heuristic, based on suggestions
in Spall (2003) and our own experience, to determine these values as well as
A ∈ N, a shift parameter that avoids a too fast decay of the step size in the first
iterations (a sketch in code follows the list below).
Let K be the number of planned iterations and b ∈ R^p a vector that gives the
desired step size in early iterations in each dimension. Choose a starting value
Θ0.

1. Set c to a small percentage of the total parameter range (usually between
   1% and 5%) to obtain c1.

2. Set A = ⌊0.1 · K⌋.

3. Choose a:

   (a) Estimate ∇l(Θ0; Sobs) by the median of n1 finite differences approximations
       (step 1 of the AML algorithm), $\bar{\hat{\nabla}}_{c_1} \hat{l}(\Theta_0; S_{obs})$.

   (b) Set

       $$a^{(i)} = \frac{b^{(i)} (A + 1)^{\alpha}}{\left| \bar{\hat{\nabla}}_{c_1} \hat{l}(\Theta_0; S_{obs})^{(i)} \right|} \qquad \text{for } i = 1, \ldots, p.$$
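A sketch of this heuristic follows; the gradient approximation routine `grad_fd` and the 2% choice for c are illustrative placeholders.

```python
import numpy as np

def tune_gains(grad_fd, theta0, param_range, K, b, alpha=1.0, n1=20):
    """Heuristic choice of the gain parameters a, c and the shift A.

    grad_fd(theta, c) returns one simultaneous perturbation estimate of the
    gradient of log L at theta (step 1 of the AML algorithm); param_range is
    the total range of each parameter; b is the desired step size in early
    iterations, per dimension."""
    # 1. Perturbation size: a small percentage (here 2%) of the range.
    c = 0.02 * np.asarray(param_range)
    # 2. Shift parameter: avoids fast decay of a_k in early iterations.
    A = int(0.1 * K)
    # 3. Median of n1 gradient approximations at the starting value.
    grads = np.asarray([grad_fd(theta0, c) for _ in range(n1)])
    g_med = np.median(grads, axis=0)
    # Scale a so that the first steps have size roughly b per dimension.
    a = b * (A + 1) ** alpha / np.abs(g_med)
    return a, c, A
```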
As a is determined using information about the likelihood in Θ0 only, it
might not be adequate in other regions of the parameter space. To be able to
distinguish convergence from a step size that is too small, we simultaneously monitor
the growth of the likelihood function and the trend in the parameter estimates to
adjust a if necessary. Every K0 iterations, the following three tests are conducted
on the preceding K0 iterates (the first two are sketched in code at the end of this
subsection):

• Trend test (too small a): For each dimension i = 1, . . . , p, a trend in
  Θk^(i) is tested using the standard random walk model

  $$\Theta_k^{(i)} = \Theta_{k-1}^{(i)} + \beta + \epsilon_k,$$

  where β denotes the trend and εk ∼ N(0, σ²). The null hypothesis β = 0
  can be tested by a t-test on the differences Δk^(i) = Θk^(i) − Θk−1^(i). If a trend
  is detected, a^(i) is increased by a fixed factor f ∈ R+.

• Range test (too large a): For each dimension i = 1, . . . , p, a^(i) is set
  to a^(i)/f if the trajectory of Θk^(i) spans more than 70% of the parameter
  range.

• Convergence test: Simulate n2 likelihood estimates at Θk−K0 and at
  Θk. Growth of the likelihood is then tested by a one-sided Welch's t-test.
  (A standard hypothesis test is used here; testing for equivalence could be used
  instead, see Wellek (2010).)

We conclude that the algorithm has converged to a maximum only if the
convergence test did not reject the null hypothesis three times in a row and at
the same time no adjustments of a were necessary. In the applications of the
algorithm presented in the article, f = 1.5.
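A minimal sketch of the trend and range tests is given below; the convergence test is analogous, comparing the two sets of likelihood estimates with Welch's t-test, e.g. via `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
from scipy import stats

def trend_test(theta_traj, level=0.05):
    """t-test for a drift in each component of the iterate trajectory.

    Under the random walk model Theta_k = Theta_{k-1} + beta + eps_k,
    beta = 0 is tested by a one-sample t-test on the differences.
    theta_traj has shape (K0 + 1, p); returns one boolean per dimension."""
    diffs = np.diff(np.asarray(theta_traj), axis=0)
    t, p = stats.ttest_1samp(diffs, popmean=0.0, axis=0)
    return p < level   # True: trend detected, so increase a in that dimension

def range_test(theta_traj, param_range, frac=0.7):
    """True where the trajectory spans more than frac of the parameter
    range, indicating that a is too large in that dimension."""
    span = theta_traj.max(axis=0) - theta_traj.min(axis=0)
    return span > frac * np.asarray(param_range)
```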
3.2 Kernel density estimation
To enable the estimation of the gradient even far away from the maximum,
a kernel function with infinite support is helpful. The Gaussian kernel is an
obvious option, but its high rate of decay can by chance cause very large steps
leading away from the maximum. Therefore, we use the following modification
of the Gaussian kernel:

$$\kappa(H^{1/2} x) \propto \begin{cases} \exp\left(-\tfrac{1}{2}\, x' H^{-1} x\right) & \text{if } x' H^{-1} x < 1,\\[4pt] \exp\left(-\tfrac{1}{2} \sqrt{x' H^{-1} x}\right) & \text{otherwise.} \end{cases}$$
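A sketch of this modified kernel, assuming the diagonal bandwidth matrix discussed below; only the unnormalized log-kernel is needed in practice, because the normalizing constant cancels from the log-likelihood difference in step 1(e) when the same Hk is used at Θ− and Θ+.

```python
import numpy as np

def modified_gaussian_log_kernel(x, H_diag):
    """Unnormalized log of the modified Gaussian kernel: Gaussian decay
    inside the unit ellipsoid x'H^{-1}x < 1, and the much slower
    exp(-sqrt(x'H^{-1}x)/2) decay outside, so that likelihood ratios far
    from the maximum cannot produce huge gradient steps.

    x has shape (..., d); H_diag holds the diagonal of H."""
    q = ((x ** 2) / H_diag).sum(axis=-1)    # x' H^{-1} x
    return np.where(q < 1.0, -0.5 * q, -0.5 * np.sqrt(q))
```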
In degenerate cases where the likelihood evaluates to zero numerically, we
replace the classical kernel density estimate by a nearest neighbor estimate in
step 1d: If L̂(Θ−; Sobs) ≈ 0 and/or L̂(Θ+; Sobs) ≈ 0 (with "≈" denoting "numerically
equivalent to"), find

$$S^{-}_{\min} := \underset{S_j^-}{\arg\min}\ \left\{ \left\| S_j^- - S_{obs} \right\| : j = 1, \ldots, n \right\},$$

$$S^{+}_{\min} := \underset{S_j^+}{\arg\min}\ \left\{ \left\| S_j^+ - S_{obs} \right\| : j = 1, \ldots, n \right\},$$

and recompute the kernel density estimates L̂(Θ−; Sobs) and L̂(Θ+; Sobs) in step 1d
using S^−_min and S^+_min, respectively, as the only observation.
To be computationally efficient, we estimate a diagonal bandwidth matrix
Hk using a multivariate extension of Silverman’s rule (Silverman, 1986; Härdle
et al., 2004). Using new estimates in each iteration introduces an additional
level of noise that can be reduced by using a moving average of bandwidth
estimates.
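A sketch of both ideas follows; the specific rule-of-thumb variant and the moving-average weight are illustrative assumptions, as the text does not pin them down.

```python
import numpy as np

def diagonal_bandwidth(S):
    """Rule-of-thumb diagonal bandwidth matrix for an (n x d) sample S,
    using one common multivariate extension of Silverman's rule."""
    n, d = S.shape
    factor = (4.0 / (d + 2)) ** (2.0 / (d + 4)) * n ** (-2.0 / (d + 4))
    return factor * S.var(axis=0, ddof=1)   # diagonal entries of H

def smoothed_bandwidth(H_prev_diag, H_new_diag, weight=0.9):
    """Exponential moving average of bandwidth estimates across iterations,
    damping the extra noise introduced by re-estimating H every time."""
    return weight * H_prev_diag + (1 - weight) * H_new_diag
```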
3.3 Starting points
To avoid starting in very unfavourable regions of the likelihood, a set of random
points should be drawn from the parameter space and their simulated likelihood
values compared to find useful starting points. This strategy also helps to prevent
the algorithm from reaching only a local maximum.
3.4 Constraints
The parameter space will usually be subject to constraints (e.g., rates are positive quantities). They can be incorporated by projecting the iterate to the
closest point such that both Θ− as well as Θ+ are in the feasible set (Sadegh,
1997). Even if there are no imperative constraints, it is advisable to restrict the
parameter space to a range of plausible values to prevent the algorithm from
trailing off at random in regions of very low likelihood.
To reduce the effect of large steps within the boundaries, we clamp the step
size at 10% of the range of the feasible set in each dimension.
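For box constraints, the projection and the step clamping can be sketched as follows; per-dimension lower and upper bounds are an assumption here, and since |δ^(i)| = 1, projecting so that both Θ− and Θ+ stay feasible reduces to clipping.

```python
import numpy as np

def constrained_update(theta, step, lower, upper, c_k, clamp_frac=0.1):
    """One constrained AML update: clamp the step to 10% of the feasible
    range per dimension, then project the new iterate so that both
    theta +/- c_k * delta remain inside [lower, upper]."""
    width = upper - lower
    step = np.clip(step, -clamp_frac * width, clamp_frac * width)
    theta_new = theta + step
    # Both perturbed points are feasible iff theta_new is in the shrunk box.
    return np.clip(theta_new, lower + c_k, upper - c_k)
```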
4 Examples
To study the performance of the AML algorithm, we test it on different applications. The first example is the multivariate normal distribution. While
there is no need for simulation based inference for normal models, it allows us
to compare the properties of the AML estimator and the maximum likelihood
estimator. Second, we apply the algorithm to a queuing process where the exact
evaluation of the likelihood can require a prohibitive number of steps. The third
example is from population genetics. Here, we use both simulated data, where
the true parameter values are known, as well as DNA sequence data from a
sample of Bornean and Sumatran orang-utans and estimate parameters of their
ancestral history.
4.1 Multivariate normal distribution
When the AML algorithm is used to estimate the mean vector Θ = (µ1, . . . , µ10)
of a 10-dimensional normal distribution with a diagonal variance-covariance matrix,
using the arithmetic means X̄ = (X̄1, . . . , X̄10) as summary statistics, convergence is
achieved quickly and the results are extremely accurate.

One dataset is simulated under a 10-dimensional normal distribution such
that the maximum likelihood estimator for Θ is Θ̂ML = X̄ ∼ N(5 · 1_10, I_10),
where I_10 denotes the 10-dimensional identity matrix and 1_10 the 10-dimensional
vector with 1 in each component. To estimate the distribution of Θ̂AML, the
AML algorithm is run 1000 times on this dataset with summary statistics S = X̄.
For each AML estimate, 1000 points are drawn randomly on (−100, 100) ×
. . . × (−100, 100). For each of them, the likelihood is simulated and the 5 points
with the highest likelihood estimate are used as starting points. On each of
them, the AML algorithm is run for at least 10000 iterations and stopped as soon
as convergence is reached (for ≈ 90% of the sequences within 11000 iterations;
convergence is tested every K0 = 1000 iterations). Again, the likelihood is
simulated on each result and the one with the highest likelihood is considered
as a realization of Θ̂AML . For each evaluation of the likelihood, n = 100 datasets
are simulated. Based on these 1000 realizations of Θ̂AML , the density, bias and
standard error of Θ̂AML are estimated for each dimension (tab. 1, fig. 1).
The densities of the 10 components of Θ̂AML are symmetric around the maximum likelihood estimate with a standard error that is nearly 20 times smaller
than the standard error of Θ̂ML and a negligible bias. Bootstrap confidence
intervals that were obtained from B = 100 bootstrap samples for 100 datasets
simulated under the same model as above show a close approximation to the
intended coverage probability of 95%.
To investigate the impact of the choice of summary statistics on the results,
we repeat the experiment with the following set of summary statistics:

$$\left( \bar{X}_1,\ \bar{X}_2 + \bar{X}_3,\ \bar{X}_2 - \bar{X}_3,\ \bar{X}_4 + \bar{X}_5,\ \bar{X}_5 + \bar{X}_6,\ \bar{X}_6 + \bar{X}_4,\ \bar{X}_7,\ \bar{X}_7 + \bar{X}_8,\ \bar{X}_9,\ \bar{X}_9 \cdot \bar{X}_{10} \right) \qquad (2)$$
For ≈ 90% of the sequences, convergence was detected within 14000 iterations. Bootstrap confidence intervals are obtained for 100 datasets. Their
coverage probability matches the nominal 95% confidence level closely (tab. 1).
The components behave very similarly to the previous simulations, except for
the estimates of the density for components 9 and 10 (fig. 2). Compared to
the simulations with S = X̄, the bias of components 9 and 10 is considerably
increased, but it is still much smaller than the standard error of Θ̂ML . To investigate how fast the bias decreases with the number of iterations, we re-run the
above described algorithm for 100000 iterations without earlier stopping (fig. 3).
Both bias and standard error decrease with the number of iterations.
Table 1: Properties of Θ̂AML in the 10-dimensional normal distribution model using 2 different sets of summary statistics

                  S = X̄                      S as in eq. (2)
dim.      b̂         se       p̂         b̂         se       p̂
1      -0.0016   0.0546    93       -0.0007   0.0660    93
2       0.0017   0.0544    94       -0.0002   0.0638    93
3       0.0004   0.0547    94        0.0007   0.0635    97
4      -0.0033   0.0567    90       -0.0032   0.0789    98
5      -0.0015   0.0584    95       -0.0000   0.0748    90
6      -0.0000   0.0565    94        0.0044   0.0757    90
7       0.0017   0.0557    96        0.0035   0.0686    95
8      -0.0007   0.0559    95       -0.0030   0.1140    96
9      -0.0013   0.0554    91        0.0809   0.0658    98
10      0.0001   0.0554    99       -0.0595   0.0922    92

Bias (b̂) and standard error (se) of Θ̂AML, estimated from 1000 runs of the AML algorithm. Coverage probability of bootstrap 95% confidence intervals (p̂) with B = 100, estimated from 100 datasets.
4.2 Queueing process
The theory of queueing processes is a very prominent and well-studied field
of applied probability with applications in many different areas like production management, communication and information technology, and health care
(Kingman, 2009; Gross et al., 2008). Here, we consider a first-come-first-served
queuing process with a single server, a Markovian distribution of interarrival
times of customers and a general distribution of service times (M/G/1). It
is modelled by the following sequences of random variables: the service time
Um, the interarrival time Wm and the interdeparture time Ym for customer
m = 1, . . . , M (Heggland and Frigessi, 2004).

[Figure 1: Density of the components of Θ̂AML obtained with S = X̄ in one dataset, estimated from 1000 converged sequences with a minimum length of 10000 iterations by kernel density estimation. Vertical dashed line: Θ̂ML.]

[Figure 2: Density of the components of Θ̂AML obtained with S as in eq. (2) in one dataset, estimated from 1000 converged sequences with a minimum length of 10000 iterations by kernel density estimation. Vertical dashed line: Θ̂ML.]

[Figure 3: Absolute bias and standard error of the AML estimator of µ9 and µ10 using the summary statistics in eq. (2), estimated from 1000 runs of the algorithm on the same dataset.]

Um and Wm are independent
and their distributions are known except for the parameters. Here we assume
Um ∼ U([θ1, θ1 + θ2]) and Wm ∼ Exp(θ3). Ym is defined as

$$Y_m = \begin{cases} U_m & \text{if } \sum_{i=1}^{m} W_i \le \sum_{i=1}^{m-1} Y_i,\\[6pt] U_m + \sum_{i=1}^{m} W_i - \sum_{i=1}^{m-1} Y_i & \text{otherwise,} \end{cases}$$

for m = 1, . . . , M.
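A sketch of this data generating process and of the summary statistics used below; the function names are illustrative, and the quartiles are taken to include the median.

```python
import numpy as np

def simulate_mg1_interdepartures(theta, M=100, rng=None):
    """Simulate the interdeparture times Y_1, ..., Y_M of the M/G/1 queue
    with U_m ~ U([theta1, theta1 + theta2]) and W_m ~ Exp(theta3),
    following the recursion above."""
    theta1, theta2, theta3 = theta
    rng = np.random.default_rng(rng)
    U = rng.uniform(theta1, theta1 + theta2, size=M)   # service times
    W = rng.exponential(1.0 / theta3, size=M)          # rate theta3
    arrivals = np.cumsum(W)                            # sum_{i<=m} W_i
    Y = np.empty(M)
    dep_sum = 0.0                                      # sum_{i<=m-1} Y_i
    for m in range(M):
        if arrivals[m] <= dep_sum:
            Y[m] = U[m]                          # customer already waiting
        else:
            Y[m] = U[m] + arrivals[m] - dep_sum  # server idle until arrival
        dep_sum += Y[m]
    return Y

def mg1_summaries(y):
    """Minimum, quartiles and maximum of the interdeparture times."""
    return np.quantile(y, [0.0, 0.25, 0.5, 0.75, 1.0])
```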
If only the interdeparture times Ym are observed, the evaluation of the likelihood of the parameter vector Θ = (θ1 , θ2 , θ3 ) requires an exponential number of
steps in M (Heggland and Frigessi, 2004). Approximate inference has been conducted with indirect inference (Heggland and Frigessi, 2004) and ABC methods
(Blum and François, 2010).
To estimate the distribution of Θ̂AML in this model, we simulate 100 datasets
of size M = 100 under the same parameter value Θ = (1, 5, 0.2) and compute
Θ̂AML for each of them. The summary statistics are the minimum, maximum
and the quartiles of Y1 , . . . , YM (Blum and François, 2010). To estimate the
standard error of Θ̂AML per dataset, we re-run the algorithm five times on each
dataset.
Each time, we randomly draw 100 starting values from the search space
(0, 10) × (0, 10) × (0.05, 10) and run the AML algorithm on the best five starting
values for 5000 iterations (runtime of an implementation in R, version 3.0.1
(R Core Team, 2013): 1.9 hours on a single core). Finally, we estimate the
likelihood on the 5 results and the best one is used as a realization of Θ̂AML .
Each likelihood estimation is based on n = 100 simulations. For 3 datasets,
the AML algorithm is run 100 times to obtain kernel density estimates of the
density of Θ̂AML .
Table 2: Properties of Θ̂AML in the M/G/1 queue

            θ1          θ2          θ3
true        1           5           0.2
space       (0, 10)     (0, 10)     (0.05, 10)
mean        0.942       5.031       0.229
median      0.948       4.891       0.226
bias        -0.058      0.031       0.029
mean se     0.062       0.437       0.004
total se    0.098       1.046       0.024

Figures: marginal densities of components of Θ̂AML, estimated by kernel density estimation. Black solid line: density of Θ̂AML estimated over 100 simulated data sets. Coloured dotted and dashed lines: density of Θ̂AML estimated separately in 3 example datasets. Vertical black line: true parameter value.
Summaries: true: true parameter value; space: search space; mean: mean of Θ̂AML; median: median of Θ̂AML; bias: bias of Θ̂AML; mean se: mean standard error of Θ̂AML per dataset; total se: standard error of Θ̂AML.
In all three dimensions, the standard error of Θ̂AML is considerably larger
across datasets than within dataset (tab. 2). This indicates that the approximation algorithm only adds a small additional error to the error of Θ̂ML . In all
dimensions the density of Θ̂AML is slightly asymmetric around the true value,
but only the bias of θ̂3 is of the same order as the standard error. It is probably caused by a strongly skewed likelihood function (Blum and François, 2010)
that emphasises the error of the finite differences approximation to the slope.
Additional simulations of the likelihood on a grid between θ̂3 and θ3 locate the
maximum around 0.209. Re-running the algorithm for K = 10000 iterations
reduces the bias from 0.029 to 0.024. The parameter estimate with the largest
standard error is θ̂2 which is also the most difficult parameter to estimate with
ABC (Blum and François, 2010).
4.3 Population genetics
The goal of population genetics is to understand how evolutionary forces like
mutation, recombination, selection and genetic drift shape the variation within
a population. A major challenge in the application of population genetic theory
is to infer parameters of the evolutionary history of a population from DNA sequences of present-day samples only. The questions range from estimating single
parameters like the mutation rate or the recombination rate to determining a
complex demographic structure with multiple parameters.
Since the introduction of the coalescent by Kingman in the early eighties
(Kingman, 1982a,b,c), a very elegant and flexible stochastic process is at hand
to model the ancestral history of a population. The coalescent can capture population structure, varying population size and different mating schemes. Extensions to incorporate recombination and selection exist as well (Nordborg, 2007;
Wakeley, 2008).
However, statistical inference on data obtained from a coalescent process is
difficult: Under the coalescent, computing the likelihood of a parameter vector is
a computationally very expensive task even for a single locus, since all possible
ancestral trees need to be taken into account (Stephens, 2007). The number
of binary tree topologies explodes with the number of sampled individuals, n,
e.g., with n = 10, the number of possible topologies is 2,571,912,000; with
n = 1000, it is already 3.01 · 10^4831 (Wakeley, 2008, p. 83, table 3.2).
Consequently, exact computation of the likelihood is feasible in reasonable
time for a very restricted set of models and a small number of haplotypes only
(Griffiths and Tavaré, 1994; Wu, 2010). To make use of the large DNA data sets
generated by modern sequencing technology, approximate methods have been
developed in frequentist as well as in Bayesian frameworks. In fact, the need
for inferential tools that allow for large genomic data sets was a major driver
in the development of ABC. The focus of the early ABC papers was clearly on
population genetic applications (Weiss and von Haeseler, 1998; Pritchard et al.,
1999; Beaumont et al., 2002) and since then it has found a large variety of
applications in the field of population genetics (Csilléry et al., 2010; Beaumont,
2010; Bertorelle et al., 2010).
4.3.1 Population genetic history of orang-utans
Pongo pygmaeus and Pongo abelii, Bornean and Sumatran orang-utans, respectively, are Asian great apes whose distributions are exclusive to the islands
of Borneo and Sumatra. Recurring glacial periods that led to a cooler, drier
and more seasonal climate might have contracted rain forest and isolated the
population of orang-utans. At the same time, the sea level dropped and land
bridges among islands created opportunities for migration among previously
isolated populations. However, whether glacial periods have been an isolating
or a connecting factor remains poorly understood. Therefore, there has been a
considerable interest in using genetic data to understand the demographic history despite the computational difficulties involved in such a population genetic
analysis. We will compare our results to the analysis of the orang-utan genome
paper (Locke et al., 2011) and a more comprehensive study by Ma et al. (2013),
the only analyses that have been performed genome-wide. Both
studies use DaDi (Gutenkunst et al., 2009), a state-of-the-art software package that has been
widely used for demographic inference. It is based on the diffusion approximation to the coalescent. The orang-utan SNP data, consisting of 4-fold degenerate
(synonymous) sites are taken from the 10 sequenced individuals (5 individuals
each per orang-utan population, haploid sample size 10, Locke et al., 2011).
As in Locke et al. (2011) and Ma et al. (2013), we consider an Isolation-Migration (IM) model where a panmictic ancestral population of effective size
NA splits τ years ago into two distinct populations of constant effective size NB
(the Bornean population) and NS (the Sumatran population) with backward
migration rates µBS (fraction of the Bornean population that is replaced by
Sumatran migrants per generation) and µSB (vice versa; fig. 4a).
[Figure 4: (a) Isolation-migration model for the ancestral history of orang-utans. NA, effective size of the ancestral population; µBS (µSB), fraction of the Bornean (Sumatran) population that is replaced by Sumatran (Bornean) migrants per generation (backwards migration rate); τ, split time in years; NB (NS), effective population size in Borneo (Sumatra). (b) Binned joint site frequency spectrum (adapted from Naduvilezhath et al., 2011).]
NA is set to the present effective population size that we obtain using the
number of SNPs in our considered data set and assuming an average per generation mutation rate per nucleotide of 2 · 10^−8 and a generation time of 20 years
(Locke et al., 2011), so NA = 17400.
There are no sufficient summary statistics at hand, but for the IM model the
joint site frequency spectrum (JSFS) between the two populations was reported
to be a particularly informative summary statistic (Tellier et al., 2011). However, for N samples in each of the two demes, the JSFS has (N + 1)² − 2 entries,
so even for small datasets it is very high-dimensional. To reduce this to more
manageable levels, we follow Naduvilezhath et al. (2011) and bin categories of
entries (fig. 4b). As the ancestral state is unknown, we use the folded binned
JSFS, which has 28 entries. To incorporate observations of multiple unlinked loci,
mean and standard deviation across loci are computed for each bin, so the final
summary statistics vector is of length 56.
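Once the per-locus binned (and folded) spectra are available, the final summary vector consists of per-bin means and standard deviations across loci. A minimal sketch; the binning itself follows fig. 4b and Naduvilezhath et al. (2011) and is not reproduced here.

```python
import numpy as np

def jsfs_summaries(binned_jsfs_per_locus):
    """Combine binned folded JSFS counts across unlinked loci into the
    final summary vector: per-bin mean and standard deviation.

    binned_jsfs_per_locus has shape (n_loci, 28); the result has
    length 56 (28 means followed by 28 standard deviations)."""
    x = np.asarray(binned_jsfs_per_locus, dtype=float)
    return np.concatenate([x.mean(axis=0), x.std(axis=0, ddof=1)])
```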
The AML algorithm with the JSFS as summary statistics was implemented
as an extra module in the msms package, a coalescent simulation program (Ewing and Hermisson, 2010). This allows for complex demographic models and
permits fast summary statistic evaluation without using external programs and
the associated performance penalties.
4.3.2 Simulations
Before applying the AML algorithm to the actual orang-utan DNA sequences,
we tested it on simulated data. Under the described IM model with parameters
NB = NS = 17400, µBS = µSB = 1.44 · 10^−5 and τ = 695000, we simulated 25
datasets with 25 haploid sequences per deme, each of them consisting of 75 loci
with 130 SNPs each.
For each dataset, 25 AML estimates were obtained with the same scheme:
1000 random starting points were drawn from the parameter space; the likelihood was estimated with n = 40 simulations. Then, the 5 values with the
highest likelihood estimates were used as starting points for the AML algorithm. The algorithm converged after 3000-25000 iterations (average: ≈ 8000
iterations; average runtime of msms: 11.7 hours on a single core).
For the migration rates µSB and µBS , the per dataset variation of the estimates is considerably smaller than the total variation (tab. 3). This suggests
that the approximation error of the AML algorithm is small in comparison to
the error of Θ̂ML . For the split time τ and the population sizes NS and NB ,
the difference is less pronounced, but still apparent. For all parameters, the
average bias of the estimates is either smaller or of approximately the same
size as the standard error. As the maximum likelihood estimate itself cannot be
computed, it is impossible to disentangle the bias of Θ̂ML and an additional bias
introduced by the AML algorithm. As an alternative measure of performance,
we compare the likelihood estimated at the true parameter values and the AML
estimate with the highest estimated likelihood on each dataset. Only in 2 of
the 25 datasets, the likelihood at the true value is higher, but not significantly
so, whereas it is significantly lower in 12 of them (significance was tested with a
one-sided Welch test using a Bonferroni-correction to account for multiple testing). This suggests that the AML algorithm usually produces estimates that
are closer to the maximum likelihood estimate than the true parameter value
is.
To investigate the impact of the underlying parameter value on the quality
of the estimates, simulation results were obtained also for 25 datasets simulated
with τ twice as large. Here, all parameter estimates, especially τ , NB and
NS had larger standard errors and biases (tab. 4). Apparently, the estimation
problem is more difficult for more distant split times. This can be caused by a
flatter likelihood surface and by stronger random noise in the simulations. Only
the migration rates are hardly affected by the larger τ: a longer divergence time
allows for more migration events, which might facilitate their estimation.
4.3.3 Real data
We model the ancestral history of orang-utans with the same model and use
the same summary statistics as tested in the simulations. Also the datasets are
simulated in the same manner.
To study the distribution of Θ̂AML on this dataset, the algorithm is run
20 times with the same scheme as in the simulations, so we can estimate the
standard error of Θ̂AML in this single dataset. The best one is used as the final
parameter estimate Θ̂AML and is used for bootstrapping.
Confidence intervals are obtained by parametric bootstrap with B = 1000
bootstrap datasets. The bootstrap replicates are also used for bias correction
and estimation of the standard error (tab. 5).
In Locke et al. (2011) and Ma et al. (2013), parameters of two different IM
models are estimated; denote these estimates Θ̂1 and Θ̂2. Scaled to the ancestral
population size NA = 17372, the estimates are shown in tab. 6.
Model 1 is identical to the model considered here, so we simulate the likelihood at Θ̂1 within our framework for comparison. Since log L̂(Θ̂1 ) = −217.015
(se = 7.739) is significantly lower than log L̂(Θ̂AML ) = −162.732 (se = 7.258),
it seems that Θ̂AML is closer to the maximum likelihood estimate than the competing estimate. Note, however, that we are only using a subset of the data to
avoid sites under selection and that the authors report convergence problems of
DaDi in this model.
For model 2, the ancestral population splits in two subpopulations of size
s and 1 − s relative to the ancestral population and the subpopulations experience exponential growth. Here a direct likelihood comparison with the DaDi
estimates given in Locke et al. (2011) is impossible. However, a rough comparison shows that the AML estimates for τ , NB and NS lie between Θ̂1 and Θ̂2
and for µBS and µSB they are of similar size.
4.3.4 Orang-utan data set
This real data set is based on data sets from two publications, De Maio et al. (2013) and
Ma et al. (2013). For the first, CCDS alignments of H. sapiens, P. troglodytes
and P. abelii (references hg18, panTro2 and ponAbe2) were downloaded from
the UCSC genome browser (http://genome.ucsc.edu).
Table 3: Properties of Θ̂AML in the IM model with short divergence time (τ = 695000 years)

         true       space                   mean       median     bias        mean se    total se
µBS      1.44e-05   (1.44e-06, 0.000144)    1.3e-05    1.21e-05   -1.43e-06   2e-06      4.48e-06
µSB      1.44e-05   (1.44e-06, 0.000144)    1.43e-05   1.39e-05   -7.2e-08    2.19e-06   3.77e-06
τ        695000     (139000, 6950000)       1020000    946000     330000      305000     389000
NS       17400      (1740, 174000)          21700      21000      4370        3070       4100
NB       17400      (1740, 174000)          22400      21800      5070        3160       4550

Figures and summaries: as in Tab. 2.
Table 4: Properties of Θ̂AML in the IM model with long divergence time (τ = 1390000 years)

         true       space                   mean       median     bias        mean se    total se
µBS      1.44e-05   (1.44e-06, 0.000144)    1.24e-05   1.12e-05   -1.97e-06   3.38e-06   4.99e-06
µSB      1.44e-05   (1.44e-06, 0.000144)    1.25e-05   1.17e-05   -1.88e-06   3.28e-06   4.38e-06
τ        1390000    (139000, 6950000)       2360000    2280000    967000      876000     1e+06
NS       17400      (1740, 174000)          25700      24600      8350        6620       7560
NB       17400      (1740, 174000)          26000      25200      8670        6680       7870

Figures and summaries: as in Tab. 2.
Table 5: Parameter estimates for the ancestral history of orang-utans

           µBS         µSB         τ          NS       NB
Θ̂AML       5.023e-06   4.600e-06   1300681    52998    21971
Θ̂*AML      4.277e-06   3.806e-06   1402715    52715    22233
se*        1.244e-06   9.366e-07   208391     7223     2779
se         1.992e-07   1.066e-07   194868     6083     2963
lower      0¹          1.633e-07   715590     31476    13290
upper      6.627e-06   4.931e-06   1820852    67118    27426

¹ This confidence interval was shifted to the positive range. The original value was (-9.09e-07, 5.72e-06).

Θ̂AML, approximate maximum likelihood estimate; Θ̂*AML, bootstrap bias-corrected estimate; se*, bootstrap standard error of Θ̂AML; se, standard error of Θ̂AML in this dataset, estimated from 20 replicates of Θ̂AML; lower and upper, limits of the 95% simultaneous bootstrap confidence intervals. All bootstrap results were obtained with B = 1000 bootstrap replicates. The simultaneous 95% confidence intervals are computed following a simple Bonferroni argument, so the coverage probabilities are 99% in each dimension (Davison and Hinkley, 1997).
Table 6: Results from Locke et al. (2011, Tab. S21-1) (same results for Model 2 reported in Ma et al. (2013)), scaled with Ne = 17400.

           µBS         µSB         τ          NS        NB
Model 1    9.085e-07   7.853e-07   6948778    129889    50934
Model 2    1.518e-05   2.269e-05   630931     35976     10093
Θ̂AML       5.023e-06   4.600e-06   1300681    52998     21971

Model 1: IM model in figure 4a. Model 2: IM model where the ancestral population splits in two subpopulations with a ratio of s = 0.503 (estimated) going to Borneo and 1 − s to Sumatra, and exponential growth in both subpopulations (Locke et al., 2011, Fig. S21-3). Here, NB and NS are the present population sizes.
Only CCDS alignments satisfying the following requirements were retained for the subsequent
analyses: divergence from human reference below 10%, no gene duplication in
any species, start and stop codons conserved, no frame-shifting gaps, no gap
longer than 30 bases, no nonsense codon, no gene shorter than 21 bases, no
gene with different number of exons in different species, or genes in different
chromosomes in different species (chromosomes 2a and 2b in non-humans were
identified with human chromosome 2). From the remaining CCDSs (9,695 genes,
79,677 exons) we extracted synonymous sites. We only considered third codon
positions where the first two nucleotides of the same codon were conserved in
the alignment, as well as the first position of the next codon.
Furthermore, orang-utan SNP data for the two (Bornean and Sumatran)
populations considered, each with 5 sequenced individuals (Locke et al., 2011),
were kindly provided by X. Ma and are available online (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=contact&handle=WUGSC_SNP&batch_id=1054968).
The final total number of synonymous sites included was 1,950,006. Among them, a subset of 9750 4-fold degenerate synonymous sites that are polymorphic in the orang-utan populations was selected.
5 Discussion
In this article, we propose an algorithm to approximate the maximum likelihood
estimator in models with an intractable likelihood. Simulations and kernel density estimates of the likelihood are used to obtain the ascent directions in a
stochastic approximation algorithm, so it is flexibly applicable to a wide variety
of models.
Alternative simulation based approximate maximum likelihood methods have
been proposed that estimate the likelihood surface from samples from the whole
parameter space that are obtained in an ABC like fashion (Creel and Kristensen,
2013; Rubio and Johansen, 2013) or using MCMC (de Valpine, 2004). The
maximum likelihood estimator is obtained subsequently by standard numerical
optimization. Conversely, our method only uses simulations and estimates of
the likelihood along the trajectory of the stochastic approximation algorithm,
thereby approximating the maximum likelihood estimator more efficiently. This
is along the lines of Diggle and Gratton (1984) where a stochastic version of the
Nelder-Mead algorithm is used, but through the use of summary statistics we
can drop the restrictive i.i.d. assumption. In the setting of hidden Markov
models, Ehrlich et al. (2013) have proposed a recursive maximum likelihood
algorithm that also combines ABC methodology with the simultaneous perturbations algorithm. A thorough comparison of the properties of different approximate maximum likelihood methods is beyond the scope of this paper, but
the three presented examples show that the algorithm provides fast and reliable
approximations to the corresponding maximum likelihood estimators.
The population genetics application where we estimate parameters of the
evolutionary history of orang-utans demonstrates that very high-dimensional
summary statistics (here: 56 dimensions) can be used successfully without any
dimension-reduction techniques. Usually, high-dimensional kernel density estimation is not recommended because of the curse of dimensionality (e.g., Wand
and Jones, 1995), but stochastic approximation algorithms are explicitly designed to cope with noisy measurements. To this end, we also introduce modifications of the algorithm that reduce the impact of single noisy likelihood estimates. In our experience, this is crucial in settings with a low signal-to-noise
ratio.
Furthermore, the examples show that the AML algorithm performs well in
problems with a high-dimensional and large parameter space: In the normal
distribution example, the 10-dimensional maximum likelihood estimate is approximated very precisely even though the search space spans 200 times the
standard error of Θ̂ML in each dimension.
However, we also observe a bias for a few of the estimated parameters.
Partly, for example for one of the parameters of the queuing process, this can
be attributed to a bias of the maximum likelihood estimator itself. In addition, it
is known that the finite differences approximation to the gradient in the KieferWolfowitz algorithm causes a bias that vanishes only asymptotically (Spall,
2003), and that is possibly increased by the finite-sample bias of the kernel
density estimator. In most cases though, the bias is smaller than the standard
error of the approximate maximum likelihood estimator and can be made still
smaller by carrying out longer runs of the algorithm.
As sufficiently informative summary statistics are crucial for the reliability of
both maximum likelihood and Bayesian estimates, the quality of the estimates
obtained from our AML algorithm will also depend on an appropriate choice
of summary statistics. This has been discussed extensively in the context of
ABC (Fearnhead and Prangle, 2012; Blum et al., 2013). General results and
algorithms to choose a small set of informative summary statistics should carry
over to the AML algorithm.
In addition to the point estimate, we suggest obtaining confidence intervals
by parametric bootstrap. The bootstrap replicates can also be used for bias correction. Resampling in models where the data have a complex internal structure
catches both the noise of the maximum likelihood estimator as well as the approximation error. Alternatively, the AML algorithm may also complement the
information obtained via ABC in a Bayesian framework: the location of the
maximum a posteriori estimate can be obtained from the AML algorithm.
The presented work shows the broad applicability of the approximate maximum likelihood algorithm and also its robustness in highly stochastic settings.
With the implementation in the coalescent simulation program msms (Ewing and
Hermisson, 2010) its potential for population genetic applications can easily be
explored further.
Acknowledgments
Johanna Bertl was supported by the Vienna Graduate School of Population Genetics (Austrian Science Fund (FWF): W1225-B20) and worked on this project
while employed at the Department of Statistics and Operations Research, University of Vienna, Austria. The computational results presented have partly
been achieved using the Vienna Scientific Cluster (VSC). The orang-utan SNP
data was kindly provided by X. Ma. Parts of this article have been published
in the PhD thesis of Johanna Bertl at the University of Vienna.
References
M. A. Beaumont. Approximate Bayesian computation in evolution and ecology.
Annual Review of Ecology, Evolution and Systematics, 41:379–406, 2010.
M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162:2025–2035, 2002.
M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive
approximate Bayesian computation. Biometrika, 2009.
G. Bertorelle, A. Benazzo, and S. Mona. ABC as a flexible framework to estimate
demography over space and time: some cons, many pros. Molecular Ecology,
19(13):2609–2625, 2010.
M. G. B. Blum and O. François. Non-linear regression models for approximate
Bayesian computation. Statistics and Computing, 20:63–73, 2010.
M. G. B. Blum, M. A. Nunes, D. Prangle, and S. A. Sisson. A comparative review of dimension reduction methods in approximate Bayesian computation.
Statistical Science, 28(2):189–208, 2013.
M. Creel and D. Kristensen. Indirect Likelihood Inference (revised). UFAE and
IAE Working Papers 931.13, Unitat de Fonaments de l’Anàlisi Econòmica
(UAB) and Institut d’Anàlisi Econòmica (CSIC), June 2013. URL http:
//ideas.repec.org/p/aub/autbar/931.13.html.
K. Csilléry, M. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian
computation (ABC) in practice. Trends in Ecology and Evolution, 25:410–418,
2010.
A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Applications.
Cambridge University Press, Cambridge, 1997.
N. De Maio, C. Schlötterer, and C. Kosiol. Linking great apes genome evolution
across time scales using polymorphism-aware phylogenetic models. Molecular
Biology and Evolution, 30(10):2249–2262, 2013.
P. de Valpine. Monte Carlo state-space likelihoods by weighted posterior kernel
density estimation. Journal of the American Statistical Association, 99(466):
523–536, 2004.
P. J. Diggle and R. J. Gratton. Monte Carlo methods of inference for implicit
statistical models. Journal of the Royal Statistical Society. Series B (Methodological), 46(2):193–227, 1984.
C. C. Drovandi, A. N. Pettitt, and M. J. Faddy. Approximate Bayesian computation using indirect inference. Journal of the Royal Statistical Society: Series
C (Applied Statistics), 60(3):317–337, 2011.
E. Ehrlich, A. Jasra, and N. Kantas. Gradient free parameter estimation for hidden Markov models with intractable likelihoods. Methodology and Computing
in Applied Probability, pages 1–35, 2013.
G. Ewing and J. Hermisson. MSMS: a coalescent simulation program including
recombination, demographic structure and selection at a single locus. Bioinformatics, 26(16):2064–2065, 2010.
P. Fearnhead and D. Prangle. Constructing summary statistics for approximate
Bayesian computation: Semi-automatic approximate Bayesian computation.
Journal of the Royal Statistical Society, Series B (Statistical Methodology),
74:419–474, 2012.
J.-D. Fermanian and B. Salanié. A nonparametric simulated maximum likelihood estimation method. Econometric Theory, 20(4):701–734, 2004.
R. C. Griffiths and S. Tavaré. Ancestral inference in population genetics. Statistical Science, 9(3):307–319, 1994.
D. Gross, J. J. Shortle, J. M. Thompson, and C. M. Harris. Fundamentals of
queueing theory. Wiley, 4th edition, 2008.
R. N. Gutenkunst, R. D. Hernandez, S. H. Williamson, and C. D. Bustamante.
Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics, 5(10), 2009.
W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and semiparametric models. Springer Series in Statistics. Springer, 2004.
K. Heggland and A. Frigessi. Estimating functions in indirect inference. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):447–
462, 2004.
J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications,
13:235–248, 1982a.
J. F. C. Kingman. Exchangeability and the evolution of large populations.
In G. Koch and F. Spizzichino, editors, Exchangeability in Probability and
Statistics, pages 97–112. North-Holland, Amsterdam, 1982b.
J. F. C. Kingman. On the genealogy of large populations. Journal of Applied
Probability, 19:27–43, 1982c.
J. F. C. Kingman. The first Erlang century – and the next. Queueing Systems,
63:3–12, 2009.
D. P. Locke, L. W. Hillier, W. C. Warren, K. C. Worley, L. V. Nazareth, D. M.
Muzny, S.-P. Yang, Z. Wang, A. T. Chinwalla, P. Minx, et al. Comparative
and demographic analysis of orang-utan genomes. Nature, 469:529–533, 2011.
X. Ma, J. L. Kelly, K. Eilertson, S. Musharoff, J. D. Degenhardt, A. L. Martins,
T. Vinar, C. Kosiol, A. Siepel, R. N. Gutenkunst, and C. D. Bustamante.
Population genomic analysis reveals a rich speciation and demographic history
of orang-utans (Pongo pygmaeus and Pongo abelii ). PLOS ONE, 8, 2013.
P. Marjoram and S. Tavaré. Modern computational approaches for analysing
molecular genetic variation data. Nature Reviews Genetics, 7:759–770, 2006.
T. McKinley, A. R. Cook, and R. Deardon. Inference in epidemic models without
likelihoods. The International Journal of Biostatistics, 5, 2009.
L. N. Naduvilezhath, L. E. Rose, and D. Metzler. Jaatha: a fast compositelikelihood approach to estimate demographic parameters. Molecular Ecology,
20:2709–2723, 2011.
M. Nordborg. Coalescent theory. In D. J. Balding, M. Bishop, and C. Cannings, editors, Handbook of Statistical Genetics, volume 2, pages 843–877.
John Wiley & Sons, third edition, 2007.
J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: A study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2013. URL http:
//www.R-project.org/.
F. J. Rubio and A. M. Johansen. A simple approach to maximum intractable
likelihood estimation. Electronic Journal of Statistics, 7:1632–1654, 2013.
P. Sadegh. Constrained optimization via stochastic approximation with a simultaneous perturbation gradient approximation. Automatica, 33(5):889–892,
1997.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
S. Soubeyrand, F. Carpentier, N. Desassis, and J. Chadœuf. Inference with
a contrast-based posterior distribution and application in spatial statistics.
Statistical Methodology, 6(5):466–477, 2009.
J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37
(3):352–355, 1992.
J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation,
Simulation and Control. Wiley, 2003.
M. Stephens. Inference under the coalescent. In D. J. Balding, M. Bishop,
and C. Cannings, editors, Handbook of Statistical Genetics, volume 2, pages
878–908. John Wiley & Sons, third edition, 2007.
A. Tellier, P. Pfaffelhuber, B. Haubold, L. Naduvilezhath, L. E. Rose, T. Städler,
W. Stephan, and D. Metzler. Estimating parameters of speciation models
based on refined summaries of the joint site-frequency spectrum. PLOS ONE,
6(5), 05 2011.
T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. H. Stumpf. Approximate
Bayesian computation scheme for parameter inference and model selection in
dynamical systems. Journal of the Royal Society Interface, 6:187–202, 2009.
J. Wakeley. Coalescent Theory. Roberts & Company Publishers, 2008.
M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall, 1995.
D. Wegmann, C. Leuenberger, and L. Excoffier. Efficient approximate Bayesian
computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207–1218, 2009.
G. Weiss and A. von Haeseler. Inference of population history using a likelihood
approach. Genetics, 149(3):1539–1546, 1998.
S. Wellek. Testing Statistical Hypotheses of Equivalence and Noninferiority.
CRC Press, Taylor & Francis, 2010.
Y. Wu. Exact computation of coalescent likelihood for panmictic and subdivided populations under the infinite sites model. IEEE/ACM Transactions
on Computational Biology and Bioinformatics, 7(4):611–618, 2010.