
Approximate Maximum Likelihood Estimation


arXiv:1507.04553v1 [stat.CO] 16 Jul 2015

Johanna Bertl (Aarhus University, Denmark; [email protected])
Gregory Ewing (École polytechnique fédérale de Lausanne, Switzerland)
Carolin Kosiol (Vetmeduni Vienna, Austria)
Andreas Futschik (Johannes Kepler University Linz, Austria)

July 17, 2015

Abstract

In recent years, methods of approximate parameter estimation have attracted considerable interest in complex problems where exact likelihoods are hard to obtain. In their most basic form, Bayesian methods such as Approximate Bayesian Computation (ABC) involve sampling from the parameter space and keeping those parameters that produce data that fit sufficiently well to the actually observed data. Exploring the whole parameter space, however, makes this approach inefficient in high-dimensional problems. This led to the proposal of more sophisticated iterative methods of inference such as particle filters. Here, we propose an alternative approach that is based on stochastic gradient methods and applicable both in a frequentist and a Bayesian setting. By moving along a simulated gradient, the algorithm produces a sequence of estimates that will eventually converge either to the maximum likelihood estimate or to the maximum of the posterior distribution, in each case under a set of observed summary statistics. To avoid reaching only a local maximum, we propose to run the algorithm from a set of random starting values. As good tuning of the algorithm is important, we explored several tuning strategies, and propose a set of guidelines that worked best in our simulations. We investigate the performance of our approach in simulation studies, and also apply the algorithm to two models with intractable likelihood functions. First, we present an application to inference in the context of queuing systems. We also re-analyze population genetic data and estimate parameters describing the demographic history of Sumatran and Bornean orang-utan populations.

1 Introduction

Both in the Bayesian as well as in the frequentist framework, statistical inference commonly uses the likelihood function. However, under a wide range of complex models, no explicit formula is available. This occurs for example when modelling population genetic and evolutionary processes (Marjoram and Tavaré, 2006), in spatial statistics (Soubeyrand et al., 2009), and in queuing systems (Heggland and Frigessi, 2004). Furthermore, such situations also occur with dynamical systems as used for example in systems biology (Toni et al., 2009) and epidemiology (McKinley et al., 2009). Here, we consider statistical models with an intractable distribution theory, but a known data generating process under which simulations can be obtained.

A recent approach to overcome this problem is Approximate Bayesian Computation (ABC) (Beaumont et al., 2002). Its basis is a rejection algorithm: parameter values are randomly drawn from the prior distribution, and data sets are then simulated under these values. To reduce complexity, informative but low-dimensional summaries are derived from the data sets. All parameter values that gave rise to summary statistics similar to those computed for the observed data are then accepted as a sample from the posterior distribution.
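For concreteness, the rejection step just described can be sketched in a few lines of R. This is a minimal sketch, not the authors' implementation: rprior(), rmodel() and summarize() are hypothetical placeholders for a prior sampler, the data-generating process and the summary computation, and the Euclidean distance with threshold eps stands in for whatever closeness criterion is used.

```r
# Sketch of the ABC rejection sampler described above. rprior(), rmodel()
# and summarize() are hypothetical placeholders; eps is the acceptance
# threshold on the Euclidean distance between summary statistics.
abc_rejection <- function(n_sim, s_obs, eps, rprior, rmodel, summarize) {
  accepted <- list()
  for (i in seq_len(n_sim)) {
    theta <- rprior()                       # draw a parameter from the prior
    s_sim <- summarize(rmodel(theta))       # summaries of one simulated dataset
    if (sqrt(sum((s_sim - s_obs)^2)) < eps) # accept if close to the observed data
      accepted[[length(accepted) + 1]] <- theta
  }
  do.call(rbind, accepted)  # rows form a sample from the approximate posterior
}
```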
Hence, ABC does not involve any kind of direct approximation of the likelihood, but with a uniform prior (with support on a compact subset of the parameter space taken large enough to contain the maximum likelihood estimator) it can be used to obtain a simulation-based approximation of the likelihood surface and, subsequently, the maximum likelihood estimate for the observed summary statistic values (Creel and Kristensen, 2013; Rubio and Johansen, 2013).

ABC has successfully been used in population genetic applications and also in other fields (Beaumont, 2010). However, the sampling scheme can be inefficient in high-dimensional problems: the higher the dimension of the parameter space and the summary statistics, the lower the acceptance probability of a simulated data set. Consequently, more data sets need to be simulated to achieve a good estimate of the posterior distribution. Alternatively, the acceptance threshold needs to be relaxed, but this increases the bias in the approximation to the posterior distribution. To reduce the sampling space, combinations of ABC and iterative Monte Carlo methods such as particle filters and Markov chain Monte Carlo have been proposed (Beaumont et al., 2009; Wegmann et al., 2009).

Here, we follow an alternative approach to obtain an approximate maximum likelihood estimate. Instead of using random samples from the whole parameter space, we propose a stochastic gradient algorithm that approximates the maximum likelihood estimate. Similar to ABC, it relies on lower-dimensional summary statistics of the data. In the simultaneous perturbations algorithm, in each iteration two noisy evaluations of the likelihood are sufficient to obtain an ascent direction (Spall, 1992). To this end, the likelihood is approximated by kernel density estimation on summary statistics of data simulated under the parameter value of interest.

Our algorithm is related to a method suggested in Diggle and Gratton (1984) (see also Fermanian and Salanié, 2004). There, an approximate maximum likelihood estimate is obtained using a stochastic version of the Nelder-Mead algorithm. However, the authors explore only applications to 1-dimensional i.i.d. data. In principle, our approach can also be applied in the context of indirect inference, where summary statistics are derived from a tractable auxiliary model (Drovandi et al., 2011).

2 Method

We will start by describing the so-called simultaneous perturbation algorithm for optimizing the expected value of a random function using simulations. Next, we explain how this algorithm can be adapted to obtain maximum likelihood estimates. Details on the tuning can be found in the following section.

2.1 General simultaneous perturbation algorithm

Let Y ∈ R be a random variable depending on Θ ∈ R^p. The function L(Θ) = E(Y | Θ) shall be maximized in Θ, but L(Θ) as well as the gradient ∇L(Θ) are unknown. If a realization y(Θ) of Y | Θ can be observed for any value of Θ, the maximizer arg max_{Θ ∈ R^p} L(Θ) can be approximated by the simultaneous perturbation algorithm (Spall, 1992). Similar to a deterministic gradient algorithm, it is based on the recursion

    Θ_k = Θ_{k-1} + a_k ∇L(Θ_{k-1}),

where a_k ∈ R^+ is a decreasing sequence. However, as ∇L(Θ_{k-1}) is unknown, it is substituted by a rescaled putative ascent direction that is randomly chosen among the vertices of the unit hypercube. Thus, in iteration k, a random vector δ_k with elements

    δ_k^{(i)} = -1 with probability 1/2, or +1 with probability 1/2,    (1)

for i = 1, ..., p, is generated, and the ascent direction is approximated by

    δ_k [y(Θ_{k-1} + c_k δ_k) - y(Θ_{k-1} - c_k δ_k)] / (2 c_k),

with c_k ∈ R^+ being a decreasing sequence.
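As a minimal sketch (not the authors' implementation), the following R function runs this recursion for a user-supplied noisy evaluation y(); the gain sequences a_k = a/(k + A)^α and c_k = c/k^γ anticipate the tuning choices discussed in section 3.1.

```r
# Minimal sketch of the simultaneous perturbation recursion. y() is a
# placeholder for a noisy, simulation-based evaluation of L(theta); the
# gain sequences follow the form discussed in section 3.1.
spsa_maximize <- function(theta0, y, K, a, c, alpha = 1, gamma = 1/6, A = 0) {
  theta <- theta0
  p <- length(theta)
  for (k in seq_len(K)) {
    ak <- a / (k + A)^alpha                       # step length sequence
    ck <- c / k^gamma                             # perturbation size sequence
    delta <- sample(c(-1, 1), p, replace = TRUE)  # random hypercube vertex, eq. (1)
    g <- delta * (y(theta + ck * delta) - y(theta - ck * delta)) / (2 * ck)
    theta <- theta + ak * g                       # move along the putative ascent direction
  }
  theta
}
```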
2.2 Approximate maximum likelihood algorithm

Suppose data D_obs are observed under model M with unknown parameter vector Θ ∈ R^p. Let L(Θ; D_obs) = p(D_obs | Θ) denote the likelihood of Θ. For complex models there is often no closed-form expression for the likelihood, and for high-dimensional data sets, the likelihood can be difficult to estimate. As in ABC, we therefore consider L(Θ; S_obs) = p(S_obs | Θ), an approximation to the original likelihood that uses a d-dimensional vector of summary statistics S_obs instead of the original data D_obs. An estimate L̂(Θ; S_obs) of L(Θ; S_obs) can be obtained from simulations under M(Θ) using kernel density estimation. By setting y(Θ) = L̂(Θ; S_obs), we adapt the simultaneous perturbation algorithm to approximate the maximum likelihood estimate Θ̂_ML of Θ. In more detail, the algorithm is as follows (an R sketch of one iteration follows after section 2.3):

Algorithm (AML). Let a_k, c_k ∈ R^+ be two decreasing sequences. Let H_k be a sequence of symmetric positive definite d × d matrices and κ a d-dimensional kernel function satisfying ∫_{R^d} κ(x) dx = 1. Choose a starting value Θ_0 ∈ R^p. For k = 1, 2, ..., K:

1. Choice of a likely ascent direction in Θ_{k-1}:

   (a) Generate a random vector δ_k as defined in equation (1).

   (b) Simulate datasets D_1^-, ..., D_n^- from M(Θ^-) and D_1^+, ..., D_n^+ from M(Θ^+) with Θ^± = Θ_{k-1} ± c_k δ_k.

   (c) Compute summary statistics S_j^- on dataset D_j^- and S_j^+ on D_j^+ for j = 1, ..., n.

   (d) Estimate the likelihoods L̂(Θ^-; S_obs) = p̂(S_obs | Θ^-) and L̂(Θ^+; S_obs) = p̂(S_obs | Θ^+) from the summary statistics S_1^-, ..., S_n^- and S_1^+, ..., S_n^+, respectively, with multivariate kernel density estimation (Wand and Jones, 1995):

       L̂(Θ^-; S_obs) = [1 / (n √det(H_k))] Σ_{j=1}^n κ(H_k^{-1/2} (S_obs - S_j^-)),

       and analogously for L̂(Θ^+; S_obs).

   (e) Estimate the ascent direction ∇l(Θ_{k-1}; S_obs) by

       ∇̂_{c_k} l̂(Θ_{k-1}; S_obs) = δ_k [log L̂(Θ^+; S_obs) - log L̂(Θ^-; S_obs)] / (2 c_k).

2. Updating Θ_k:

       Θ_k = Θ_{k-1} + a_k ∇̂_{c_k} l̂(Θ_{k-1}; S_obs).

Then, the approximate maximum likelihood (AML) estimate is Θ̂_AML := Θ_K.

If, in a Bayesian setting, a prior distribution π(Θ) has been specified, the algorithm can be modified to approximate the maximum of the posterior distribution, Θ̂_MAP, by multiplying L̂(Θ^-; S_obs) and L̂(Θ^+; S_obs) by π(Θ^-) and π(Θ^+), respectively.

2.3 Parametric bootstrap

Confidence intervals and estimates of the bias and standard error of Θ̂_AML can be obtained by parametric bootstrap: B bootstrap datasets are simulated from the model M(Θ̂_AML) and the AML algorithm is run on each dataset to obtain the bootstrap estimates Θ̂*_AML,1, ..., Θ̂*_AML,B. This sample reflects both the error of the maximum likelihood estimator as well as the approximation error. We compute simple bootstrap confidence intervals that are based on the assumption that the distribution of Θ̂_AML - Θ can be approximated sufficiently well by the distribution of Θ̂*_AML - Θ̂_AML, where Θ̂*_AML is the bootstrap estimator. Then, a two-sided (1 - α)-confidence interval is defined as

    [2 Θ̂_AML - q_{(1-α/2)}(Θ̂*_AML), 2 Θ̂_AML - q_{(α/2)}(Θ̂*_AML)],

where q_{(β)}(Θ̂*_AML) denotes the β quantile of Θ̂*_AML,1, ..., Θ̂*_AML,B (Davison and Hinkley, 1997).
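As an illustration, one iteration of the algorithm can be sketched as follows. This is a sketch under simplifying assumptions, not the authors' implementation: rmodel() and summarize() are hypothetical placeholders for M(Θ) and the summary computation, the bandwidth matrix H is treated as fixed and diagonal, and a plain Gaussian product kernel stands in for the modified kernel introduced in section 3.2.

```r
# One AML iteration (steps 1 and 2), as a sketch. rmodel() and summarize()
# are hypothetical placeholders for M(theta) and the summary computation;
# H is a fixed diagonal bandwidth matrix and a plain Gaussian product
# kernel is used instead of the modified kernel of section 3.2.
kde_loglik <- function(theta, s_obs, n, H, rmodel, summarize) {
  S <- t(replicate(n, summarize(rmodel(theta))))  # n x d matrix of summaries
  h <- sqrt(diag(H))                              # per-dimension bandwidths
  dens <- mean(apply(S, 1, function(s) prod(dnorm((s_obs - s) / h)))) / prod(h)
  log(dens)  # -Inf if all kernel mass vanishes; see the fallback in section 3.2
}

aml_step <- function(theta, k, s_obs, n, H, a, c, alpha, gamma, A,
                     rmodel, summarize) {
  ak <- a / (k + A)^alpha
  ck <- c / k^gamma
  delta <- sample(c(-1, 1), length(theta), replace = TRUE)  # eq. (1)
  lp <- kde_loglik(theta + ck * delta, s_obs, n, H, rmodel, summarize)
  lm <- kde_loglik(theta - ck * delta, s_obs, n, H, rmodel, summarize)
  theta + ak * delta * (lp - lm) / (2 * ck)  # update of step 2
}
```

A complete run would wrap aml_step in a loop over k = 1, ..., K together with the tuning and convergence diagnostics of section 3; in a Bayesian setting, log π(Θ^±) would simply be added to the two log-likelihood estimates.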
3 Tuning guidelines

The performance of the algorithm strongly depends on the choice of the sequences a_k and c_k. However, their optimal values depend on the unknown likelihood function. Another challenge in the practical application is the stochasticity: large steps can by chance lead very far away from the maximum.

3.1 Choice of a_k and c_k and convergence diagnostics

Here, we consider sequences of the form

    a_k = a / (k + A)^α  and  c_k = c / k^γ,

as proposed in Spall (2003). To reflect the varying slope of L in different directions of the parameter space, we choose a ∈ R^p (this is equivalent to scaling the space accordingly). Choosing α = 1 and γ = 1/6 ensures optimal convergence speed (Spall, 2003). The optimal choice of a and c depends on the unknown shape of L. Therefore, we propose the following heuristic, based on suggestions in Spall (2003) and our own experience, to determine these values as well as A ∈ N, a shift parameter that avoids a too fast decay of the step size in the first iterations. Let K be the number of planned iterations and b ∈ R^p a vector that gives the desired step size in early iterations in each dimension. Choose a starting value Θ_0.

1. Set c to a small percentage of the total parameter range (usually between 1% and 5%) to obtain c_1.

2. Set A = ⌊0.1 K⌋.

3. Choose a:

   (a) Estimate ∇l(Θ_0; S_obs) by the median of n_1 finite-difference approximations (step 1 of the AML algorithm), denoted ∇̄̂_{c_1} l̂(Θ_0; S_obs).

   (b) Set

       a^{(i)} = b^{(i)} (A + 1)^α / |∇̄̂_{c_1} l̂(Θ_0; S_obs)^{(i)}|

       for i = 1, ..., p.

As a is determined using information about the likelihood in Θ_0 only, it might not be adequate in other regions of the parameter space. To be able to distinguish convergence from a too small step size, we simultaneously monitor the growth of the likelihood function and the trend in the parameter estimates to adjust a if necessary. Every K_0 iterations, the following three tests are conducted on the preceding K_0 iterates (a sketch of the trend test follows below):

- Trend test (too small a): For each dimension i = 1, ..., p, a trend in Θ_k^{(i)} is tested using the standard random walk model

      Θ_k^{(i)} = Θ_{k-1}^{(i)} + β + ε_k,

  where β denotes the trend and ε_k ~ N(0, σ²). The null hypothesis β = 0 can be tested by a t-test on the differences Δ_k^{(i)} = Θ_k^{(i)} - Θ_{k-1}^{(i)}. If a trend is detected, a^{(i)} is increased by a fixed factor f ∈ R^+.

- Range test (too large a): For each dimension i = 1, ..., p, a^{(i)} is set to a^{(i)}/f if the trajectory of Θ_k^{(i)} spans more than 70% of the parameter range.

- Convergence test: Simulate n_2 likelihood estimates at Θ_{k-K_0} and at Θ_k. Growth of the likelihood is then tested by a one-sided Welch's t-test. (A standard hypothesis test; testing for equivalence could be used instead, see Wellek (2010).)

We conclude that the algorithm has converged to a maximum only if the convergence test did not reject the null hypothesis three times in a row and at the same time no adjustments of a were necessary. In the applications of the algorithm presented in this article, f = 1.5.

3.2 Kernel density estimation

To enable the estimation of the gradient even far away from the maximum, a kernel function with infinite support is helpful. The Gaussian kernel is an obvious option, but its high rate of decay can by chance cause very large steps leading away from the maximum. Therefore, we use the following modification of the Gaussian kernel (also sketched below):

    κ(H^{-1/2} x) ∝ exp(-x' H^{-1} x / 2)      if x' H^{-1} x < 1,
    κ(H^{-1/2} x) ∝ exp(-√(x' H^{-1} x) / 2)   otherwise.
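Both the modified kernel and the trend test lend themselves to compact sketches. In the code below, x is assumed to be the already standardized difference H^{-1/2}(S_obs - S_j), so that sum(x^2) equals the Mahalanobis form in the display above; the kernel is left unnormalized because the constant cancels in the log-likelihood difference of step 1e.

```r
# Modified Gaussian kernel (section 3.2), as a sketch: Gaussian near the
# centre, slower (sub-Gaussian) decay in the tails. x is the standardized
# difference H^(-1/2)(s_obs - s), so sum(x^2) = (s_obs - s)' H^(-1) (s_obs - s).
# Unnormalized: the constant cancels in log L(theta+) - log L(theta-).
kappa_mod <- function(x) {
  m <- sum(x^2)
  if (m < 1) exp(-m / 2) else exp(-sqrt(m) / 2)
}

# Trend test (section 3.1), as a sketch: a t-test on the successive
# differences of one coordinate's trajectory; a detected trend suggests
# that the step size a^(i) is still too small.
trend_test <- function(theta_traj, level = 0.05) {
  d <- diff(theta_traj)              # differences Delta_k of the random walk model
  t.test(d, mu = 0)$p.value < level  # TRUE if a trend is detected
}
```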
In degenerate cases where the likelihood evaluates to zero numerically, we replace the classical kernel density estimate by a nearest-neighbour estimate in step 1d: if L̂(Θ^-; S_obs) ≈ 0 and/or L̂(Θ^+; S_obs) ≈ 0 (with "≈" denoting "numerically equivalent to"), find

    S_min^- := arg min_{S_j^-} { ||S_j^- - S_obs|| : j = 1, ..., n },
    S_min^+ := arg min_{S_j^+} { ||S_j^+ - S_obs|| : j = 1, ..., n },

and recompute the kernel density estimates L̂(Θ^-; S_obs) and L̂(Θ^+; S_obs) in step 1d using S_min^- and S_min^+, respectively, as the only observation.

To be computationally efficient, we estimate a diagonal bandwidth matrix H_k using a multivariate extension of Silverman's rule (Silverman, 1986; Härdle et al., 2004). Using new estimates in each iteration introduces an additional level of noise that can be reduced by using a moving average of bandwidth estimates (a short sketch is given at the end of this section).

3.3 Starting points

To avoid starting in very unfavourable regions of the likelihood, a set of random points should be drawn from the parameter space and their simulated likelihood values compared to find useful starting points. This strategy also helps to avoid that the algorithm reaches only a local maximum.

3.4 Constraints

The parameter space will usually be subject to constraints (e.g., rates are positive quantities). They can be incorporated by projecting the iterate to the closest point such that both Θ^- and Θ^+ are in the feasible set (Sadegh, 1997). Even if there are no imperative constraints, it is advisable to restrict the parameter space to a range of plausible values to prevent the algorithm from trailing off at random in regions of very low likelihood. To reduce the effect of large steps within the boundaries, we clamp the step size at 10% of the range of the feasible set in each dimension.
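The diagonal bandwidth rule and the step clamping can be sketched as follows. The constant (4/(d+2))^(1/(d+4)) is the standard multivariate rule-of-thumb factor and is an assumption here; the paper does not spell out the exact variant it uses.

```r
# Diagonal bandwidth by a multivariate rule-of-thumb extension of
# Silverman's rule (sketch; the exact variant used in the paper may differ).
silverman_diag <- function(S) {     # S: n x d matrix of simulated summaries
  n <- nrow(S); d <- ncol(S)
  h <- (4 / (d + 2))^(1 / (d + 4)) * n^(-1 / (d + 4)) * apply(S, 2, sd)
  diag(h^2, nrow = d)               # H = diag(h_i^2)
}

# Step clamping (section 3.4): limit each coordinate of a proposed step
# to 10% of the feasible range.
clamp_step <- function(step, lower, upper) {
  lim <- 0.1 * (upper - lower)
  pmin(pmax(step, -lim), lim)
}
```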
4 Examples

To study the performance of the AML algorithm, we test it on different applications. The first example is the multivariate normal distribution. While there is no need for simulation-based inference in normal models, it allows us to compare the properties of the AML estimator and the maximum likelihood estimator. Second, we apply the algorithm to a queuing process where the exact evaluation of the likelihood can require a prohibitive number of steps. The third example is from population genetics. Here, we use both simulated data, where the true parameter values are known, as well as DNA sequence data from a sample of Bornean and Sumatran orang-utans, and estimate parameters of their ancestral history.

4.1 Multivariate normal distribution

When the AML algorithm is used to estimate the mean vector Θ = (µ_1, ..., µ_10) of a 10-dimensional normal distribution with a diagonal variance-covariance matrix using the arithmetic means X̄ = (X̄_1, ..., X̄_10) as summary statistics, convergence is achieved quickly and the results are extremely accurate.

One dataset is simulated under a 10-dimensional normal distribution such that the maximum likelihood estimator for Θ is Θ̂_ML = X̄ ~ N(5 · 1_10, I_10), where I_10 denotes the 10-dimensional identity matrix and 1_10 the 10-dimensional vector with 1 in each component. To estimate the distribution of Θ̂_AML, the AML algorithm is run 1000 times on this dataset with summary statistics S = X̄. For each AML estimate, 1000 points are drawn randomly on (-100, 100) × ... × (-100, 100). For each of them, the likelihood is simulated, and the 5 points with the highest likelihood estimate are used as starting points (a sketch of this scheme follows below). From each of them, the AML algorithm is run for at least 10000 iterations and stopped as soon as convergence is reached (for ≈ 90% of the sequences within 11000 iterations; convergence is tested every K_0 = 1000 iterations). Again, the likelihood is simulated at each result, and the one with the highest likelihood is considered as a realization of Θ̂_AML. For each evaluation of the likelihood, n = 100 datasets are simulated.

Based on these 1000 realizations of Θ̂_AML, the density, bias and standard error of Θ̂_AML are estimated for each dimension (tab. 1, fig. 1). The densities of the 10 components of Θ̂_AML are symmetric around the maximum likelihood estimate, with a standard error that is nearly 20 times smaller than the standard error of Θ̂_ML and a negligible bias. Bootstrap confidence intervals that were obtained from B = 100 bootstrap samples for 100 datasets simulated under the same model as above show a close approximation to the intended coverage probability of 95%.

To investigate the impact of the choice of summary statistics on the results, we repeat the experiment with the following set of summary statistics:

    S = (X̄_1, X̄_2 + X̄_3, X̄_2 - X̄_3, X̄_4 + X̄_5, X̄_5 + X̄_6, X̄_6 + X̄_4, X̄_7, X̄_7 + X̄_8, X̄_9, X̄_9 · X̄_10).    (2)

For ≈ 90% of the sequences, convergence was detected within 14000 iterations. Bootstrap confidence intervals are obtained for 100 datasets. Their coverage probability matches the nominal 95% confidence level closely (tab. 1). The components behave very similarly to the previous simulations, except for the estimates of the density for components 9 and 10 (fig. 2). Compared to the simulations with S = X̄, the bias of components 9 and 10 is considerably increased, but it is still much smaller than the standard error of Θ̂_ML. To investigate how fast the bias decreases with the number of iterations, we re-run the above described algorithm for 100000 iterations without earlier stopping (fig. 3). Both bias and standard error decrease with the number of iterations.

Table 1: Properties of Θ̂_AML in the 10-dimensional normal distribution model using 2 different sets of summary statistics.

               S = X̄                      S as in eq. (2)
    dim.    b̂         se       p̂       b̂         se       p̂
    1      -0.0016    0.0546   93     -0.0007    0.0660   93
    2       0.0017    0.0544   94     -0.0002    0.0638   93
    3       0.0004    0.0547   94      0.0007    0.0635   97
    4      -0.0033    0.0567   90     -0.0032    0.0789   98
    5      -0.0015    0.0584   95     -0.0000    0.0748   90
    6      -0.0000    0.0565   94      0.0044    0.0757   90
    7       0.0017    0.0557   96      0.0035    0.0686   95
    8      -0.0007    0.0559   95     -0.0030    0.1140   96
    9      -0.0013    0.0554   91      0.0809    0.0658   98
    10      0.0001    0.0554   99     -0.0595    0.0922   92

    Bias (b̂) and standard error (se) of Θ̂_AML, estimated from 1000 runs of the AML algorithm. Coverage probability of bootstrap 95% confidence intervals (p̂) with B = 100, estimated from 100 datasets.

[Figure 1 (plot omitted): Density of the components of Θ̂_AML obtained with S = X̄ in one dataset, estimated from 1000 converged sequences with a minimum length of 10000 iterations by kernel density estimation. Vertical dashed line: Θ̂_ML.]

[Figure 2 (plot omitted): Density of the components of Θ̂_AML obtained with S as in eq. (2) in one dataset, estimated from 1000 converged sequences with a minimum length of 10000 iterations by kernel density estimation. Vertical dashed line: Θ̂_ML.]

[Figure 3 (plot omitted): Absolute bias and standard error of the AML estimator of µ_9 and µ_10 using the summary statistics in eq. (2), estimated from 1000 runs of the algorithm on the same dataset.]
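The starting-point scheme used in this example can be sketched as follows; loglik() is a placeholder for one simulated log-likelihood evaluation, and lower/upper define the box-shaped search space.

```r
# Starting-point selection (section 3.3), as a sketch: draw candidate
# points uniformly from the box [lower, upper], estimate the likelihood
# once at each, and keep the best few as starting values for independent
# AML runs.
pick_starts <- function(n_cand, n_keep, lower, upper, loglik) {
  p <- length(lower)
  cand <- matrix(runif(n_cand * p, lower, upper), ncol = p, byrow = TRUE)
  ll <- apply(cand, 1, loglik)          # one simulated log-likelihood each
  cand[order(ll, decreasing = TRUE)[seq_len(n_keep)], , drop = FALSE]
}
```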
4.2 Queueing process

The theory of queueing processes is a very prominent and well-studied field of applied probability, with applications in many different areas like production management, communication and information technology, and health care (Kingman, 2009; Gross et al., 2008). Here, we consider a first-come-first-served queuing process with a single server, a Markovian distribution of interarrival times of customers, and a general distribution of service times (M/G/1). It is modelled by the following sequences of random variables: the service time U_m, the interarrival time W_m and the interdeparture time Y_m for customer m = 1, ..., M (Heggland and Frigessi, 2004). U_m and W_m are independent and their distributions are known except for the parameters. Here we assume U_m ~ U([θ_1, θ_1 + θ_2]) and W_m ~ Exp(θ_3). Y_m is defined as

    Y_m = U_m                                        if Σ_{i=1}^m W_i ≤ Σ_{i=1}^{m-1} Y_i,
    Y_m = U_m + Σ_{i=1}^m W_i - Σ_{i=1}^{m-1} Y_i    otherwise,

for m = 1, ..., M. If only the interdeparture times Y_m are observed, the evaluation of the likelihood of the parameter vector Θ = (θ_1, θ_2, θ_3) requires a number of steps that grows exponentially in M (Heggland and Frigessi, 2004). Approximate inference has been conducted with indirect inference (Heggland and Frigessi, 2004) and with ABC methods (Blum and François, 2010).

To estimate the distribution of Θ̂_AML in this model, we simulate 100 datasets of size M = 100 under the same parameter value Θ = (1, 4, 0.2) and compute Θ̂_AML for each of them. The summary statistics are the minimum, the maximum and the quartiles of Y_1, ..., Y_M (Blum and François, 2010). To estimate the standard error of Θ̂_AML per dataset, we re-run the algorithm five times on each dataset. Each time, we randomly draw 100 starting values from the search space (0, 10) × (0, 10) × (0.05, 10) and run the AML algorithm on the best five starting values for 5000 iterations (runtime of an implementation in R, version 3.0.1, R Core Team (2013): 1.9 hours on a single core). Finally, we estimate the likelihood at the 5 results, and the best one is used as a realization of Θ̂_AML. Each likelihood estimation is based on n = 100 simulations. For 3 datasets, the AML algorithm is run 100 times to obtain kernel density estimates of the density of Θ̂_AML.
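A sketch of the M/G/1 simulator for the interdeparture times Y_m defined above, together with the summary statistics used here (a minimal sketch, not the authors' code):

```r
# Simulate the M/G/1 interdeparture times Y_1, ..., Y_M defined above:
# service times U_m ~ U([theta1, theta1 + theta2]), interarrival times
# W_m ~ Exp(theta3); only the Y_m are treated as observed.
simulate_mg1 <- function(M, theta) {
  U <- runif(M, theta[1], theta[1] + theta[2])  # service times
  W <- rexp(M, rate = theta[3])                 # interarrival times
  A <- cumsum(W)                                # arrival time of customer m
  Y <- numeric(M)
  D <- 0                                        # departure time of previous customer
  for (m in seq_len(M)) {
    Y[m] <- if (A[m] <= D) U[m] else U[m] + A[m] - D
    D <- D + Y[m]                               # departure time of customer m
  }
  Y
}

# Summary statistics used above: minimum, quartiles and maximum of Y.
mg1_summaries <- function(Y) quantile(Y, probs = c(0, 0.25, 0.5, 0.75, 1))
```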
Table 2: Properties of Θ̂_AML in the M/G/1 queue.

          true   space        mean    median   bias     mean se   total se
    θ_1   1      (0, 10)      0.942   0.948   -0.058    0.062     0.098
    θ_2   5      (0, 10)      5.031   4.891    0.031    0.437     1.046
    θ_3   0.2    (0.05, 10)   0.229   0.226    0.029    0.004     0.024

    true: true parameter value; space: search space; mean: mean of Θ̂_AML; median: median of Θ̂_AML; bias: bias of Θ̂_AML; mean se: mean standard error of Θ̂_AML per dataset; total se: standard error of Θ̂_AML.

[Accompanying figures (plots omitted): marginal densities of the components of Θ̂_AML, estimated by kernel density estimation. Black solid line: density of Θ̂_AML estimated over 100 simulated data sets. Coloured dotted and dashed lines: density of Θ̂_AML estimated separately in 3 example datasets. Vertical black line: true parameter value.]

In all three dimensions, the standard error of Θ̂_AML is considerably larger across datasets than within datasets (tab. 2). This indicates that the approximation algorithm only adds a small additional error to the error of Θ̂_ML. In all dimensions the density of Θ̂_AML is slightly asymmetric around the true value, but only the bias of θ̂_3 is of the same order as the standard error. It is probably caused by a strongly skewed likelihood function (Blum and François, 2010) that emphasises the error of the finite-differences approximation to the slope. Additional simulations of the likelihood on a grid between θ̂_3 and θ_3 locate the maximum around 0.209. Re-running the algorithm for K = 10000 iterations reduces the bias from 0.029 to 0.024. The parameter estimate with the largest standard error is θ̂_2, which is also the most difficult parameter to estimate with ABC (Blum and François, 2010).

4.3 Population genetics

The goal of population genetics is to understand how evolutionary forces like mutation, recombination, selection and genetic drift shape the variation within a population. A major challenge in the application of population genetic theory is to infer parameters of the evolutionary history of a population from DNA sequences of present-day samples only. The questions range from estimating single parameters like the mutation rate or the recombination rate to determining a complex demographic structure with multiple parameters.

Since the introduction of the coalescent by Kingman in the early eighties (Kingman, 1982a,b,c), a very elegant and flexible stochastic process is at hand to model the ancestral history of a population. The coalescent can capture population structure, varying population size and different mating schemes. Extensions to incorporate recombination and selection exist as well (Nordborg, 2007; Wakeley, 2008). However, statistical inference on data obtained from a coalescent process is difficult: under the coalescent, computing the likelihood of a parameter vector is a computationally very expensive task even for a single locus, since all possible ancestral trees need to be taken into account (Stephens, 2007). The number of binary tree topologies explodes with the number of sampled individuals n; e.g., with n = 10, the number of possible topologies is 2,571,912,000, and with n = 1000, it is already 3.01 · 10^4831 (Wakeley, 2008, p. 83, table 3.2). Consequently, exact computation of the likelihood is feasible in reasonable time only for a very restricted set of models and a small number of haplotypes (Griffiths and Tavaré, 1994; Wu, 2010).
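The counts quoted above match the number of labeled histories (ranked, labeled binary coalescent topologies), n!(n-1)!/2^(n-1); a quick log-scale check in R, under the assumption that this is the quantity tabulated in Wakeley (2008):

```r
# Number of labeled histories n!(n-1)!/2^(n-1), computed on the log10
# scale to avoid overflow (lfactorial gives log(n!)).
log10_histories <- function(n)
  (lfactorial(n) + lfactorial(n - 1) - (n - 1) * log(2)) / log(10)

10^log10_histories(10)   # 2,571,912,000, as quoted above
log10_histories(1000)    # ~4831.48, i.e. about 3.01e4831
```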
To make use of the large DNA data sets generated by modern sequencing technology, approximate methods have been developed in frequentist as well as in Bayesian frameworks. In fact, the need for inferential tools that allow for large genomic data sets was a major driver in the development of ABC. The focus of the early ABC papers was clearly on population genetic applications (Weiss and von Haeseler, 1998; Pritchard et al., 1999; Beaumont et al., 2002), and since then ABC has found a large variety of applications in the field of population genetics (Csilléry et al., 2010; Beaumont, 2010; Bertorelle et al., 2010).

4.3.1 Population genetic history of orang-utans

Pongo pygmaeus and Pongo abelii, Bornean and Sumatran orang-utans, respectively, are Asian great apes whose distributions are exclusive to the islands of Borneo and Sumatra. Recurring glacial periods that led to a cooler, drier and more seasonal climate might have contracted rain forest and isolated the populations of orang-utans. At the same time, the sea level dropped, and land bridges among islands created opportunities for migration among previously isolated populations. However, whether glacial periods have been an isolating or a connecting factor remains poorly understood. Therefore, there has been considerable interest in using genetic data to understand the demographic history despite the computational difficulties involved in such a population genetic analysis.

We will compare our results to the analysis of the orang-utan genome paper (Locke et al., 2011) and a more comprehensive study by Ma et al. (2013), the only analyses that have been performed genome-wide. Both studies use DaDi (Gutenkunst et al., 2009), a state-of-the-art software package that has been widely used for demographic inference. It is based on the diffusion approximation to the coalescent. The orang-utan SNP data consisting of 4-fold degenerate (synonymous) sites are taken from the 10 sequenced individuals (5 individuals per orang-utan population, haploid sample size 10, Locke et al., 2011).

As in Locke et al. (2011) and Ma et al. (2013), we consider an Isolation-Migration (IM) model where a panmictic ancestral population of effective size N_A splits τ years ago into two distinct populations of constant effective sizes N_B (the Bornean population) and N_S (the Sumatran population), with backward migration rates µ_BS (fraction of the Bornean population that is replaced by Sumatran migrants per generation) and µ_SB (vice versa; fig. 4a).

[Figure 4 (diagram omitted): (a) Isolation-migration model for the ancestral history of orang-utans. N_A, effective size of the ancestral population; µ_BS (µ_SB), fraction of the Bornean (Sumatran) population that is replaced by Sumatran (Bornean) migrants per generation (backwards migration rate); τ, split time in years; N_B (N_S), effective population size in Borneo (Sumatra). (b) Binned joint site frequency spectrum (adapted from Naduvilezhath et al., 2011).]

N_A is set to the present effective population size that we obtain using the number of SNPs in our considered data set and assuming an average per-generation mutation rate per nucleotide of 2 · 10^-8 and a generation time of 20 years (Locke et al., 2011), so N_A = 17400.
There are no sufficient summary statistics at hand, but for the IM model the joint site frequency spectrum (JSFS) between the two populations has been reported to be a particularly informative summary statistic (Tellier et al., 2011). However, for N samples in each of the two demes, the JSFS has (N + 1)² - 2 entries, so even for small datasets it is very high-dimensional. To reduce this to more manageable levels, we follow Naduvilezhath et al. (2011) and bin categories of entries (fig. 4b). As the ancestral state is unknown, we use the folded binned JSFS, which has 28 entries. To incorporate observations from multiple unlinked loci, mean and standard deviation across loci are computed for each bin, so the final summary statistics vector is of length 56 (a sketch follows below).

The AML algorithm with the JSFS as summary statistics was implemented as an extra module in the msms package, a coalescent simulation program (Ewing and Hermisson, 2010). This allows for complex demographic models and permits fast summary statistic evaluation without using external programs and the associated performance penalties.
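The construction of the summary vector can be sketched as follows; the concrete 28-bin folding scheme of Naduvilezhath et al. (2011) is not reproduced here, so bin_jsfs() is a hypothetical placeholder for it.

```r
# Sketch of the JSFS-based summary statistics. bin_jsfs() is a hypothetical
# placeholder for folding and binning one locus' joint SFS into 28 classes
# (Naduvilezhath et al., 2011); the scheme itself is not reproduced here.
jsfs_summaries <- function(jsfs_list, bin_jsfs) {
  B <- sapply(jsfs_list, bin_jsfs)       # 28 x (number of loci) matrix
  c(apply(B, 1, mean), apply(B, 1, sd))  # mean and sd per bin: length 56
}
```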
4.3.2 Simulations

Before applying the AML algorithm to the actual orang-utan DNA sequences, we tested it on simulated data. Under the described IM model with parameters N_B = N_S = 17400, µ_BS = µ_SB = 1.44 · 10^-5 and τ = 695000, we simulated 25 datasets with 25 haploid sequences per deme, each of them consisting of 75 loci with 130 SNPs each. For each dataset, 25 AML estimates were obtained with the same scheme: 1000 random starting points were drawn from the parameter space, and the likelihood was estimated with n = 40 simulations. Then, the 5 values with the highest likelihood estimates were used as starting points for the AML algorithm. The algorithm converged after 3000-25000 iterations (average: ≈ 8000 iterations; average runtime of msms: 11.7 hours on a single core).

For the migration rates µ_SB and µ_BS, the per-dataset variation of the estimates is considerably smaller than the total variation (tab. 3). This suggests that the approximation error of the AML algorithm is small in comparison to the error of Θ̂_ML. For the split time τ and the population sizes N_S and N_B, the difference is less pronounced, but still apparent. For all parameters, the average bias of the estimates is either smaller than or of approximately the same size as the standard error. As the maximum likelihood estimate itself cannot be computed, it is impossible to disentangle the bias of Θ̂_ML and an additional bias introduced by the AML algorithm. As an alternative measure of performance, we compare the likelihood estimated at the true parameter values and at the AML estimate with the highest estimated likelihood on each dataset. Only in 2 of the 25 datasets is the likelihood at the true value higher, but not significantly so, whereas it is significantly lower in 12 of them (significance was tested with a one-sided Welch test using a Bonferroni correction to account for multiple testing). This suggests that the AML algorithm usually produces estimates that are closer to the maximum likelihood estimate than the true parameter value is.

To investigate the impact of the underlying parameter value on the quality of the estimates, simulation results were also obtained for 25 datasets simulated with τ twice as large. Here, all parameter estimates, especially τ, N_B and N_S, had larger standard errors and biases (tab. 4). Apparently, the estimation problem is more difficult for more distant split times. This can be caused by a flatter likelihood surface and by stronger random noise in the simulations. Only the migration rates are hardly affected by the large τ: a longer divergence time allows for more migration events, which might facilitate their analysis.

Table 3: Properties of Θ̂_AML in the IM model with short divergence time (τ = 695000 years).

            true       space                  mean      median    bias       mean se   total se
    µ_BS    1.44e-05   (1.44e-06, 0.000144)   1.3e-05   1.21e-05  -1.43e-06  2e-06     4.48e-06
    µ_SB    1.44e-05   (1.44e-06, 0.000144)   1.43e-05  1.39e-05  -7.2e-08   2.19e-06  3.77e-06
    τ       695000     (139000, 6950000)      1020000   946000    330000     305000    389000
    N_S     17400      (1740, 174000)         21700     21000     4370       3070      4100
    N_B     17400      (1740, 174000)         22400     21800     5070       3160      4550

    Summaries as in tab. 2; accompanying marginal density figures omitted.

Table 4: Properties of Θ̂_AML in the IM model with long divergence time (τ = 1390000 years).

            true       space                  mean      median    bias       mean se   total se
    µ_BS    1.44e-05   (1.44e-06, 0.000144)   1.24e-05  1.12e-05  -1.97e-06  3.38e-06  4.99e-06
    µ_SB    1.44e-05   (1.44e-06, 0.000144)   1.25e-05  1.17e-05  -1.88e-06  3.28e-06  4.38e-06
    τ       1390000    (139000, 6950000)      2360000   2280000   967000     876000    1e+06
    N_S     17400      (1740, 174000)         25700     24600     8350       6620      7560
    N_B     17400      (1740, 174000)         26000     25200     8670       6680      7870

    Summaries as in tab. 2; accompanying marginal density figures omitted.

4.3.3 Real data

We model the ancestral history of orang-utans with the same model and use the same summary statistics as tested in the simulations. The datasets are also simulated in the same manner. To study the distribution of Θ̂_AML on this dataset, the algorithm is run 20 times with the same scheme as in the simulations, so we can estimate the standard error of Θ̂_AML in this single dataset. The best run is used as the final parameter estimate Θ̂_AML and is used for bootstrapping. Confidence intervals are obtained by parametric bootstrap with B = 1000 bootstrap datasets. The bootstrap replicates are also used for bias correction and estimation of the standard error (tab. 5).

In Locke et al. (2011) and Ma et al. (2013), parameters of two different IM models are estimated; denote the estimates Θ̂_1 and Θ̂_2. Scaled to the ancestral population size N_A = 17372, the estimates are shown in tab. 6. Model 1 is identical to the model considered here, so we simulate the likelihood at Θ̂_1 within our framework for comparison. Since log L̂(Θ̂_1) = -217.015 (se = 7.739) is significantly lower than log L̂(Θ̂_AML) = -162.732 (se = 7.258), it seems that Θ̂_AML is closer to the maximum likelihood estimate than the competing estimate. Note, however, that we are only using a subset of the data to avoid sites under selection, and that the authors report convergence problems of DaDi in this model. For model 2, the ancestral population splits into two subpopulations of size s and 1 - s relative to the ancestral population, and the subpopulations experience exponential growth. Here a direct likelihood comparison with the DaDi estimates given in Locke et al. (2011) is impossible. However, a rough comparison shows that the AML estimates for τ, N_B and N_S lie between Θ̂_1 and Θ̂_2, and for µ_BS and µ_SB they are of similar size.

Table 5: Parameter estimates for the ancestral history of orang-utans.

               µ_BS        µ_SB        τ          N_S      N_B
    Θ̂_AML      5.023e-06   4.600e-06   1300681    52998    21971
    Θ̂*_AML     4.277e-06   3.806e-06   1402715    52715    22233
    se*        1.244e-06   9.366e-07   208391     7223     2779
    se         1.992e-07   1.066e-07   194868     6083     2963
    lower      0 (1)       1.633e-07   715590     31476    13290
    upper      6.627e-06   4.931e-06   1820852    67118    27426

    (1) This confidence interval was shifted to the positive range; the original value was (-9.09e-07, 5.72e-06).

    Θ̂_AML, approximate maximum likelihood estimate; Θ̂*_AML, bootstrap bias-corrected estimate; se*, bootstrap standard error of Θ̂_AML; se, standard error of Θ̂_AML in this dataset, estimated from 20 replicates of Θ̂_AML; lower and upper, limits of the 95% simultaneous bootstrap confidence intervals. All bootstrap results were obtained with B = 1000 bootstrap replicates. The simultaneous 95% confidence intervals are computed following a simple Bonferroni argument, so the coverage probabilities are 99% in each dimension (Davison and Hinkley, 1997).

Table 6: Results from Locke et al. (2011, Tab. S21-1) (the same results for Model 2 are reported in Ma et al. (2013)), scaled with N_e = 17400.

               µ_BS        µ_SB        τ          N_S      N_B
    Model 1    9.085e-07   7.853e-07   6948778    129889   50934
    Model 2    1.518e-05   2.269e-05   630931     35976    10093
    Θ̂_AML      5.023e-06   4.600e-06   1300681    52998    21971

    Model 1: IM model in figure 4a. Model 2: IM model where the ancestral population splits into two subpopulations, with a ratio of s = 0.503 (estimated) going to Borneo and 1 - s to Sumatra, and exponential growth in both subpopulations (Locke et al., 2011, Fig. S21-3). Here, N_B and N_S are the present population sizes.

4.3.4 Orang-utan data set

This real data set is based on the data sets of two publications, De Maio et al. (2013) and Ma et al. (2013). For the first, CCDS alignments of H. sapiens, P. troglodytes and P. abelii (references hg18, panTro2 and ponAbe2) were downloaded from the UCSC genome browser (http://genome.ucsc.edu). Only CCDS alignments satisfying the following requirements were retained for the subsequent analyses: divergence from the human reference below 10%, no gene duplication in any species, start and stop codons conserved, no frame-shifting gaps, no gap longer than 30 bases, no nonsense codon, no gene shorter than 21 bases, no gene with a different number of exons in different species, and no genes on different chromosomes in different species (chromosomes 2a and 2b in non-humans were identified with human chromosome 2). From the remaining CCDSs (9,695 genes, 79,677 exons) we extracted synonymous sites. We only considered third codon positions where the first two nucleotides of the same codon were conserved in the alignment, as well as the first position of the next codon.

Furthermore, orang-utan SNP data for the two (Bornean and Sumatran) populations considered, each with 5 sequenced individuals (Locke et al., 2011), were kindly provided by X. Ma and are available online (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=contact&handle=WUGSC_SNP&batch_id=1054968). The final total number of synonymous sites included was 1,950,006. Among them, a subset of 9750 4-fold degenerate synonymous sites that are polymorphic in the orang-utan populations was selected.
5 Discussion

In this article, we propose an algorithm to approximate the maximum likelihood estimator in models with an intractable likelihood. Simulations and kernel density estimates of the likelihood are used to obtain the ascent directions in a stochastic approximation algorithm, so it is flexibly applicable to a wide variety of models.

Alternative simulation-based approximate maximum likelihood methods have been proposed that estimate the likelihood surface from samples from the whole parameter space, obtained either in an ABC-like fashion (Creel and Kristensen, 2013; Rubio and Johansen, 2013) or using MCMC (de Valpine, 2004). The maximum likelihood estimator is obtained subsequently by standard numerical optimization. Conversely, our method only uses simulations and estimates of the likelihood along the trajectory of the stochastic approximation algorithm, thereby approximating the maximum likelihood estimator more efficiently. This is along the lines of Diggle and Gratton (1984), where a stochastic version of the Nelder-Mead algorithm is used, but through the use of summary statistics we can drop the restrictive i.i.d. assumption. In the setting of hidden Markov models, Ehrlich et al. (2013) have proposed a recursive maximum likelihood algorithm that also combines ABC methodology with the simultaneous perturbations algorithm. A thorough comparison of the properties of different approximate maximum likelihood methods is beyond the scope of this paper, but the three presented examples show that the algorithm provides fast and reliable approximations to the corresponding maximum likelihood estimators.

The population genetics application, where we estimate parameters of the evolutionary history of orang-utans, demonstrates that very high-dimensional summary statistics (here: 56 dimensions) can be used successfully without any dimension-reduction techniques. Usually, high-dimensional kernel density estimation is not recommended because of the curse of dimensionality (e.g., Wand and Jones, 1995), but stochastic approximation algorithms are explicitly designed to cope with noisy measurements. To this end, we also introduce modifications of the algorithm that reduce the impact of single noisy likelihood estimates. In our experience, this is crucial in settings with a low signal-to-noise ratio.

Furthermore, the examples show that the AML algorithm performs well in problems with a high-dimensional and large parameter space: in the normal distribution example, the 10-dimensional maximum likelihood estimate is approximated very precisely even though the search space spans 200 times the standard error of Θ̂_ML in each dimension.

However, we also observe a bias for a few of the estimated parameters. Partly, for example for one of the parameters of the queuing process, this can be attributed to a bias of the maximum likelihood estimator itself. In addition, it is known that the finite-differences approximation to the gradient in the Kiefer-Wolfowitz algorithm causes a bias that vanishes only asymptotically (Spall, 2003), and this is possibly increased by the finite-sample bias of the kernel density estimator. In most cases, though, the bias is smaller than the standard error of the approximate maximum likelihood estimator, and it can be made still smaller by carrying out longer runs of the algorithm.
As sufficiently informative summary statistics are crucial for the reliability of both maximum likelihood and Bayesian estimates, the quality of the estimates obtained from our AML algorithm will also depend on an appropriate choice of summary statistics. This has been discussed extensively in the context of ABC (Fearnhead and Prangle, 2012; Blum et al., 2013). General results and algorithms to choose a small set of informative summary statistics should carry over to the AML algorithm.

In addition to the point estimate, we suggest obtaining confidence intervals by parametric bootstrap. The bootstrap replicates can also be used for bias correction. Resampling in models where the data have a complex internal structure captures both the noise of the maximum likelihood estimator as well as the approximation error. Alternatively, the AML algorithm may also complement the information obtained via ABC in a Bayesian framework: the location of the maximum a posteriori estimate can be obtained from the AML algorithm.

The presented work shows the broad applicability of the approximate maximum likelihood algorithm and also its robustness in highly stochastic settings. With the implementation in the coalescent simulation program msms (Ewing and Hermisson, 2010), its potential for population genetic applications can easily be explored further.

Acknowledgments

Johanna Bertl was supported by the Vienna Graduate School of Population Genetics (Austrian Science Fund (FWF): W1225-B20) and worked on this project while employed at the Department of Statistics and Operations Research, University of Vienna, Austria. The computational results presented have partly been achieved using the Vienna Scientific Cluster (VSC). The orang-utan SNP data was kindly provided by X. Ma. Parts of this article have been published in the PhD thesis of Johanna Bertl at the University of Vienna.

References

M. A. Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution and Systematics, 41:379-406, 2010.

M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162:2025-2035, 2002.

M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive approximate Bayesian computation. Biometrika, 2009.

G. Bertorelle, A. Benazzo, and S. Mona. ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Molecular Ecology, 19(13):2609-2625, 2010.

M. G. B. Blum and O. François. Non-linear regression models for approximate Bayesian computation. Statistics and Computing, 20:63-73, 2010.

M. G. B. Blum, M. A. Nunes, D. Prangle, and S. A. Sisson. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189-208, 2013.

M. Creel and D. Kristensen. Indirect likelihood inference (revised). UFAE and IAE Working Papers 931.13, Unitat de Fonaments de l'Anàlisi Econòmica (UAB) and Institut d'Anàlisi Econòmica (CSIC), June 2013. URL http://ideas.repec.org/p/aub/autbar/931.13.html.

K. Csilléry, M. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian computation (ABC) in practice. Trends in Ecology and Evolution, 25:410-418, 2010.

A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge, 1997.

N. De Maio, C. Schlötterer, and C. Kosiol. Linking great apes genome evolution across time scales using polymorphism-aware phylogenetic models. Molecular Biology and Evolution, 30(10):2249-2262, 2013.
P. de Valpine. Monte Carlo state-space likelihoods by weighted posterior kernel density estimation. Journal of the American Statistical Association, 99(466):523-536, 2004.

P. J. Diggle and R. J. Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society, Series B (Methodological), 46(2):193-227, 1984.

C. C. Drovandi, A. N. Pettitt, and M. J. Faddy. Approximate Bayesian computation using indirect inference. Journal of the Royal Statistical Society, Series C (Applied Statistics), 60(3):317-337, 2011.

E. Ehrlich, A. Jasra, and N. Kantas. Gradient free parameter estimation for hidden Markov models with intractable likelihoods. Methodology and Computing in Applied Probability, pages 1-35, 2013.

G. Ewing and J. Hermisson. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 26(16):2064-2065, 2010.

P. Fearnhead and D. Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 74:419-474, 2012.

J.-D. Fermanian and B. Salanié. A nonparametric simulated maximum likelihood estimation method. Econometric Theory, 20(4):701-734, 2004.

R. C. Griffiths and S. Tavaré. Ancestral inference in population genetics. Statistical Science, 9(3):307-319, 1994.

D. Gross, J. J. Shortle, J. M. Thompson, and C. M. Harris. Fundamentals of Queueing Theory. Wiley, 4th edition, 2008.

R. N. Gutenkunst, R. D. Hernandez, S. H. Williamson, and C. D. Bustamante. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics, 5(10), 2009.

W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer, 2004.

K. Heggland and A. Frigessi. Estimating functions in indirect inference. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 66(2):447-462, 2004.

J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235-248, 1982a.

J. F. C. Kingman. Exchangeability and the evolution of large populations. In G. Koch and F. Spizzichino, editors, Exchangeability in Probability and Statistics, pages 97-112. North-Holland, Amsterdam, 1982b.

J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27-43, 1982c.

J. F. C. Kingman. The first Erlang century - and the next. Queueing Systems, 63:3-12, 2009.

D. P. Locke, L. W. Hillier, W. C. Warren, K. C. Worley, L. V. Nazareth, D. M. Muzny, S.-P. Yang, Z. Wang, A. T. Chinwalla, P. Minx, et al. Comparative and demographic analysis of orang-utan genomes. Nature, 469:529-533, 2011.

X. Ma, J. L. Kelly, K. Eilertson, S. Musharoff, J. D. Degenhardt, A. L. Martins, T. Vinar, C. Kosiol, A. Siepel, R. N. Gutenkunst, and C. D. Bustamante. Population genomic analysis reveals a rich speciation and demographic history of orang-utans (Pongo pygmaeus and Pongo abelii). PLOS ONE, 8, 2013.

P. Marjoram and S. Tavaré. Modern computational approaches for analysing molecular genetic variation data. Nature Reviews Genetics, 7:759-770, 2006.

T. McKinley, A. R. Cook, and R. Deardon. Inference in epidemic models without likelihoods. The International Journal of Biostatistics, 5, 2009.

L. N. Naduvilezhath, L. E. Rose, and D. Metzler. Jaatha: a fast composite-likelihood approach to estimate demographic parameters. Molecular Ecology, 20:2709-2723, 2011.
M. Nordborg. Coalescent theory. In D. J. Balding, M. Bishop, and C. Cannings, editors, Handbook of Statistical Genetics, volume 2, pages 843-877. John Wiley & Sons, third edition, 2007.

J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791-1798, 1999.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.

F. J. Rubio and A. M. Johansen. A simple approach to maximum intractable likelihood estimation. Electronic Journal of Statistics, 7:1632-1654, 2013.

P. Sadegh. Constrained optimization via stochastic approximation with a simultaneous perturbation gradient approximation. Automatica, 33(5):889-892, 1997.

B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.

S. Soubeyrand, F. Carpentier, N. Desassis, and J. Chadœuf. Inference with a contrast-based posterior distribution and application in spatial statistics. Statistical Methodology, 6(5):466-477, 2009.

J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):352-355, 1992.

J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Wiley, 2003.

M. Stephens. Inference under the coalescent. In D. J. Balding, M. Bishop, and C. Cannings, editors, Handbook of Statistical Genetics, volume 2, pages 878-908. John Wiley & Sons, third edition, 2007.

A. Tellier, P. Pfaffelhuber, B. Haubold, L. Naduvilezhath, L. E. Rose, T. Städler, W. Stephan, and D. Metzler. Estimating parameters of speciation models based on refined summaries of the joint site-frequency spectrum. PLOS ONE, 6(5), 2011.

T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6:187-202, 2009.

J. Wakeley. Coalescent Theory. Roberts & Company Publishers, 2008.

M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall, 1995.

D. Wegmann, C. Leuenberger, and L. Excoffier. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207-1218, 2009.

G. Weiss and A. von Haeseler. Inference of population history using a likelihood approach. Genetics, 149(3):1539-1546, 1998.

S. Wellek. Testing Statistical Hypotheses of Equivalence and Noninferiority. CRC Press, Taylor & Francis, 2010.

Y. Wu. Exact computation of coalescent likelihood for panmictic and subdivided populations under the infinite sites model. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(4):611-618, 2010.