Abstract
The increasing availability of population-level allele frequency data across one or more related populations necessitates the development of methods that can efficiently estimate population genetics parameters, such as the strength of selection acting on the population(s), from such data. Existing methods for this problem in the setting of the Wright-Fisher diffusion model are primarily likelihood-based, and rely on numerical approximation for likelihood computation and on bootstrapping for assessment of variability in the resulting estimates, requiring extensive computation. Recent work has provided a method for obtaining exact samples from general Wright-Fisher diffusion processes, enabling the development of methods for Bayesian estimation in this setting. We develop and implement a Bayesian method for estimating the strength of selection based on the Wright-Fisher diffusion for data sampled at a single time point. The method utilizes the latest algorithms for exact sampling to devise a Markov chain Monte Carlo procedure to draw samples from the joint posterior distribution of the selection coefficient and the allele frequencies. We demonstrate that when assumptions about the initial allele frequencies are accurate the method performs well for both simulated data and for an empirical data set on hypoxia in flies, where we find evidence for strong positive selection in a region of chromosome 2L previously identified. We discuss possible extensions of our method to the more general settings commonly encountered in practice, highlighting the advantages of Bayesian approaches to inference in this setting.
Funding source: Division of Mathematical Sciences
Award Identifier / Grant number: 1209142
Award Identifier / Grant number: 1407604
Award Identifier / Grant number: 1106706
Funding statement: This work was supported by National Science Foundation grants DMS-1209142 and DMS-1407604 (R.H.) and Funder Id: 10.13039/100000121, DMS-1106706 (L.S.K.).
Acknowledgement
We thank Joshua Schraiber and an anonymous reviewer for helpful comments on an earlier draft.
Appendix
Exact simulation of the neutral Wright-Fisher diffusion
Consider a neutral WF diffusion process Qt
where the drift function is β0(x) = (1/2)(θ1(1 − x) − θ2x), for 0 ≤ x ≤ 1 and θ1, θ2 > 0 (note the absence of the selection parameter). The necessary details for simulating a variate QT from the model (11) for a fixed T > 0, given that Q0 = q0 ∈ (0, 1) are given in Jenkins and Spanó (2015) and we briefly review the main ideas. It is well-known (Griffiths, 1979; Tavaré, 1984; Ethier & Griffiths, 1993; Griffiths & Spanò, 2013) that, conditional on Q0 = q0, the variate QT has a probability density function given by
where θ = θ1 + θ2, and ℬin(⋅) and ℬeta(⋅) are the Binomial and Beta density functions, given by
The mixture weights
where
with the convention that a(x) ≡ Γ(a + x)/Γ(a) for a > 0 and x ≥ −1. Sampling from the density (12) is achieved via Algorithm 2 described below (see Jenkins & Spanó, 2015).
Algorithm 2. Exact Sampling for the neutral WF diffusion:
1: Draw 𝖬 from the discrete distribution
2: Draw
3: Draw
4: Return 𝖰.
The computational burden in Algorithm 2 is in Step 1, since the weights
Exact simulation of general WF-diffusion models
As we mention above, our approach relies on being able to obtain exact draws from a general Wright-Fisher diffusion (1). We briefly review the general setup for sampling an SDE. Assume that {Xt, 0 ≤ t ≤ T} is a stochastic process described via
where the functions β(⋅) and σ(⋅) are assumed to be smooth enough such that (14) has a unique weak solution. Such conditions can be found in Karatzas and Shreve (1998), for example. Our goal is to simulate exact draws from the distribution of Xt, for some fixed t > 0, conditional on X0 = x0. In general, the transition density for the process Xt is not available in closed form, not even in an infinite series representation as is the case with the neutral Wright-Fisher diffusion. Recently, in a series of papers (Beskos & Roberts, 2005; Beskos, Papaspiliopoulos & Roberts 2006b; 2008), the authors have developed a rejection sampling approach, which yields an exact (distribution-wise) skeleton of the full path (Xt, 0 ≤ t ≤ T) for a very large class of diffusions. Briefly, let Ω = 𝒞([0, T]) and let ω be a typical element of Ω. Let ℚ be the probability measure induced by {Xt, 0 ≤ t ≤ T} onto Ω and ℤ be another probability measure on Ω which is user-selected in an appropriate way. Under regularity conditions (see Beskos & Roberts, 2005) one can use the Girsanov formula to derive and expression for the Radon-Nikodym derivative
where, in principle, one can arrange that 𝖦(⋅) ∈ (0, 1). Thus, a rejection sampling strategy will be appropriate:
Algorithm 3. Exact sampling for general diffusions
1: Sample ω ∼ ℤ;
2: Accept ω as a draw from ℚ with probability 𝖦(ω) ∈ (0, 1).
The proposal distribution ℤ has to be selected in a way that Step 1 of Algorithm 3 can be done efficiently, i.e. a biased Brownian Bridge measure. We refer the reader to Beskos and Roberts (2005) for a detailed description of all of the conditions and the general setup.
General Wright-Fisher diffusions. If the diffusion coefficient in (14) takes the form
then the SDE (14) is a general WF diffusion. As Jenkins and Spanó (2015) suggest, in this case, it is efficient to select the proposal measure ℤ to be the law induced by the neutral WF diffusion (11).
References
Aït-Sahalia, Y. (2002): “Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach,” Econometrica, 70, 223–262.10.1111/1468-0262.00274Search in Google Scholar
Aït-Sahalia, Y. (2008): “Closed-form likelihood expansions for multivariate diffusions,” Ann. Stat., 36, 906–937.10.3386/w8956Search in Google Scholar
Alachiotis, N., A. Stamatakis and P. Pavlidis (2012): “OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets,” Bioinformatics, 28, 2274–2275.10.1093/bioinformatics/bts419Search in Google Scholar PubMed
Beskos, A. and G. O. Roberts (2005): “Exact simulation of diffusions,” Ann. Appl. Probab., 15, 2422–2444.10.1214/105051605000000485Search in Google Scholar
Beskos, A., O. Papaspiliopoulos, G. O. Roberts and P. Fearnhead (2006a): “Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes,” J. R. Stat. Soc. Series B Stat. Methodol., 68, 333–382.10.1111/j.1467-9868.2006.00552.xSearch in Google Scholar
Beskos, A., O. Papaspiliopoulos and G. O. Roberts (2006b): “Retrospective exact simulation of diffusion sample paths with applications,” Bernoulli, 12, 1077–1098.10.3150/bj/1165269151Search in Google Scholar
Beskos, A., O. Papaspiliopoulos and G. O. Roberts (2008): “A factorisation of diffusion measure and finite sample path constructions,” Methodol. Comput. Appl. Probab., 10, 85–104.10.1007/s11009-007-9060-4Search in Google Scholar
Brandt, M. W. and P. Santa-Clara (2002): “Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets,” J. Financ. Econ., 63, 161–210.10.3386/t0274Search in Google Scholar
Burbrink, F. T. and T. J. Guiher (2015): “Considering gene flow when using coalescent methods to delimit lineages of North American pitvipers of the genus Agkistrodon,” Zool. J. Linn. Soc., 173, 505–526.10.1111/zoj.12211Search in Google Scholar
Bustamante, C. D., J. Wakeley, S. Sawyer and D. L. Hartl (2001): “Directional selection and the site-frequency spectrum,” Genetics, 159, 1779–1788.10.1093/genetics/159.4.1779Search in Google Scholar PubMed PubMed Central
Coffman, A. C., P. H. Hsieh, S. Gravel and R. N. Gutenkunst (2015): “Computationally efficient composite likelihood statistics for demographic inference,” Mol. Biol. Evol., 33, 591–593.10.1093/molbev/msv255Search in Google Scholar PubMed PubMed Central
Devroye, L. (1986): Non-uniform random variate generation, New York: Springer-Verlag.10.1007/978-1-4613-8643-8Search in Google Scholar
Durham, G. B. and A. R. Gallant (2002): “Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes,” J. Bus. Econ. Stat., 20, 297–316.10.1198/073500102288618397Search in Google Scholar
Eckert, A. J. and B. C. Carstens (2008): “Does gene flow destroy phylogenetic signal? The performance of three methods for estimating species phylogenies in the presence of gene flow,” Mol. Phylogenetics Evol., 49, 832–842.10.1016/j.ympev.2008.09.008Search in Google Scholar PubMed
Edwards, S. V. (2009): “Is a new and general theory of molecular systematics emerging?” Evolution, 63, 1–19.10.1111/j.1558-5646.2008.00549.xSearch in Google Scholar PubMed
Elerian, O., S. Chib, and N. Shephard (2001): “Likelihood inference for discretely observed nonlinear diffusions,” Econometrica, 69, 959–993.10.1111/1468-0262.00226Search in Google Scholar
Etheridge, A. (2011): Some mathematical models from population genetics: École D’Été de Probabilités de Saint-Flour XXXIX-2009, Vol. 39, Heidelberg: Springer Science & Business Media.10.1007/978-3-642-16632-7Search in Google Scholar
Ethier, S. N. and R. C. Griffiths (1993): “The transition function of a Fleming-Viot process,” Ann. Probab., 21, 1571–1590.10.1214/aop/1176989131Search in Google Scholar
Ferrer-Admetlla, A., C. Leuenberger, J. D. Jensen and D. Wegmann (2016): “An approximate Markov model for the Wright-Fisher diffusion and its application to time series data,” Genetics, 203, 831–846.10.1534/genetics.115.184598Search in Google Scholar PubMed PubMed Central
Fisher, R. A. (1930): The genetical theory of natural selection, Oxford: Clarendon Press.10.5962/bhl.title.27468Search in Google Scholar
Foll, M., H. Shim and J. D. Jensen (2015): “WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data,” Mol. Ecol. Resour., 15, 87–98.10.1111/1755-0998.12280Search in Google Scholar PubMed
Golightly, A. and D. J. Wilkinson (2008): “Bayesian inference for nonlinear multivariate diffusion models observed with error,” Comput. Stat. Data Anal., 52, 1674–1693.10.1016/j.csda.2007.05.019Search in Google Scholar
Griffiths, R. C. (1979): “A transition density expansion for a multi-allele diffusion model,” Adv. Appl. Probab., 11, 310–325.10.2307/1426842Search in Google Scholar
Griffiths, R. C. and D. Spanò (2013): “Orthogonal polynomial kernels and canonical correlations for Dirichlet measures,” Bernoulli, 19, 548–598.10.3150/11-BEJ403Search in Google Scholar
Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson and C. D. Bustamante (2009): “Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data,” PLoS Genet., 5, e10000695.10.1371/journal.pgen.1000695Search in Google Scholar PubMed PubMed Central
Gutenkunst, R., R. Hernandez, S. Williamson and C. Bustamante (2010): “Diffusion approximations for demographic inference: DaDi,” Nat. Prec., <http://hdl.handle.net/10101/npre.2010.4594.1>.10.1038/npre.2010.4594.1Search in Google Scholar
Hey, J. and R. Nielsen (2007): “Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics,” Proc. Natl. Acad. Sci. USA, 104, 2785–2790.10.1073/pnas.0611164104Search in Google Scholar PubMed PubMed Central
Jenkins, P. A. and D. Spanó (2015): “Exact simulation of the Wright-Fisher diffusion,” arXiv:1506.06998v1.10.1214/16-AAP1236Search in Google Scholar
Jewett, E. M., M. Steinrücken and Y. S. Song (2016): “The effects of population size histories on estimates of selection coefficients from time-series genetic data,” Mol. Biol. Evol., 33, 3002–3027.10.1093/molbev/msw173Search in Google Scholar PubMed PubMed Central
Karatzas, I. and S. Shreve (1998): Brownian motion and stochastic calculus, vol. 113, New York: Springer Science & Business Media.10.1007/978-1-4612-0949-2Search in Google Scholar
Kim, Y. and R. Nielsen (2004): “Linkage disequilibrium as a signature of selective sweeps,” Genetics, 167, 1513–1524.10.1534/genetics.103.025387Search in Google Scholar
Kingman, J. F. C. (1982a): “The coalescent,” Stoch. Process. Appl., 13, 235–248.10.1016/0304-4149(82)90011-4Search in Google Scholar
Kingman, J. F. C. (1982b): “Exchangeability and the evolution of large populations.” In: Koch, G. and F. Spizzichino, editors, Exchangeability in Probability and Statistics, North-Holland, Amsterdam, pp. 97–112.Search in Google Scholar
Kingman, J. F. C. (1982c): “On the genealogy of large populations,” J. Appl. Probab., 19A, 27–43.10.1017/S0021900200034446Search in Google Scholar
Kloeden, P. E. and E. Platen (1992): Numerical solution of stochastic differential equations (stochastic modelling and applied probability), Springer-Verlag.10.1007/978-3-662-12616-5Search in Google Scholar
Knowles, L. L. and L. S. Kubatko (2010): Estimating species trees: practical and theoretical aspects, Wiley-Blackwell.Search in Google Scholar
Leache, A. D., R. B. Harris, B. Rannala and Z. Yang (2014): “The influence of gene flow on species tree estimation: a simulation study,” Syst. Biol., 63, 17–30.10.1093/sysbio/syt049Search in Google Scholar PubMed
Li, H. and W. Stephan (2005): “Maximum-likelihood methods for detecting recent positive selection and localizing the selected site in the genome,” Genetics, 171, 377–384.10.1534/genetics.105.041368Search in Google Scholar PubMed PubMed Central
Lin, M., R. Chen and P. Mykland (2010): “On generating Monte Carlo samples of continuous diffusion bridges,” J. Am. Stat. Assoc., 105, 820–838.10.1198/jasa.2010.tm09057Search in Google Scholar
Lo, A. W. (1988): “Maximum likelihood estimation of generalized Itô processes with discretely sampled data,” Econ. Theor., 4, 231–247.10.3386/t0059Search in Google Scholar
Malaspinas, A.-S., O. Malaspinas, S. N. Evans and M. Slatkin (2012): “Estimating allele age and selection coefficient from time-serial data,” Genetics, 192, 599–607.10.1534/genetics.112.140939Search in Google Scholar PubMed PubMed Central
Nielsen, R. and J. Wakeley (2001): “Distinguishing migration from isolation: a Markov chain Monte Carlo approach,” Genetics, 158, 885–896.10.1093/genetics/158.2.885Search in Google Scholar PubMed PubMed Central
Nielsen, R., S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark and C. Bustamante (2005): “Genomic scans for selective sweeps using SNP data,” Genome Res., 15, 1566–1575.10.1101/gr.4252305Search in Google Scholar
Pavlidis, P., D. Živković, A. Stamatakis and N. Alachiotis (2013): “SweeD: likelihood-based detection of selective sweeps in thousands of genomes,” Mol. Biol. Evol., 30, 2224–2234.10.1093/molbev/mst112Search in Google Scholar
Pedersen, A. R. (1995): “A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations,” Scand. J. Stat., 22, 55–71.Search in Google Scholar
Robinson, J. D., A. J. Coffman, M. J. Hickerson and R. N. Gutenkunst (2014): “Sampling strategies for frequency spectrum-based population genomic inference,” BMC Evol. Biol., 14, 254.10.1186/s12862-014-0254-4Search in Google Scholar
Ronen, R., N. Upda, E. Halperin and V. Bafna (2013): “Learning natural selection from the site frequency spectrum,” Genetics, 195, 181–193.10.1007/978-3-642-37195-0_19Search in Google Scholar
Schraiber, J. G., S. N. Evans and M. Slatkin (2016): “Bayesian inference of natural selection from allele frequency time series,” Genetics, 203, 493–511.10.1534/genetics.116.187278Search in Google Scholar
Scott, D. W. (1992): Multivariate density estimation: theory, practice, and visualization, John Wiley.10.1002/9780470316849Search in Google Scholar
Song, Y. S. and M. Steinrücken (2012): “A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection,” Genetics, 190, 1117–1129.10.1534/genetics.111.136929Search in Google Scholar
Steinrücken, M., Y. X. R. Wang and Y. S. Song (2013): “An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection,” Theor. Popul. Biol., 83, 1–14.10.1016/j.tpb.2012.10.006Search in Google Scholar
Sun, L., C. Lee and J. A. Hoeting (2015): “Parameter inference and model selection in deterministic and stochastic dynamical models via approximate bayesian computation: modeling a wildlife epidemic,” Environmetrics, 26, 451–462.10.1002/env.2353Search in Google Scholar
Tavaré, S. (1984): “Line-of-descent and genealogical processes, and their applications in population genetics models,” Theor. Popul. Biol., 26, 119–164.10.1016/0040-5809(84)90027-3Search in Google Scholar
Wakeley, J. (2009): Coalescent theory: an introduction, Roberts and Company.Search in Google Scholar
Wakeley, J. and J. Hey (1997): “Estimating ancestral population parameters,” Genetics, 145, 847–855.10.1093/genetics/145.3.847Search in Google Scholar PubMed PubMed Central
Williamson, S. H., R. Hernandez, A. Fledel-Alon, L. Zhu, R. Nielsen and C. D. Bustamante (2005): Simultaneous inference of selection and population growth from patterns of variation in the human genome,” Proc. Natl. Acad. Sci., 102, 7882–7887.10.1073/pnas.0502300102Search in Google Scholar PubMed PubMed Central
Wright, S. (1931): “Evolution in Mendelian populations,” Genetics, 16, 97–159.10.1093/genetics/16.2.97Search in Google Scholar PubMed PubMed Central
Wutke, S., N. Benecke, E. Sandoval-Castellanos, H.-J. Dohle, S. Friederich, J. Gonzalez, J. H. Hallsson, M. Hofreiter, L. Lougas, O. Magnell, A. Morales-Muniz, L. Orlando, A. H. Palsdottir, M. Reissmann, M. Ruttkay, A. Trinks and A. Ludwig (2016): Spotted phenotypes in horses lost attractiveness in the Middle Ages,” Sci. Rep., 6, 38548.10.1038/srep38548Search in Google Scholar PubMed PubMed Central
Zhou, D., N. Upda, M. Gersten, D. W. Visk, A. Bashir, J. Xue, K. A. Frazer, J. W. Posakony, S. Subramaniam, V. Bafna and G. G. Haddad (2011): Experimental selection of hypoxia-tolerant Drosophila melanogaster,” Proc. Natl. Acad. Sci., 108, 2349–2354.10.1073/pnas.1010643108Search in Google Scholar PubMed PubMed Central
Živković, D., M. Steinrücken, Y. S. Song and W. Stephan (2015): “Transition densities and sample frequency spectra of diffusion processes with selection and variable population size,” Genetics, 200, 601–617.10.1534/genetics.115.175265Search in Google Scholar PubMed PubMed Central
©2018 Walter de Gruyter GmbH, Berlin/Boston