Skip to content
Licensed Unlicensed Requires Authentication Published by De Gruyter June 6, 2018

Bayesian inference of selection in the Wright-Fisher diffusion model

  • Jeffrey J. Gory , Radu Herbei and Laura S. Kubatko EMAIL logo

Abstract

The increasing availability of population-level allele frequency data across one or more related populations necessitates the development of methods that can efficiently estimate population genetics parameters, such as the strength of selection acting on the population(s), from such data. Existing methods for this problem in the setting of the Wright-Fisher diffusion model are primarily likelihood-based, and rely on numerical approximation for likelihood computation and on bootstrapping for assessment of variability in the resulting estimates, requiring extensive computation. Recent work has provided a method for obtaining exact samples from general Wright-Fisher diffusion processes, enabling the development of methods for Bayesian estimation in this setting. We develop and implement a Bayesian method for estimating the strength of selection based on the Wright-Fisher diffusion for data sampled at a single time point. The method utilizes the latest algorithms for exact sampling to devise a Markov chain Monte Carlo procedure to draw samples from the joint posterior distribution of the selection coefficient and the allele frequencies. We demonstrate that when assumptions about the initial allele frequencies are accurate the method performs well for both simulated data and for an empirical data set on hypoxia in flies, where we find evidence for strong positive selection in a region of chromosome 2L previously identified. We discuss possible extensions of our method to the more general settings commonly encountered in practice, highlighting the advantages of Bayesian approaches to inference in this setting.

Award Identifier / Grant number: 1209142

Award Identifier / Grant number: 1407604

Award Identifier / Grant number: 1106706

Funding statement: This work was supported by National Science Foundation grants DMS-1209142 and DMS-1407604 (R.H.) and Funder Id: 10.13039/100000121, DMS-1106706 (L.S.K.).

Acknowledgement

We thank Joshua Schraiber and an anonymous reviewer for helpful comments on an earlier draft.

Appendix

Exact simulation of the neutral Wright-Fisher diffusion

Consider a neutral WF diffusion process Qt

(11){dQt=β0(Qt)dt+Qt(1Qt)dBt,Q0=q0,0tT

where the drift function is β0(x) = (1/2)(θ1(1 − x) − θ2x), for 0 ≤ x ≤ 1 and θ1, θ2 > 0 (note the absence of the selection parameter). The necessary details for simulating a variate QT from the model (11) for a fixed T > 0, given that Q0 = q0 ∈ (0, 1) are given in Jenkins and Spanó (2015) and we briefly review the main ideas. It is well-known (Griffiths, 1979; Tavaré, 1984; Ethier & Griffiths, 1993; Griffiths & Spanò, 2013) that, conditional on Q0 = q0, the variate QT has a probability density function given by

(12)f(x|q0)=m=0qmθ(T)l=0mBin(l;m,q0)Beta(x;θ1+l,θ2+ml)

where θ = θ1 + θ2, and in(⋅) and eta(⋅) are the Binomial and Beta density functions, given by

Bin(l;m,q0)=(ml)q0l(1q0)mlBeta(x;θ1,θ2)xθ11(1x)θ21.

The mixture weights qmθ(T) in (12) have an alternating series expansion

(13)qmθ(T)=k=m(1)kmakmθek(k+θ1)T/2

where

akmθ=(θ+2k1)(θ+m)(k1)m!(km)!

with the convention that a(x) ≡ Γ(a + x)/Γ(a) for a > 0 and x ≥ −1. Sampling from the density (12) is achieved via Algorithm 2 described below (see Jenkins & Spanó, 2015).

Algorithm 2.

Algorithm 2. Exact Sampling for the neutral WF diffusion:

1: Draw 𝖬 from the discrete distribution {qmθ(T),m=0,1,};

2: Draw LBin(M,q0);

3: Draw QBeta(θ1+L,θ2+ML);

4: Return 𝖰.

The computational burden in Algorithm 2 is in Step 1, since the weights qmθ(T) do not have a closed form expression and (13) suggests an infinite amount of computation. However, given the alternating series representation in (13), one can use ideas described in Chapter 5 of Devroye (1986) to devise an exact procedure for simulating the variate 𝖬, which only requires finite computing time. Details of such a procedure are provided in Section 3.2 of Jenkins and Spanó (2015).

Exact simulation of general WF-diffusion models

As we mention above, our approach relies on being able to obtain exact draws from a general Wright-Fisher diffusion (1). We briefly review the general setup for sampling an SDE. Assume that {Xt, 0 ≤ tT} is a stochastic process described via

(14)dXt=β(Xt)dt+σ(Xt)dBt,X(0)=x0,0tT

where the functions β(⋅) and σ(⋅) are assumed to be smooth enough such that (14) has a unique weak solution. Such conditions can be found in Karatzas and Shreve (1998), for example. Our goal is to simulate exact draws from the distribution of Xt, for some fixed t > 0, conditional on X0 = x0. In general, the transition density for the process Xt is not available in closed form, not even in an infinite series representation as is the case with the neutral Wright-Fisher diffusion. Recently, in a series of papers (Beskos & Roberts, 2005; Beskos, Papaspiliopoulos & Roberts 2006b; 2008), the authors have developed a rejection sampling approach, which yields an exact (distribution-wise) skeleton of the full path (Xt, 0 ≤ tT) for a very large class of diffusions. Briefly, let Ω = 𝒞([0, T]) and let ω be a typical element of Ω. Let ℚ be the probability measure induced by {Xt, 0 ≤ tT} onto Ω and ℤ be another probability measure on Ω which is user-selected in an appropriate way. Under regularity conditions (see Beskos & Roberts, 2005) one can use the Girsanov formula to derive and expression for the Radon-Nikodym derivative

dQdZ(ω)G(ω)

where, in principle, one can arrange that 𝖦(⋅) ∈ (0, 1). Thus, a rejection sampling strategy will be appropriate:

Algorithm 3.

Algorithm 3. Exact sampling for general diffusions

1: Sample ω ∼ ℤ;

2: Accept ω as a draw from ℚ with probability 𝖦(ω) ∈ (0, 1).

The proposal distribution ℤ has to be selected in a way that Step 1 of Algorithm 3 can be done efficiently, i.e. a biased Brownian Bridge measure. We refer the reader to Beskos and Roberts (2005) for a detailed description of all of the conditions and the general setup.

General Wright-Fisher diffusions. If the diffusion coefficient in (14) takes the form

σ(x)=x(1x),

then the SDE (14) is a general WF diffusion. As Jenkins and Spanó (2015) suggest, in this case, it is efficient to select the proposal measure ℤ to be the law induced by the neutral WF diffusion (11).

References

Aït-Sahalia, Y. (2002): “Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach,” Econometrica, 70, 223–262.10.1111/1468-0262.00274Search in Google Scholar

Aït-Sahalia, Y. (2008): “Closed-form likelihood expansions for multivariate diffusions,” Ann. Stat., 36, 906–937.10.3386/w8956Search in Google Scholar

Alachiotis, N., A. Stamatakis and P. Pavlidis (2012): “OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets,” Bioinformatics, 28, 2274–2275.10.1093/bioinformatics/bts419Search in Google Scholar PubMed

Beskos, A. and G. O. Roberts (2005): “Exact simulation of diffusions,” Ann. Appl. Probab., 15, 2422–2444.10.1214/105051605000000485Search in Google Scholar

Beskos, A., O. Papaspiliopoulos, G. O. Roberts and P. Fearnhead (2006a): “Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes,” J. R. Stat. Soc. Series B Stat. Methodol., 68, 333–382.10.1111/j.1467-9868.2006.00552.xSearch in Google Scholar

Beskos, A., O. Papaspiliopoulos and G. O. Roberts (2006b): “Retrospective exact simulation of diffusion sample paths with applications,” Bernoulli, 12, 1077–1098.10.3150/bj/1165269151Search in Google Scholar

Beskos, A., O. Papaspiliopoulos and G. O. Roberts (2008): “A factorisation of diffusion measure and finite sample path constructions,” Methodol. Comput. Appl. Probab., 10, 85–104.10.1007/s11009-007-9060-4Search in Google Scholar

Brandt, M. W. and P. Santa-Clara (2002): “Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets,” J. Financ. Econ., 63, 161–210.10.3386/t0274Search in Google Scholar

Burbrink, F. T. and T. J. Guiher (2015): “Considering gene flow when using coalescent methods to delimit lineages of North American pitvipers of the genus Agkistrodon,” Zool. J. Linn. Soc., 173, 505–526.10.1111/zoj.12211Search in Google Scholar

Bustamante, C. D., J. Wakeley, S. Sawyer and D. L. Hartl (2001): “Directional selection and the site-frequency spectrum,” Genetics, 159, 1779–1788.10.1093/genetics/159.4.1779Search in Google Scholar PubMed PubMed Central

Coffman, A. C., P. H. Hsieh, S. Gravel and R. N. Gutenkunst (2015): “Computationally efficient composite likelihood statistics for demographic inference,” Mol. Biol. Evol., 33, 591–593.10.1093/molbev/msv255Search in Google Scholar PubMed PubMed Central

Devroye, L. (1986): Non-uniform random variate generation, New York: Springer-Verlag.10.1007/978-1-4613-8643-8Search in Google Scholar

Durham, G. B. and A. R. Gallant (2002): “Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes,” J. Bus. Econ. Stat., 20, 297–316.10.1198/073500102288618397Search in Google Scholar

Eckert, A. J. and B. C. Carstens (2008): “Does gene flow destroy phylogenetic signal? The performance of three methods for estimating species phylogenies in the presence of gene flow,” Mol. Phylogenetics Evol., 49, 832–842.10.1016/j.ympev.2008.09.008Search in Google Scholar PubMed

Edwards, S. V. (2009): “Is a new and general theory of molecular systematics emerging?” Evolution, 63, 1–19.10.1111/j.1558-5646.2008.00549.xSearch in Google Scholar PubMed

Elerian, O., S. Chib, and N. Shephard (2001): “Likelihood inference for discretely observed nonlinear diffusions,” Econometrica, 69, 959–993.10.1111/1468-0262.00226Search in Google Scholar

Etheridge, A. (2011): Some mathematical models from population genetics: École D’Été de Probabilités de Saint-Flour XXXIX-2009, Vol. 39, Heidelberg: Springer Science & Business Media.10.1007/978-3-642-16632-7Search in Google Scholar

Ethier, S. N. and R. C. Griffiths (1993): “The transition function of a Fleming-Viot process,” Ann. Probab., 21, 1571–1590.10.1214/aop/1176989131Search in Google Scholar

Ferrer-Admetlla, A., C. Leuenberger, J. D. Jensen and D. Wegmann (2016): “An approximate Markov model for the Wright-Fisher diffusion and its application to time series data,” Genetics, 203, 831–846.10.1534/genetics.115.184598Search in Google Scholar PubMed PubMed Central

Fisher, R. A. (1930): The genetical theory of natural selection, Oxford: Clarendon Press.10.5962/bhl.title.27468Search in Google Scholar

Foll, M., H. Shim and J. D. Jensen (2015): “WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data,” Mol. Ecol. Resour., 15, 87–98.10.1111/1755-0998.12280Search in Google Scholar PubMed

Golightly, A. and D. J. Wilkinson (2008): “Bayesian inference for nonlinear multivariate diffusion models observed with error,” Comput. Stat. Data Anal., 52, 1674–1693.10.1016/j.csda.2007.05.019Search in Google Scholar

Griffiths, R. C. (1979): “A transition density expansion for a multi-allele diffusion model,” Adv. Appl. Probab., 11, 310–325.10.2307/1426842Search in Google Scholar

Griffiths, R. C. and D. Spanò (2013): “Orthogonal polynomial kernels and canonical correlations for Dirichlet measures,” Bernoulli, 19, 548–598.10.3150/11-BEJ403Search in Google Scholar

Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson and C. D. Bustamante (2009): “Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data,” PLoS Genet., 5, e10000695.10.1371/journal.pgen.1000695Search in Google Scholar PubMed PubMed Central

Gutenkunst, R., R. Hernandez, S. Williamson and C. Bustamante (2010): “Diffusion approximations for demographic inference: DaDi,” Nat. Prec., <http://hdl.handle.net/10101/npre.2010.4594.1>.10.1038/npre.2010.4594.1Search in Google Scholar

Hey, J. and R. Nielsen (2007): “Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics,” Proc. Natl. Acad. Sci. USA, 104, 2785–2790.10.1073/pnas.0611164104Search in Google Scholar PubMed PubMed Central

Jenkins, P. A. and D. Spanó (2015): “Exact simulation of the Wright-Fisher diffusion,” arXiv:1506.06998v1.10.1214/16-AAP1236Search in Google Scholar

Jewett, E. M., M. Steinrücken and Y. S. Song (2016): “The effects of population size histories on estimates of selection coefficients from time-series genetic data,” Mol. Biol. Evol., 33, 3002–3027.10.1093/molbev/msw173Search in Google Scholar PubMed PubMed Central

Karatzas, I. and S. Shreve (1998): Brownian motion and stochastic calculus, vol. 113, New York: Springer Science & Business Media.10.1007/978-1-4612-0949-2Search in Google Scholar

Kim, Y. and R. Nielsen (2004): “Linkage disequilibrium as a signature of selective sweeps,” Genetics, 167, 1513–1524.10.1534/genetics.103.025387Search in Google Scholar

Kingman, J. F. C. (1982a): “The coalescent,” Stoch. Process. Appl., 13, 235–248.10.1016/0304-4149(82)90011-4Search in Google Scholar

Kingman, J. F. C. (1982b): “Exchangeability and the evolution of large populations.” In: Koch, G. and F. Spizzichino, editors, Exchangeability in Probability and Statistics, North-Holland, Amsterdam, pp. 97–112.Search in Google Scholar

Kingman, J. F. C. (1982c): “On the genealogy of large populations,” J. Appl. Probab., 19A, 27–43.10.1017/S0021900200034446Search in Google Scholar

Kloeden, P. E. and E. Platen (1992): Numerical solution of stochastic differential equations (stochastic modelling and applied probability), Springer-Verlag.10.1007/978-3-662-12616-5Search in Google Scholar

Knowles, L. L. and L. S. Kubatko (2010): Estimating species trees: practical and theoretical aspects, Wiley-Blackwell.Search in Google Scholar

Leache, A. D., R. B. Harris, B. Rannala and Z. Yang (2014): “The influence of gene flow on species tree estimation: a simulation study,” Syst. Biol., 63, 17–30.10.1093/sysbio/syt049Search in Google Scholar PubMed

Li, H. and W. Stephan (2005): “Maximum-likelihood methods for detecting recent positive selection and localizing the selected site in the genome,” Genetics, 171, 377–384.10.1534/genetics.105.041368Search in Google Scholar PubMed PubMed Central

Lin, M., R. Chen and P. Mykland (2010): “On generating Monte Carlo samples of continuous diffusion bridges,” J. Am. Stat. Assoc., 105, 820–838.10.1198/jasa.2010.tm09057Search in Google Scholar

Lo, A. W. (1988): “Maximum likelihood estimation of generalized Itô processes with discretely sampled data,” Econ. Theor., 4, 231–247.10.3386/t0059Search in Google Scholar

Malaspinas, A.-S., O. Malaspinas, S. N. Evans and M. Slatkin (2012): “Estimating allele age and selection coefficient from time-serial data,” Genetics, 192, 599–607.10.1534/genetics.112.140939Search in Google Scholar PubMed PubMed Central

Nielsen, R. and J. Wakeley (2001): “Distinguishing migration from isolation: a Markov chain Monte Carlo approach,” Genetics, 158, 885–896.10.1093/genetics/158.2.885Search in Google Scholar PubMed PubMed Central

Nielsen, R., S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark and C. Bustamante (2005): “Genomic scans for selective sweeps using SNP data,” Genome Res., 15, 1566–1575.10.1101/gr.4252305Search in Google Scholar

Pavlidis, P., D. Živković, A. Stamatakis and N. Alachiotis (2013): “SweeD: likelihood-based detection of selective sweeps in thousands of genomes,” Mol. Biol. Evol., 30, 2224–2234.10.1093/molbev/mst112Search in Google Scholar

Pedersen, A. R. (1995): “A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations,” Scand. J. Stat., 22, 55–71.Search in Google Scholar

Robinson, J. D., A. J. Coffman, M. J. Hickerson and R. N. Gutenkunst (2014): “Sampling strategies for frequency spectrum-based population genomic inference,” BMC Evol. Biol., 14, 254.10.1186/s12862-014-0254-4Search in Google Scholar

Ronen, R., N. Upda, E. Halperin and V. Bafna (2013): “Learning natural selection from the site frequency spectrum,” Genetics, 195, 181–193.10.1007/978-3-642-37195-0_19Search in Google Scholar

Schraiber, J. G., S. N. Evans and M. Slatkin (2016): “Bayesian inference of natural selection from allele frequency time series,” Genetics, 203, 493–511.10.1534/genetics.116.187278Search in Google Scholar

Scott, D. W. (1992): Multivariate density estimation: theory, practice, and visualization, John Wiley.10.1002/9780470316849Search in Google Scholar

Song, Y. S. and M. Steinrücken (2012): “A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection,” Genetics, 190, 1117–1129.10.1534/genetics.111.136929Search in Google Scholar

Steinrücken, M., Y. X. R. Wang and Y. S. Song (2013): “An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection,” Theor. Popul. Biol., 83, 1–14.10.1016/j.tpb.2012.10.006Search in Google Scholar

Sun, L., C. Lee and J. A. Hoeting (2015): “Parameter inference and model selection in deterministic and stochastic dynamical models via approximate bayesian computation: modeling a wildlife epidemic,” Environmetrics, 26, 451–462.10.1002/env.2353Search in Google Scholar

Tavaré, S. (1984): “Line-of-descent and genealogical processes, and their applications in population genetics models,” Theor. Popul. Biol., 26, 119–164.10.1016/0040-5809(84)90027-3Search in Google Scholar

Wakeley, J. (2009): Coalescent theory: an introduction, Roberts and Company.Search in Google Scholar

Wakeley, J. and J. Hey (1997): “Estimating ancestral population parameters,” Genetics, 145, 847–855.10.1093/genetics/145.3.847Search in Google Scholar PubMed PubMed Central

Williamson, S. H., R. Hernandez, A. Fledel-Alon, L. Zhu, R. Nielsen and C. D. Bustamante (2005): Simultaneous inference of selection and population growth from patterns of variation in the human genome,” Proc. Natl. Acad. Sci., 102, 7882–7887.10.1073/pnas.0502300102Search in Google Scholar PubMed PubMed Central

Wright, S. (1931): “Evolution in Mendelian populations,” Genetics, 16, 97–159.10.1093/genetics/16.2.97Search in Google Scholar PubMed PubMed Central

Wutke, S., N. Benecke, E. Sandoval-Castellanos, H.-J. Dohle, S. Friederich, J. Gonzalez, J. H. Hallsson, M. Hofreiter, L. Lougas, O. Magnell, A. Morales-Muniz, L. Orlando, A. H. Palsdottir, M. Reissmann, M. Ruttkay, A. Trinks and A. Ludwig (2016): Spotted phenotypes in horses lost attractiveness in the Middle Ages,” Sci. Rep., 6, 38548.10.1038/srep38548Search in Google Scholar PubMed PubMed Central

Zhou, D., N. Upda, M. Gersten, D. W. Visk, A. Bashir, J. Xue, K. A. Frazer, J. W. Posakony, S. Subramaniam, V. Bafna and G. G. Haddad (2011): Experimental selection of hypoxia-tolerant Drosophila melanogaster,” Proc. Natl. Acad. Sci., 108, 2349–2354.10.1073/pnas.1010643108Search in Google Scholar PubMed PubMed Central

Živković, D., M. Steinrücken, Y. S. Song and W. Stephan (2015): “Transition densities and sample frequency spectra of diffusion processes with selection and variable population size,” Genetics, 200, 601–617.10.1534/genetics.115.175265Search in Google Scholar PubMed PubMed Central

Published Online: 2018-6-6

©2018 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 29.9.2023 from https://www.degruyter.com/document/doi/10.1515/sagmb-2017-0046/html
Scroll to top button