Bayesian inference of selection in the Wright-Fisher diffusion model

Jeffrey J. Gory 1 , Radu Herbei 1  and Laura S. Kubatko 2
  • 1 Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
  • 2 Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, USA
Jeffrey J. Gory, Radu Herbei and Laura S. Kubatko
  • Corresponding author
  • Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, USA
  • Email
  • Search for other articles:
  • degruyter.comGoogle Scholar

Abstract

The increasing availability of population-level allele frequency data across one or more related populations necessitates the development of methods that can efficiently estimate population genetics parameters, such as the strength of selection acting on the population(s), from such data. Existing methods for this problem in the setting of the Wright-Fisher diffusion model are primarily likelihood-based, and rely on numerical approximation for likelihood computation and on bootstrapping for assessment of variability in the resulting estimates, requiring extensive computation. Recent work has provided a method for obtaining exact samples from general Wright-Fisher diffusion processes, enabling the development of methods for Bayesian estimation in this setting. We develop and implement a Bayesian method for estimating the strength of selection based on the Wright-Fisher diffusion for data sampled at a single time point. The method utilizes the latest algorithms for exact sampling to devise a Markov chain Monte Carlo procedure to draw samples from the joint posterior distribution of the selection coefficient and the allele frequencies. We demonstrate that when assumptions about the initial allele frequencies are accurate the method performs well for both simulated data and for an empirical data set on hypoxia in flies, where we find evidence for strong positive selection in a region of chromosome 2L previously identified. We discuss possible extensions of our method to the more general settings commonly encountered in practice, highlighting the advantages of Bayesian approaches to inference in this setting.

  • Aït-Sahalia, Y. (2002): “Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach,” Econometrica, 70, 223–262.

    • Crossref
    • Export Citation
  • Aït-Sahalia, Y. (2008): “Closed-form likelihood expansions for multivariate diffusions,” Ann. Stat., 36, 906–937.

    • Crossref
    • Export Citation
  • Alachiotis, N., A. Stamatakis and P. Pavlidis (2012): “OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets,” Bioinformatics, 28, 2274–2275.

    • Crossref
    • PubMed
    • Export Citation
  • Beskos, A. and G. O. Roberts (2005): “Exact simulation of diffusions,” Ann. Appl. Probab., 15, 2422–2444.

    • Crossref
    • Export Citation
  • Beskos, A., O. Papaspiliopoulos, G. O. Roberts and P. Fearnhead (2006a): “Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes,” J. R. Stat. Soc. Series B Stat. Methodol., 68, 333–382.

    • Crossref
    • Export Citation
  • Beskos, A., O. Papaspiliopoulos and G. O. Roberts (2006b): “Retrospective exact simulation of diffusion sample paths with applications,” Bernoulli, 12, 1077–1098.

    • Crossref
    • Export Citation
  • Beskos, A., O. Papaspiliopoulos and G. O. Roberts (2008): “A factorisation of diffusion measure and finite sample path constructions,” Methodol. Comput. Appl. Probab., 10, 85–104.

    • Crossref
    • Export Citation
  • Brandt, M. W. and P. Santa-Clara (2002): “Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets,” J. Financ. Econ., 63, 161–210.

    • Crossref
    • Export Citation
  • Burbrink, F. T. and T. J. Guiher (2015): “Considering gene flow when using coalescent methods to delimit lineages of North American pitvipers of the genus Agkistrodon,” Zool. J. Linn. Soc., 173, 505–526.

    • Crossref
    • Export Citation
  • Bustamante, C. D., J. Wakeley, S. Sawyer and D. L. Hartl (2001): “Directional selection and the site-frequency spectrum,” Genetics, 159, 1779–1788.

    • PubMed
    • Export Citation
  • Coffman, A. C., P. H. Hsieh, S. Gravel and R. N. Gutenkunst (2015): “Computationally efficient composite likelihood statistics for demographic inference,” Mol. Biol. Evol., 33, 591–593.

    • PubMed
    • Export Citation
  • Devroye, L. (1986): Non-uniform random variate generation, New York: Springer-Verlag.

  • Durham, G. B. and A. R. Gallant (2002): “Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes,” J. Bus. Econ. Stat., 20, 297–316.

    • Crossref
    • Export Citation
  • Eckert, A. J. and B. C. Carstens (2008): “Does gene flow destroy phylogenetic signal? The performance of three methods for estimating species phylogenies in the presence of gene flow,” Mol. Phylogenetics Evol., 49, 832–842.

    • Crossref
    • Export Citation
  • Edwards, S. V. (2009): “Is a new and general theory of molecular systematics emerging?” Evolution, 63, 1–19.

    • Crossref
    • PubMed
    • Export Citation
  • Elerian, O., S. Chib, and N. Shephard (2001): “Likelihood inference for discretely observed nonlinear diffusions,” Econometrica, 69, 959–993.

    • Crossref
    • Export Citation
  • Etheridge, A. (2011): Some mathematical models from population genetics: École D’Été de Probabilités de Saint-Flour XXXIX-2009, Vol. 39, Heidelberg: Springer Science & Business Media.

  • Ethier, S. N. and R. C. Griffiths (1993): “The transition function of a Fleming-Viot process,” Ann. Probab., 21, 1571–1590.

    • Crossref
    • Export Citation
  • Ferrer-Admetlla, A., C. Leuenberger, J. D. Jensen and D. Wegmann (2016): “An approximate Markov model for the Wright-Fisher diffusion and its application to time series data,” Genetics, 203, 831–846.

    • Crossref
    • PubMed
    • Export Citation
  • Fisher, R. A. (1930): The genetical theory of natural selection, Oxford: Clarendon Press.

  • Foll, M., H. Shim and J. D. Jensen (2015): “WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data,” Mol. Ecol. Resour., 15, 87–98.

    • Crossref
    • PubMed
    • Export Citation
  • Golightly, A. and D. J. Wilkinson (2008): “Bayesian inference for nonlinear multivariate diffusion models observed with error,” Comput. Stat. Data Anal., 52, 1674–1693.

    • Crossref
    • Export Citation
  • Griffiths, R. C. (1979): “A transition density expansion for a multi-allele diffusion model,” Adv. Appl. Probab., 11, 310–325.

    • Crossref
    • Export Citation
  • Griffiths, R. C. and D. Spanò (2013): “Orthogonal polynomial kernels and canonical correlations for Dirichlet measures,” Bernoulli, 19, 548–598.

    • Crossref
    • Export Citation
  • Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson and C. D. Bustamante (2009): “Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data,” PLoS Genet., 5, e10000695.

    • PubMed
    • Export Citation
  • Gutenkunst, R., R. Hernandez, S. Williamson and C. Bustamante (2010): “Diffusion approximations for demographic inference: DaDi,” Nat. Prec., <http://hdl.handle.net/10101/npre.2010.4594.1>.

  • Hey, J. and R. Nielsen (2007): “Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics,” Proc. Natl. Acad. Sci. USA, 104, 2785–2790.

    • Crossref
    • Export Citation
  • Jenkins, P. A. and D. Spanó (2015): “Exact simulation of the Wright-Fisher diffusion,” arXiv:1506.06998v1.

  • Jewett, E. M., M. Steinrücken and Y. S. Song (2016): “The effects of population size histories on estimates of selection coefficients from time-series genetic data,” Mol. Biol. Evol., 33, 3002–3027.

    • Crossref
    • PubMed
    • Export Citation
  • Karatzas, I. and S. Shreve (1998): Brownian motion and stochastic calculus, vol. 113, New York: Springer Science & Business Media.

  • Kim, Y. and R. Nielsen (2004): “Linkage disequilibrium as a signature of selective sweeps,” Genetics, 167, 1513–1524.

    • Crossref
    • PubMed
    • Export Citation
  • Kingman, J. F. C. (1982a): “The coalescent,” Stoch. Process. Appl., 13, 235–248.

    • Crossref
    • Export Citation
  • Kingman, J. F. C. (1982b): “Exchangeability and the evolution of large populations.” In: Koch, G. and F. Spizzichino, editors, Exchangeability in Probability and Statistics, North-Holland, Amsterdam, pp. 97–112.

  • Kingman, J. F. C. (1982c): “On the genealogy of large populations,” J. Appl. Probab., 19A, 27–43.

  • Kloeden, P. E. and E. Platen (1992): Numerical solution of stochastic differential equations (stochastic modelling and applied probability), Springer-Verlag.

  • Knowles, L. L. and L. S. Kubatko (2010): Estimating species trees: practical and theoretical aspects, Wiley-Blackwell.

  • Leache, A. D., R. B. Harris, B. Rannala and Z. Yang (2014): “The influence of gene flow on species tree estimation: a simulation study,” Syst. Biol., 63, 17–30.

    • Crossref
    • PubMed
    • Export Citation
  • Li, H. and W. Stephan (2005): “Maximum-likelihood methods for detecting recent positive selection and localizing the selected site in the genome,” Genetics, 171, 377–384.

    • Crossref
    • PubMed
    • Export Citation
  • Lin, M., R. Chen and P. Mykland (2010): “On generating Monte Carlo samples of continuous diffusion bridges,” J. Am. Stat. Assoc., 105, 820–838.

    • Crossref
    • Export Citation
  • Lo, A. W. (1988): “Maximum likelihood estimation of generalized Itô processes with discretely sampled data,” Econ. Theor., 4, 231–247.

    • Crossref
    • Export Citation
  • Malaspinas, A.-S., O. Malaspinas, S. N. Evans and M. Slatkin (2012): “Estimating allele age and selection coefficient from time-serial data,” Genetics, 192, 599–607.

    • Crossref
    • PubMed
    • Export Citation
  • Nielsen, R. and J. Wakeley (2001): “Distinguishing migration from isolation: a Markov chain Monte Carlo approach,” Genetics, 158, 885–896.

    • PubMed
    • Export Citation
  • Nielsen, R., S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark and C. Bustamante (2005): “Genomic scans for selective sweeps using SNP data,” Genome Res., 15, 1566–1575.

    • Crossref
    • PubMed
    • Export Citation
  • Pavlidis, P., D. Živković, A. Stamatakis and N. Alachiotis (2013): “SweeD: likelihood-based detection of selective sweeps in thousands of genomes,” Mol. Biol. Evol., 30, 2224–2234.

    • Crossref
    • PubMed
    • Export Citation
  • Pedersen, A. R. (1995): “A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations,” Scand. J. Stat., 22, 55–71.

  • Robinson, J. D., A. J. Coffman, M. J. Hickerson and R. N. Gutenkunst (2014): “Sampling strategies for frequency spectrum-based population genomic inference,” BMC Evol. Biol., 14, 254.

    • Crossref
    • PubMed
    • Export Citation
  • Ronen, R., N. Upda, E. Halperin and V. Bafna (2013): “Learning natural selection from the site frequency spectrum,” Genetics, 195, 181–193.

    • Crossref
    • PubMed
    • Export Citation
  • Schraiber, J. G., S. N. Evans and M. Slatkin (2016): “Bayesian inference of natural selection from allele frequency time series,” Genetics, 203, 493–511.

    • Crossref
    • PubMed
    • Export Citation
  • Scott, D. W. (1992): Multivariate density estimation: theory, practice, and visualization, John Wiley.

  • Song, Y. S. and M. Steinrücken (2012): “A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection,” Genetics, 190, 1117–1129.

    • Crossref
    • PubMed
    • Export Citation
  • Steinrücken, M., Y. X. R. Wang and Y. S. Song (2013): “An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection,” Theor. Popul. Biol., 83, 1–14.

    • Crossref
    • PubMed
    • Export Citation
  • Sun, L., C. Lee and J. A. Hoeting (2015): “Parameter inference and model selection in deterministic and stochastic dynamical models via approximate bayesian computation: modeling a wildlife epidemic,” Environmetrics, 26, 451–462.

    • Crossref
    • Export Citation
  • Tavaré, S. (1984): “Line-of-descent and genealogical processes, and their applications in population genetics models,” Theor. Popul. Biol., 26, 119–164.

    • Crossref
    • PubMed
    • Export Citation
  • Wakeley, J. (2009): Coalescent theory: an introduction, Roberts and Company.

  • Wakeley, J. and J. Hey (1997): “Estimating ancestral population parameters,” Genetics, 145, 847–855.

    • PubMed
    • Export Citation
  • Williamson, S. H., R. Hernandez, A. Fledel-Alon, L. Zhu, R. Nielsen and C. D. Bustamante (2005): Simultaneous inference of selection and population growth from patterns of variation in the human genome,” Proc. Natl. Acad. Sci., 102, 7882–7887.

    • Crossref
    • Export Citation
  • Wright, S. (1931): “Evolution in Mendelian populations,” Genetics, 16, 97–159.

    • PubMed
    • Export Citation
  • Wutke, S., N. Benecke, E. Sandoval-Castellanos, H.-J. Dohle, S. Friederich, J. Gonzalez, J. H. Hallsson, M. Hofreiter, L. Lougas, O. Magnell, A. Morales-Muniz, L. Orlando, A. H. Palsdottir, M. Reissmann, M. Ruttkay, A. Trinks and A. Ludwig (2016): Spotted phenotypes in horses lost attractiveness in the Middle Ages,” Sci. Rep., 6, 38548.

    • Crossref
    • PubMed
    • Export Citation
  • Zhou, D., N. Upda, M. Gersten, D. W. Visk, A. Bashir, J. Xue, K. A. Frazer, J. W. Posakony, S. Subramaniam, V. Bafna and G. G. Haddad (2011): Experimental selection of hypoxia-tolerant Drosophila melanogaster,” Proc. Natl. Acad. Sci., 108, 2349–2354.

    • Crossref
    • Export Citation
  • Živković, D., M. Steinrücken, Y. S. Song and W. Stephan (2015): “Transition densities and sample frequency spectra of diffusion processes with selection and variable population size,” Genetics, 200, 601–617.

    • Crossref
    • PubMed
    • Export Citation
Purchase article
Get instant unlimited access to the article.
$42.00
Log in
Already have access? Please log in.


or
Log in with your institution

Journal + Issues

SAGMB publishes significant research on the application of statistical ideas to problems arising from computational biology. The range of topics includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarrary data, molecular evolution and phylogenetic trees, DNA topology, and data base search strategies.

Search