Monte Carlo estimation of total variation distance of Markov chains on large spaces, with application to phylogenetics

Radu Herbei 1 and Laura Kubatko 1
  • 1 The Ohio State University – Statistics, Columbus, OH, USA

Abstract

Markov chains are widely used for modeling in many areas of molecular biology and genetics. As the complexity of such models advances, it becomes increasingly important to assess the rate at which a Markov chain converges to its stationary distribution in order to carry out accurate inference. A common measure of convergence to the stationary distribution is the total variation distance, but this measure can be difficult to compute when the state space of the chain is large. We propose a Monte Carlo method to estimate the total variation distance that can be applied in this situation, and we demonstrate how the method can be efficiently implemented by taking advantage of GPU computing techniques. We apply the method to two Markov chains on the space of phylogenetic trees, and discuss the implications of our findings for the development of algorithms for phylogenetic inference.
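To make the quantity concrete, the following is a minimal sketch of a naive plug-in Monte Carlo estimate of total variation distance. It is not the estimator proposed in the article. It assumes a toy chain (a lazy random walk on a cycle, whose stationary distribution is uniform) and a state space small enough to tabulate; the names `step` and `estimate_tv` are hypothetical. Many independent copies of the chain are run for t steps from a fixed start, and the empirical distribution of their final states is compared with the stationary distribution.

```python
import random
from collections import Counter


def step(state, n_states, rng):
    """One move of a lazy random walk on a cycle (toy chain, an assumption).

    Stay put with probability 1/2, otherwise move one position
    clockwise or counterclockwise with probability 1/4 each.
    """
    u = rng.random()
    if u < 0.5:
        return state
    if u < 0.75:
        return (state + 1) % n_states
    return (state - 1) % n_states


def estimate_tv(n_states, t, n_chains, seed=0):
    """Plug-in estimate of the TV distance between the time-t law and uniform."""
    rng = random.Random(seed)
    counts = Counter()
    # Each replicate is independent of the others, so this loop is the
    # part that could in principle be spread across parallel workers.
    for _ in range(n_chains):
        state = 0  # every replicate starts from the same fixed state
        for _ in range(t):
            state = step(state, n_states, rng)
        counts[state] += 1
    pi = 1.0 / n_states  # stationary probability of each state (uniform)
    # TV distance = half the L1 distance between the two distributions.
    return 0.5 * sum(abs(counts[x] / n_chains - pi) for x in range(n_states))


if __name__ == "__main__":
    for t in (0, 5, 20, 60):
        print(t, estimate_tv(n_states=8, t=t, n_chains=5000))
```

Because the empirical distribution is itself noisy, this plug-in estimate never falls much below a noise floor that grows with the number of states relative to the number of replicates. That is one reason a tabulation of this kind stops being practical on large spaces such as spaces of phylogenetic trees.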


SAGMB publishes significant research on the application of statistical ideas to problems arising from computational biology. The range of topics includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.
