Hierarchically-arranged multiple sequence alignment profiles are useful for modeling protein domains that have functionally diverged into evolutionarily-related subgroups. Currently such alignment hierarchies are largely constructed through manual curation, as for the NCBI Conserved Domain Database (CDD). Recently, however, I developed a Gibbs sampler that uses an approach termed statistical evolutionary dynamics analysis to accomplish this task automatically while, at the same time, identifying sequence determinants of protein function. Here I describe the statistical model and sampling strategies underlying this sampler. When implemented and applied to simulated protein sequences (which conform to the underlying statistical model precisely), these sampling strategies efficiently converge on the hierarchy used to generate the sequences. However, for real protein sequences the sampler finds alternative, nearly-optimal hierarchies for many domains, indicating a significant degree of ambiguity. I illustrate how both the nature of such ambiguities and the most robust (“consensus”) features of a hierarchy may be determined from an ensemble of independently generated hierarchies for the same domain. Such consensus hierarchies can provide reliably stable models of protein domain functional divergence.
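As a rough illustration of how consensus features might be read off an ensemble of independently generated hierarchies, the sketch below applies a simple majority-rule count. This is a stand-in technique, not the procedure used by the sampler described here. The representation of a hierarchy as a list of subgroups (each a set of sequence identifiers), the function name `consensus_subgroups`, the `min_support` threshold, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def consensus_subgroups(ensemble, min_support=0.5):
    """Return subgroups found in more than `min_support` of the hierarchies.

    `ensemble` is a list of hierarchies; each hierarchy is a list of
    subgroups, and each subgroup is a set of sequence identifiers
    (a hypothetical representation chosen for this sketch).
    The result maps each retained subgroup to its support fraction.
    """
    counts = Counter()
    for hierarchy in ensemble:
        # Count each distinct subgroup at most once per hierarchy.
        counts.update({frozenset(group) for group in hierarchy})
    n = len(ensemble)
    return {group: c / n for group, c in counts.items() if c / n > min_support}


# Toy ensemble: three independent runs over sequences s1..s5 (made-up data).
runs = [
    [{"s1", "s2", "s3", "s4", "s5"}, {"s1", "s2"}, {"s3", "s4", "s5"}],
    [{"s1", "s2", "s3", "s4", "s5"}, {"s1", "s2"}, {"s3", "s4"}],
    [{"s1", "s2", "s3", "s4", "s5"}, {"s1", "s2", "s3"}, {"s4", "s5"}],
]
support = consensus_subgroups(runs)
```

In this toy case only the full set and the subgroup containing `s1` and `s2` recur in a majority of runs, while the remaining subgroups differ between runs and would be flagged as ambiguous. A real analysis would presumably also need to weigh how sequences are aligned and scored within each subgroup, which this count ignores.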
This work was supported by the School of Medicine at the University of Maryland, Baltimore and by a contract from the NIH (HHSN263000099957I). I thank John L. Spouge for critical reading of the manuscript.
No competing financial interests exist.
The online version of this article (DOI: 10.1515/sagmb-2014-0008) offers supplementary material, available to authorized users.
© 2014 by De Gruyter