Published by De Gruyter July 2, 2014

Protein domain hierarchy Gibbs sampling strategies

Andrew F. Neuwald


Hierarchically arranged multiple sequence alignment profiles are useful for modeling protein domains that have functionally diverged into evolutionarily related subgroups. Currently such alignment hierarchies are largely constructed through manual curation, as for the NCBI Conserved Domain Database (CDD). Recently, however, I developed a Gibbs sampler that uses an approach termed statistical evolutionary dynamics analysis to accomplish this task automatically while simultaneously identifying sequence determinants of protein function. Here I describe the statistical model and sampling strategies underlying this sampler. When implemented and applied to simulated protein sequences (which conform precisely to the underlying statistical model), these sampling strategies efficiently converge on the hierarchy used to generate the sequences. For real protein sequences, however, the sampler finds alternative, nearly optimal hierarchies for many domains, indicating a significant degree of ambiguity. I illustrate how both the nature of such ambiguities and the most robust (“consensus”) features of a hierarchy may be determined from an ensemble of independently generated hierarchies for the same domain. Such consensus hierarchies can provide reliably stable models of protein domain functional divergence.
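The idea of reading consensus features off an ensemble of independently generated hierarchies can be illustrated with a minimal sketch. This is not the procedure used in the article, whose details are not given in the abstract. It assumes, purely for illustration, that each hierarchy is a dictionary mapping a node name to a pair (parent name, set of sequence IDs assigned directly to that node), and that a subgroup counts as robust when the same set of sequences recurs in at least a chosen fraction of the runs. The names `subgroup_sets`, `consensus_subgroups` and `min_support`, and the toy data, are all hypothetical.

```python
from collections import Counter


def subgroup_sets(hierarchy):
    """Return one frozenset of sequence IDs per node: the node's own
    sequences plus those of all its descendants.

    Assumed (hypothetical) format: {node: (parent_or_None, set_of_sequence_ids)}.
    """
    children = {}
    for node, (parent, _) in hierarchy.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)

    def collect(node):
        members = set(hierarchy[node][1])
        for child in children.get(node, []):
            members |= collect(child)
        return frozenset(members)

    return {collect(node) for node in hierarchy}


def consensus_subgroups(hierarchies, min_support=0.5):
    """Map each subgroup (as a frozenset of sequence IDs) to the fraction of
    hierarchies containing it, keeping those with support >= min_support."""
    counts = Counter()
    for hierarchy in hierarchies:
        counts.update(subgroup_sets(hierarchy))  # each subgroup counted once per run
    n_runs = len(hierarchies)
    return {
        group: count / n_runs
        for group, count in counts.items()
        if count / n_runs >= min_support
    }


# Toy ensemble of three runs over sequences a..f (made-up data).
runs = [
    {"root": (None, {"a"}), "A": ("root", {"b", "c"}), "B": ("root", {"d", "e", "f"})},
    {"root": (None, {"a"}), "A": ("root", {"b", "c"}), "B": ("root", {"d", "e"}),
     "C": ("B", {"f"})},
    {"root": (None, {"a"}), "A": ("root", {"b", "c", "d"}), "B": ("root", {"e", "f"})},
]

for group, support in sorted(consensus_subgroups(runs).items(),
                             key=lambda item: (-item[1], sorted(item[0]))):
    print(sorted(group), round(support, 2))
```

Subgroups that fall below the threshold are the ambiguous ones; listing them with their support values gives a rough picture of where the independent runs disagree.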

Corresponding author: Andrew F. Neuwald, Institute for Genome Sciences and Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, BioPark II, Room 617, 801 West Baltimore St., Baltimore, MD 21201, USA, e-mail:


This work was supported by the School of Medicine at the University of Maryland, Baltimore and by a contract from the NIH (HHSN263000099957I). I thank John L. Spouge for critical reading of the manuscript.

Disclosure statement:

No competing financial interests exist.



Supplemental Material

The online version of this article (DOI: 10.1515/sagmb-2014-0008) offers supplementary material, available to authorized users.

Published Online: 2014-7-2
Published in Print: 2014-8-1

© 2014 by De Gruyter
