
Statistical Applications in Genetics and Molecular Biology

Editor-in-Chief: Sanguinetti, Guido

Volume 13, Issue 4



Protein domain hierarchy Gibbs sampling strategies

Andrew F. Neuwald
  • Corresponding author
  • Institute for Genome Sciences and Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, BioPark II, Room 617, 801 West Baltimore St., Baltimore, MD 21201, USA
Published Online: 2014-07-02 | DOI: https://doi.org/10.1515/sagmb-2014-0008


Hierarchically arranged multiple sequence alignment profiles are useful for modeling protein domains that have functionally diverged into evolutionarily related subgroups. Currently, such alignment hierarchies are largely constructed through manual curation, as for the NCBI Conserved Domain Database (CDD). Recently, however, I developed a Gibbs sampler that uses an approach termed statistical evolutionary dynamics analysis to accomplish this task automatically while, at the same time, identifying sequence determinants of protein function. Here I describe the statistical model and sampling strategies underlying this sampler. When implemented and applied to simulated protein sequences (which conform to the underlying statistical model precisely), these sampling strategies efficiently converge on the hierarchy used to generate the sequences. However, for real protein sequences, the sampler finds alternative, nearly optimal hierarchies for many domains, indicating a significant degree of ambiguity. I illustrate how both the nature of such ambiguities and the most robust (“consensus”) features of a hierarchy may be determined from an ensemble of independently generated hierarchies for the same domain. Such consensus hierarchies can provide reliably stable models of protein domain functional divergence.
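
The idea of extracting consensus features from an ensemble of hierarchies can be illustrated with a minimal sketch. This is not the procedure used in the article; it borrows the majority-rule idea familiar from phylogenetic consensus trees. The nested-dict representation of a hierarchy, the function names `subgroups` and `consensus`, the toy sequence ids, and the 0.5 support threshold are all assumptions made for illustration.

```python
from collections import Counter


def subgroups(hierarchy):
    """Return every subgroup in a hierarchy as a frozenset of sequence ids.

    Assumed (hypothetical) representation: a nested dict whose values are
    either dicts of child nodes or lists of sequence ids at a leaf node.
    Node labels are ignored, so hierarchies are compared by membership only.
    """
    found = set()

    def collect(node):
        if isinstance(node, dict):
            # An internal node contains everything its children contain.
            members = frozenset().union(*(collect(child) for child in node.values()))
        else:
            members = frozenset(node)
        found.add(members)
        return members

    collect(hierarchy)
    return found


def consensus(ensemble, threshold=0.5):
    """Split subgroups into robust (support > threshold) and ambiguous ones."""
    counts = Counter()
    for hierarchy in ensemble:
        counts.update(subgroups(hierarchy))
    support = {group: n / len(ensemble) for group, n in counts.items()}
    robust = {g: s for g, s in support.items() if s > threshold}
    ambiguous = {g: s for g, s in support.items() if s <= threshold}
    return robust, ambiguous


# Three toy hierarchies standing in for independent sampler runs.
ensemble = [
    {"root": {"A": ["s1", "s2", "s3"], "B": ["s4", "s5", "s6"]}},
    {"root": {"A": ["s1", "s2", "s3"],
              "B": {"B1": ["s4", "s5"], "B2": ["s6"]}}},
    {"root": {"A": ["s1", "s2"], "B": ["s3", "s4", "s5", "s6"]}},
]

robust, ambiguous = consensus(ensemble)
for group, share in sorted(robust.items(), key=lambda item: -item[1]):
    print(sorted(group), round(share, 2))
```

In this toy ensemble, the subgroups {s1, s2, s3} and {s4, s5, s6} appear in two of the three hierarchies and are kept as robust, while the placement of s3 and the finer split of the second subgroup fall into the ambiguous set. A real analysis would also need to handle sequences assigned directly to internal nodes, which this sketch does not model.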

This article offers supplementary material which is provided at the end of the article.

Keywords: Markov chain Monte Carlo; computer algorithm; Bayesian statistics; protein sequence analysis



About the article


Published in Print: 2014-08-01

Citation Information: Statistical Applications in Genetics and Molecular Biology, Volume 13, Issue 4, Pages 497–517, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2014-0008.

© 2014 by De Gruyter.

Citing Articles


  • Andrew F. Neuwald, L. Aravind, and Stephen F. Altschul. eLife, 2018, Volume 7.
  • Andrew F. Neuwald, Stephen F. Altschul, and Christine A. Orengo. PLOS Computational Biology, 2016, Volume 12, Number 5, Page e1004936.
  • Andrew F. Neuwald, Stephen F. Altschul, and Marco Punta. PLOS Computational Biology, 2016, Volume 12, Number 12, Page e1005294.
