Statistical Applications in Genetics and Molecular Biology

Editor-in-Chief: Sanguinetti, Guido

Volume 17, Issue 4


Comparisons of classification methods for viral genomes and protein families using alignment-free vectorization

Hsin-Hsiung Huang
  • Corresponding author
  • Department of Statistics, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL 32816, USA
Shuai Hao
  • Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, USA
Saul Alarcon
  • Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, USA
Jie Yang
  • Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, USA
Published Online: 2018-06-30 | DOI: https://doi.org/10.1515/sagmb-2018-0004


In this paper, we propose a statistical classification method based on discriminant analysis that uses the first and second moments of the positions of each nucleotide in a genome sequence as features, and we compare its performance with that of other classification methods, as well as with the Natural Vector method, for comparative genomic analysis. We examine the normality of the proposed features. The statistical classification models used include linear discriminant analysis, quadratic discriminant analysis, diagonal linear discriminant analysis, the k-nearest-neighbor classifier, logistic regression, support vector machines, and classification trees. All of these classifiers are tested on a viral genome dataset and a protein dataset to predict viral Baltimore labels, viral family labels, and protein family labels.

Keywords: viral genomes; protein; family labels; Natural Vector; statistical classification models
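
As a rough illustration of the kind of alignment-free vectorization the abstract describes, the sketch below turns a nucleotide sequence into per-letter position moments. It is not the authors' implementation: the function name `moment_features`, the inclusion of the letter counts, the 1-based positions, and the scaling of the second central moment by (count × sequence length) are all assumptions, the last one borrowed from common Natural Vector definitions.

```python
def moment_features(seq, alphabet="ACGT"):
    """Return [count, mean position, scaled second central moment] per letter.

    Hypothetical helper for illustration only; the exact feature set and
    normalization used in the paper are not given in the abstract.
    """
    seq = seq.upper()
    n = len(seq)
    features = []
    for letter in alphabet:
        # 1-based positions at which this letter occurs (assumed convention)
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        count = len(positions)
        if count == 0:
            # Letter absent: fill its three slots with zeros
            features.extend([0.0, 0.0, 0.0])
            continue
        mean_pos = sum(positions) / count  # first moment of the positions
        # Second central moment, divided by count * n (assumed scaling)
        second = sum((p - mean_pos) ** 2 for p in positions) / (count * n)
        features.extend([float(count), mean_pos, second])
    return features


if __name__ == "__main__":
    print(moment_features("ACGTA"))
```

Every sequence maps to a vector of the same length (12 entries for the four nucleotides), regardless of sequence length, so the resulting vectors could be passed to any of the classifiers listed in the abstract. For protein sequences, the `alphabet` argument would be replaced by the amino acid letters.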


About the article

Citation Information: Statistical Applications in Genetics and Molecular Biology, Volume 17, Issue 4, 20180004, ISSN (Online) 1544-6115, DOI: https://doi.org/10.1515/sagmb-2018-0004.

©2018 Walter de Gruyter GmbH, Berlin/Boston.
