
Statistical Applications in Genetics and Molecular Biology

Editor-in-Chief: Sanguinetti, Guido


Volume 15, Issue 4


Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches

Chamont Wang / Jana L. Gevertz
Published Online: 2016-05-25 | DOI: https://doi.org/10.1515/sagmb-2015-0072


Modern biological experiments often involve high-dimensional data with thousands of variables or more. A challenging problem is to identify the key variables that are related to a specific disease. Complicating this task is the vast number of statistical methods available for variable selection. For this reason, we set out to develop a framework to investigate the variable selection capability of statistical methods that are commonly applied to analyze high-dimensional biological datasets. Specifically, we designed six simulated cancers (based on benchmark colon and prostate cancer data) in which we know precisely which genes cause a dataset to be classified as cancerous or normal; we call these causative genes. We found that none of the statistical methods tested could identify all the causative genes for all of the simulated cancers, although increasing the sample size did improve the variable selection capabilities in most cases. Furthermore, certain statistical tools can classify our simulated data with a low error rate, yet the variables used for classification are not necessarily the causative genes.

Keywords: classification; false discovery rate; gene identification; shrinkage and regularization techniques; variable selection
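
The evaluation idea described in the abstract (simulate data in which the causative genes are known, run a variable selection method, then check whether the known genes are recovered) can be illustrated with a small sketch. This is not the authors' framework or any of their six simulated cancers: the labeling rule (a sample is "cancerous" when the sum of its causative-gene expressions is positive), the Gaussian expression values, the Welch t-statistic ranking, and all function names are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate(n_samples, n_genes, causative, seed=0):
    """Toy data (assumed rule, not from the paper): label is 1 when the
    sum of the causative genes' expression values is positive."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
         for _ in range(n_samples)]
    y = [1 if sum(row[g] for g in causative) > 0 else 0 for row in X]
    return X, y


def welch_t(a, b):
    """Welch two-sample t-statistic; returns 0.0 when it is undefined."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    if se == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se


def rank_genes(X, y):
    """Rank gene indices by |t| between the two classes, largest first."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        cases = [row[g] for row, label in zip(X, y) if label == 1]
        controls = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(cases, controls)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


def recall_at_k(ranking, causative, k):
    """Fraction of the known causative genes found among the top k."""
    return len(set(ranking[:k]) & set(causative)) / len(causative)


if __name__ == "__main__":
    causative = [0, 1, 2]  # hypothetical ground truth
    for n in (20, 60, 200):  # vary the sample size
        X, y = simulate(n, 500, causative, seed=1)
        ranking = rank_genes(X, y)
        print(n, recall_at_k(ranking, causative, k=10))
```

Because the ground truth is fixed by construction, the recall of a selection method can be measured directly, and rerunning with different sample sizes shows how recovery changes with the amount of data. A univariate ranking such as this one is only a stand-in; any selection method that returns a ranked or selected gene list could be scored the same way.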



About the article

Published Online: 2016-05-25

Published in Print: 2016-08-01

Citation Information: Statistical Applications in Genetics and Molecular Biology, Volume 15, Issue 4, Pages 321–347, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2015-0072.


©2016 by De Gruyter.
