Abstract
Modern biotechnologies have produced vast amounts of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, controlling false discoveries (i.e., the inclusion of irrelevant variables) in penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries in penalized variable selection, we propose a false discovery controlling procedure. The proposed method is general and flexible: it can work with a broad class of variable selection algorithms, not only for linear regression but also for generalized linear models and survival analysis.
Funding source: Chinese Natural Science Foundation
Award Identifier / Grant number: 11528102
Funding statement: The authors thank Dr. Kirsten Herold at the UM-SPH Writing Lab for her helpful suggestions. This work was supported by the Chinese Natural Science Foundation (Grant Number: 11528102).
Supplementary Material
The online version of this article offers supplementary material (DOI: https://doi.org/10.1515/sagmb-2018-0038).
©2018 Walter de Gruyter GmbH, Berlin/Boston