Abstract
Risk prediction models can link high-dimensional molecular measurements, such as DNA methylation, to clinical endpoints. For biological interpretation, a sparse fit is often desirable. Different molecular aggregation levels, such as considering DNA methylation at the CpG, gene, or chromosome level, may demand different degrees of sparsity. Hence, model building and estimation techniques should be able to adapt their sparsity to the setting. In addition, underestimation of coefficients, a typical problem of sparse techniques, should be addressed. We propose a comprehensive approach, based on a boosting technique, that allows flexible adaptation of model sparsity and addresses these problems in an integrative way. The main motivation is automatic sparsity adaptation. In a simulation study, we show that this approach reduces underestimation in sparse settings and selects more adequate model sizes than the corresponding non-adaptive boosting technique in non-sparse settings. Using different aggregation levels of DNA methylation data from a study in kidney carcinoma patients, we illustrate how automatically selected values of the sparsity tuning parameter can reflect the underlying structure of the data. In addition, prediction performance and variable selection stability are compared with those of the non-adaptive boosting approach.