Published by De Gruyter, March 14, 2014

A boosting approach for adapting the sparsity of risk prediction signatures based on different molecular levels

Murat Sariyar, Martin Schumacher and Harald Binder


Risk prediction models can link high-dimensional molecular measurements, such as DNA methylation, to clinical endpoints. For biological interpretation, a sparse fit is often desirable. Different molecular aggregation levels, such as DNA methylation at the CpG, gene, or chromosome level, might demand different degrees of sparsity. Hence, model building and estimation techniques should be able to adapt their sparsity to the setting. Underestimation of coefficients, a typical problem of sparse techniques, should also be addressed. We propose a boosting-based approach that allows flexible adaptation of model sparsity and addresses both problems in an integrated way. The main motivation is automatic sparsity adaptation. In a simulation study, we show that this approach reduces underestimation in sparse settings and selects more adequate model sizes than the corresponding non-adaptive boosting technique in non-sparse settings. Using different aggregation levels of DNA methylation data from a study of kidney carcinoma patients, we illustrate how automatically selected values of the sparsity tuning parameter can reflect the underlying structure of the data. In addition, prediction performance and variable selection stability are compared with those of the non-adaptive boosting approach.
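
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes plain componentwise L2 boosting for a linear model (rather than a survival model), centered covariates, and a hypothetical parameter `stepsize_factor` that rescales the step size of a covariate each time it is selected. A factor above 1 favors covariates that are already in the model (sparser fits, less shrinkage); a factor below 1 disfavors them; a factor of 1 gives the non-adaptive version.

```python
def componentwise_boost(X, y, steps, nu=0.1, stepsize_factor=1.0):
    """Componentwise L2 boosting with a per-covariate step size.

    Assumptions (not from the source): linear model, centered covariates
    and response, and `stepsize_factor` as an illustrative name.
    X is a list of rows; y is a list of responses.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    beta = [0.0] * p          # current coefficient estimates
    step = [nu] * p           # covariate-specific step sizes
    resid = list(y)           # current residuals

    for _ in range(steps):
        best_j, best_gain, best_b = 0, -1.0, 0.0
        for j, x in enumerate(cols):
            xx = sum(v * v for v in x)
            if xx == 0.0:
                continue
            xr = sum(v * r for v, r in zip(x, resid))
            b = xr / xx  # least-squares fit of the residuals on covariate j
            # Reduction in residual sum of squares from the shrunken
            # update step[j] * b; larger steps yield larger reductions.
            gain = (2.0 * step[j] - step[j] ** 2) * xr * b
            if gain > best_gain:
                best_j, best_gain, best_b = j, gain, b

        # Update only the selected covariate.
        delta = step[best_j] * best_b
        beta[best_j] += delta
        resid = [r - delta * v for r, v in zip(resid, cols[best_j])]
        # Adapt the step size of the selected covariate, capped at 1.
        step[best_j] = min(1.0, step[best_j] * stepsize_factor)
    return beta


# Toy data: two orthogonal covariates, only the first is informative.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]  # y = 2 * first covariate

non_adaptive = componentwise_boost(X, y, steps=10)
adaptive = componentwise_boost(X, y, steps=10, stepsize_factor=2.0)
print("non-adaptive:", non_adaptive)
print("adaptive:    ", adaptive)
```

In this toy setting the non-adaptive fit, stopped after ten steps, still underestimates the true coefficient of 2, while the fit with growing step sizes reaches it. How the tuning parameter is chosen automatically in the paper is not described in the abstract and is not modeled here.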

Corresponding author: Murat Sariyar, Institute of Medical Biostatistics, Epidemiology and Informatics, Medical Center of the Johannes Gutenberg University, 55131 Mainz, Germany; and Institute of Pathology, Charite – University Medicine Berlin, Campus Benjamin Franklin, Berlin 12200, Germany, e-mail:


Published Online: 2014-3-14
Published in Print: 2014-6-1

©2014 by Walter de Gruyter Berlin/Boston