In clinical outcome prediction, such as disease diagnosis and prognosis, it is often assumed that the classes, e.g., disease and control, are equally distributed. In practice, however, biological and clinical data often have a highly skewed class distribution. Because standard supervised learning algorithms aim to maximize overall prediction accuracy, a model trained on such imbalanced data tends to be strongly biased toward the majority class. The class distribution should therefore be incorporated appropriately when learning from imbalanced data. To address this practically important problem, we propose balanced gradient boosting (BalaBoost), which reformulates gradient boosting to use an equal class distribution instead of the empirical class distribution, thereby avoiding overfitting to the majority class and remaining sensitive to the minority class. We applied BalaBoost to three clinical tasks whose class distributions are highly skewed: cancer tissue diagnosis based on miRNA expression data, premature death prediction for diabetes patients based on biochemical and clinical variables, and tumor grade prediction for renal cell carcinoma based on tumor marker expression. In our experiments, BalaBoost outperformed representative supervised learning algorithms, namely gradient boosting, Random Forests, and Support Vector Machines. We conclude that BalaBoost is promising for clinical outcome prediction from imbalanced data.
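The contrast between the empirical and the equal class distribution can be illustrated with a minimal sketch. It is not the BalaBoost algorithm, whose formulation is not given in this abstract; it only shows, on made-up toy labels, how a classifier that always predicts the majority class is scored under the two weightings. The function names `balanced_weights` and `weighted_accuracy` and the 95/5 class split are assumptions chosen for illustration.

```python
from collections import Counter


def balanced_weights(labels):
    """Per-sample weights under which every class carries equal total mass.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    # Each class receives total weight 1/n_classes, shared evenly by its samples.
    return [1.0 / (n_classes * counts[y]) for y in labels]


def weighted_accuracy(y_true, y_pred, weights):
    """Fraction of the total weight that falls on correctly predicted samples."""
    total = sum(weights)
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


# Toy labels (assumed): 95 controls (0) and 5 disease cases (1).
y_true = [0] * 95 + [1] * 5
# A degenerate model that ignores the minority class entirely.
always_majority = [0] * len(y_true)

# Empirical class distribution: every sample counts the same.
empirical = [1.0 / len(y_true)] * len(y_true)
# Equal class distribution: every class counts the same.
balanced = balanced_weights(y_true)

acc_empirical = weighted_accuracy(y_true, always_majority, empirical)
acc_balanced = weighted_accuracy(y_true, always_majority, balanced)

print(f"accuracy under empirical weighting: {acc_empirical:.2f}")
print(f"accuracy under balanced weighting:  {acc_balanced:.2f}")
```

Under the empirical weighting the degenerate model scores 0.95, which is why an accuracy-maximizing learner can settle on it. Under the balanced weighting it scores only 0.50, so errors on the minority class are no longer negligible.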

Balanced Gradient Boosting from Imbalanced Data for Clinical Outcome Prediction
Bio-IT Center, NEC Corporation
Citation Information: Statistical Applications in Genetics and Molecular Biology. Volume 8, Issue 1, Pages 1–19, ISSN (Online) 1544-6115, DOI: 10.2202/1544-6115.1422, April 2009
Published Online: 2009-04-07
Keywords: clinical outcome; diagnosis; cancer; diabetes; renal cell carcinoma; ensemble learning; boosting; cost-sensitive learning; imbalanced data

















