The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.

2 Issues per year


IMPACT FACTOR 2016: 0.500
5-year IMPACT FACTOR: 0.862

CiteScore 2016: 0.42

SCImago Journal Rank (SJR) 2016: 0.488
Source Normalized Impact per Paper (SNIP) 2016: 0.467

Mathematical Citation Quotient (MCQ) 2016: 0.09

Online ISSN: 1557-4679

Optimal Spatial Prediction Using Ensemble Machine Learning

Molly Margaret Davies / Mark J. van der Laan
Published Online: 2016-04-29 | DOI: https://doi.org/10.1515/ijb-2014-0060

Abstract

Spatial prediction is an important problem in many scientific disciplines. Super Learner is an ensemble prediction approach related to stacked generalization that uses cross-validation to search for the optimal predictor amongst all convex combinations of a heterogeneous candidate set. It has been applied to non-spatial data, where theoretical results demonstrate it will perform asymptotically at least as well as the best candidate under consideration. We review these optimality properties and discuss the assumptions required in order for them to hold for spatial prediction problems. We present results of a simulation study confirming Super Learner works well in practice under a variety of sample sizes, sampling designs, and data-generating functions. We also apply Super Learner to a real-world dataset.

Keywords: cross-validation; spatial interpolation; generalized stacking; oracle inequality; Super Learner
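
The idea described in the abstract, using cross-validation to pick a convex combination of candidate predictors, can be illustrated with a minimal sketch. This is not the authors' implementation. The three candidate learners (a mean, a linear trend in the coordinates, and k-nearest neighbours), the squared-error loss, the projected-gradient solver, and all function names are assumptions chosen for illustration. The folds are drawn at random, which ignores spatial dependence; the article discusses the assumptions under which the optimality results carry over to spatial data.

```python
import numpy as np

# Hypothetical candidate learners: each takes (X, y) and returns a predict function.
def fit_mean(X, y):
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def fit_knn(X, y, k=5):
    def predict(Xn):
        d = np.linalg.norm(Xn[:, None, :] - X[None, :, :], axis=2)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def convex_weights(Z, y, n_iter=2000):
    """Minimise the cross-validated squared error over convex weights."""
    n, J = Z.shape
    # Start at the single best candidate, so the descent steps can only
    # match or improve on that candidate's cross-validated risk.
    risks = ((Z - y[:, None]) ** 2).mean(axis=0)
    w = np.zeros(J)
    w[np.argmin(risks)] = 1.0
    # Step size 1/L, with L the largest eigenvalue of the Hessian (2/n) Z'Z.
    L = 2.0 * np.linalg.norm(Z, 2) ** 2 / n
    for _ in range(n_iter):
        grad = 2.0 * Z.T @ (Z @ w - y) / n
        w = project_simplex(w - grad / L)
    return w

def super_learner(X, y, learners, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    Z = np.empty((len(y), len(learners)))  # out-of-fold predictions
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        for j, fit in enumerate(learners):
            Z[val, j] = fit(X[train], y[train])(X[val])
    w = convex_weights(Z, y)
    fitted = [fit(X, y) for fit in learners]  # refit on all data
    predict = lambda Xn: np.column_stack([f(Xn) for f in fitted]) @ w
    return predict, w, Z

# Simulated example: two spatial coordinates and a smooth surface plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 200)
predict, w, Z = super_learner(X, y, [fit_mean, fit_linear, fit_knn])
print("convex weights:", np.round(w, 3))
```

Starting the weight search at the best single candidate is a design choice of this sketch: it means the ensemble's cross-validated risk is never worse than that of the best candidate, a finite-sample analogue of the asymptotic property stated in the abstract.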

About the article

Published Online: 2016-04-29

Published in Print: 2016-05-01


Citation Information: The International Journal of Biostatistics, ISSN (Online) 1557-4679, ISSN (Print) 2194-573X, DOI: https://doi.org/10.1515/ijb-2014-0060.

©2016 by De Gruyter.

Citing Articles

[1]
Santiago Esteban, Manuel Rodríguez Tablado, Francisco E. Peper, Yamila S. Mahumud, Ricardo I. Ricci, Karin S. Kopitowski, and Sergio A. Terrasa
Computer Methods and Programs in Biomedicine, 2017
[2]
Monique A. Ladds, Adam P. Thompson, Julianna-Piroska Kadar, David J Slip, David P Hocking, and Robert G Harcourt
Animal Biotelemetry, 2017, Volume 5, Number 1
