BY-NC-ND 4.0 license. Open Access. Published by De Gruyter Open Access, December 31, 2018

Predictive analytics of insurance claims using multivariate decision trees

Zhiyu Quan and Emiliano A. Valdez
From the journal Dependence Modeling


Because of their many advantages, decision trees have become an increasingly popular predictive tool for building classification and regression models. Their origins date back about five decades; the algorithm can be broadly described as repeatedly partitioning the regions of the explanatory variables, thereby creating a tree-based model for predicting the response. Innovations to the original methods, such as random forests and gradient boosting, have further improved the capabilities of decision trees as predictive models. In addition, extensions of decision trees to multivariate response variables have started to develop, and the purpose of this paper is to apply multivariate tree models to insurance claims data with correlated responses. This extension inherits several advantages of univariate decision tree models, such as being distribution-free, the ability to rank essential explanatory variables, and high predictive accuracy, to name a few. To illustrate the approach, we analyze a dataset drawn from the Wisconsin Local Government Property Insurance Fund (LGPIF), which offers multi-line insurance coverage of property, motor vehicle, and contractors' equipment. With multivariate tree models, we are able to capture the inherent relationship among the response variables, and we find that the marginal predictive model based on multivariate trees improves prediction accuracy over that based simply on univariate trees.
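As a rough illustration of the partitioning idea described above, the following minimal Python sketch searches for a single split on one explanatory variable when the response is a vector. It assumes a common multivariate regression tree criterion, namely the within-node sum of squared errors added over all response components; the paper's own splitting rule may differ. The function names `sse` and `best_split` and the toy data are hypothetical and serve only as an example.

```python
from typing import Sequence, Tuple


def sse(rows: Sequence[Sequence[float]]) -> float:
    """Sum of squared deviations from the column means, added over all responses."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / n
        total += sum((r[j] - mean_j) ** 2 for r in rows)
    return total


def best_split(x: Sequence[float], y: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (threshold, cost) for the split on x that minimizes the total SSE."""
    values = sorted(set(x))
    best = (float("nan"), float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        cost = sse(left) + sse(right)
        if cost < best[1]:
            best = (t, cost)
    return best


# Made-up toy data: one explanatory variable, two correlated responses.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5], [9.0, 20.0], [9.5, 21.0], [8.5, 19.0]]
threshold, cost = best_split(x, y)
print(threshold, cost)
```

A full tree would apply `best_split` recursively to each resulting region and across all explanatory variables; because the criterion pools the responses, a single set of splits is shared by every response component, which is how the relationship among responses enters the fit.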



Received: 2018-07-18
Accepted: 2018-12-05
Published Online: 2018-12-31
Published in Print: 2018-12-01

© by Zhiyu Quan, Emiliano A. Valdez, published by De Gruyter

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
