Predictive analytics of insurance claims using multivariate decision trees

Abstract Because of its many advantages, the use of decision trees has become an increasingly popular alternative predictive tool for building classification and regression models. Its origins date back for about five decades where the algorithm can be broadly described by repeatedly partitioning the regions of the explanatory variables and thereby creating a tree-based model for predicting the response. Innovations to the original methods, such as random forests and gradient boosting, have further improved the capabilities of using decision trees as a predictive model. In addition, the extension of using decision trees with multivariate response variables started to develop and it is the purpose of this paper to apply multivariate tree models to insurance claims data with correlated responses. This extension to multivariate response variables inherits several advantages of the univariate decision tree models such as distribution-free feature, ability to rank essential explanatory variables, and high predictive accuracy, to name a few. To illustrate the approach, we analyze a dataset drawn from the Wisconsin Local Government Property Insurance Fund (LGPIF)which offers multi-line insurance coverage of property, motor vehicle, and contractors’ equipments.With multivariate tree models, we are able to capture the inherent relationship among the response variables and we find that the marginal predictive model based on multivariate trees is an improvement in prediction accuracy from that based on simply the univariate trees.


Introduction
A decision tree model, with origins that date back to early 1960's, is a data mining algorithm that can broadly be described by repeatedly partitioning the regions of the explanatory variables and thereby creating a treebased model for predicting the response.[25] developed the very rst naive regression tree algorithm and called it the Automatic Interaction Detection (AID).For more details about the historical development of decision trees including alternative algorithms such as GUIDE and C4.5 algorithms, see [21].Today, the use of decision trees has become an increasingly popular alternative predictive tool for building classi cation and regression models.Considered a supervised learning technique, it has many advantages which are especially important for analyzing actuarial data.
First, a decision tree model is considered to be nonparametric and thereby does not require distribution assumptions.Unlike classical statistical methods, decision tree models do not require the input of any probability distributions about the response.Second, it is an e ective algorithm that can handle missing data.For many real datasets, the absence and the unrecording of some information is not uncommon.Third, apart from the ability to handle missing data, it can detect non-linear e ects and possible interactions among the explanatory variables.Traditional linear models typically capture only linear e ects, and detection for non- linearity as well as interactions requires further analysis.Fourth, it can be considered as a variable selection procedure by assessing the relative importance of the explanatory variables.Such variable selection is usually important in actuarial science for purposes such as risk classi cation and collection of risk variables.Finally, decision trees, especially with smaller-sized trees, are straightforward to interpret by a visualization of the tree structure in the plot.These advantages are particularly useful for actuarial and insurance data.See [34].
It should be pointed out that comparing prediction accuracy between traditional linear models and decision tree models (and their ensemble methods) may be unsuitable due to the manner in which rules and principles are applied.However, in practice, it is understandable to make such comparison in order to better evaluate the quality between models.Indeed, in the literature where applications are emphasized, there have been some papers that make direct comparison of prediction accuracy between traditional statistical models and machine learning algorithms.For example, see [22], [26], and [37].
For a simple illustration of how regression trees are constructed, consider the insurance claims data obtained from the R package MASS.Here we have 64 observed claims and we consider three potential explanatory variables: Age, District, and Group.In Figure 1, we show how the amount of claims are segmented by Age and District.Note that Group is omitted in this gure because it is not considered signi cant.Figure 2 shows the nal structure of the decision tree.This tree corresponds to the segmentation of Age and District as demonstrated in Figure 1 where it clearly makes the separation according to shade.This illustrates a simple diagram of the separation of nodes within a regression tree model.We can conclude from these gures that claims are generally signi cantly larger for ages 35 and above, while districts 3 and 4 have slightly lower claims than districts 1 and 2. For similar gures of how decision trees are constructed for classi cation, see [21] and [34].
The early methods of decision trees have the potential disadvantages of producing irregular patterns resulting in over tting and bias in variable selection.Innovations and extensions to the original methods, such as random forests and gradient boosting, further improved the capabilities of using decision trees as a predictive model.Random forest refers to the process of generating ensembles of trees with a set of unpruned fully-grown trees.These trees are generated based on a bootstrap sampling of the original data and using a subsample of the explanatory variables.[1] showed that the use of random forests led to signi cant improvements in prediction accuracy.
Boosting algorithms have increased in popularity in machine learning because they help to nd a good balance between bias and variance through the tuning parameters.For decision tree models, boosting algorithms build trees sequentially so that for each new iteration, a tree is grown using the residuals from previously grown trees.This procedure combines weak learners to produce strong learner.Early methods of boosting decision trees, as discussed in [10], used optimization based on gradient descent algortihms and this gave rise to the term gradient boosting.
Here we cite some interesting applications of decision tree models in actuarial science and insurance.Interestingly, for example, [27] provided an alternative look of the life table construction using tree-based models and concluded that tree-based methods have inherent characteristics that capture intrinsic data structure useful for identifying primary risk factors.In their lecture notes on data analytics for non-life insurance pricing, [39] used classi cation trees to determine whether a policy belongs to a male or female driver given some policy characteristics.[13] used the idea of gradient boosting (GB) to predict auto accident loss cost and concluded that this method provided more superior predictive accuracy than that of traditional Generalized Linear Models (GLMs).[19] introduced Delta Boosting (DB) as an alternative boosting algorithm and showed that this algorithm is optimal under a variety of loss functions.Using claims data on collision coverage for vehicle insurance from a Canadian insurer, the article also demonstrated that the DB algorithm outperforms the GB algorithm.[38] applied classi cation and regression trees to calculate reserves on individual claims data.[4] applied the Poisson regression tree and its boosting ensemble to examine the quality of mortality models in understanding di erent causes of death.
As evident from our previous discussion, applications of decision tree models have practically been based on a single-valued response variable.This paper extends the concept of decision trees in the case where we have a multiple-valued, or multivariate, response variable.In the literature in recent years, we have seen a large potential for actuarial and insurance applications where we encounter multivariate responses.To illustrate, here are some sources of dependencies that we often encounter in actuarial and insurance problems: (a) a single policyholder may have several insurance coverages, (b) a policyholder may own a bundle of insurance contracts such as homeowners and automobile, (c) a taxicab company may own an umbrella coverage for several automobiles, (d) a corporation may own several types of insurances for its employees such as workers compensation and health insurance, and (e) time and spatial dependencies are typical insurance data structures.
To demonstrate the potential bene ts of tting multivariate decision trees, we examine a multivariate response variable with six di erent components; each component is associated with one of the six types of property and casualty insurance coverages for local government units established by the Wisconsin Local Government Property Insurance Fund (LGPIF).Earlier works on this dataset have used the concept of parametric copulas to analyze the multivariate structure of the data.Our paper examines the bene ts of using multivariate decision tree models without having to specify probability distributions.Yet another advantage of using decision tree models is avoiding the use of a two-part frequency-severity model.When compared to univariate decision tree models, we nd that multivariate decision tree models have generally a better predictive accuracy.
The remainder of this paper has been organized as follows.In Section 2, we discuss the concept of univariate regressions and its extensions.In section 3, we describe the concept of decision trees when the response variable is multivariate.In Section 4, we describe the dataset used for our empirical investigation and provide some preliminary data exploration.Results of model calibration and model validation are presented in Section 5 and Section 6, respectively.Finally, we provide concluding remarks in Section 7.

Univariate decision trees and its extensions
In this section, we introduce the concept of the univariate regression tree and its extensions.Here we assume that we have a dataset consisting of a vector of p explanatory variables, denoted by x i = (x i , x i . . ., x ip ) and a response variable, y i , for each of N observations.This dataset is best represented as (x i , y i ) for i = , . . ., N. For our paper, we discuss the three of the most widely used univariate decision trees: CART (Classi cation and Regression Tree), random forests, and gradient boosted regression trees.

. CART (Classi cation and Regression Trees)
First introduced by [2], this method uses recursive partitioning to build decision trees applicable for predicting either a continuous response, in the case of regression, or a categorical response, in the case of classi cation.In our subsequent discussion, we focus on regression trees for predicting continuous response variable.We adopt notation from [15].In this algorithm, a regression tree, denoted by T(x, Θ), is produced by partitioning the space of the explanatory variables into M disjoint regions R , R , . . ., R M and then assigning a constant cm for each region Rm, for m = , , . . ., M. Given a regression tree, each observation can then be modeled based on the expression where Θ = {Rm , cm} M m= denotes the partition with the assigned constants.Under CART, the constants cm are determined by minimizing the sum of squares error (SSE) loss function: where is the predicted value of the response variable.It can be shown that the optimal value, ĉm, is the average of y i in the region Rm: where Nm is the number of observations in region Rm.
The regions in the regression tree are determined according to an algorithm called recursive binary splitting.The initial step in this algorithm is to nd one explanatory variable X •j which best divides the data into two subregions, for example, R (j, s) = {x i |X •j < s} and R (j, s) = {x i |X •j ≥ s} in the case of a continuous explanatory variable.This division is determined as the solution to argmin j,s i: (y i − ĉR (j,s) ) , for any j and s.
Subsequently, the algorithm looks for the next explanatory variable with the best division into two subregions and this process is applied recursively until reaching a minimum size of observations in the terminal region or some other prede ned threshold.The algorithm can handle other types of numerical explanatory variables, such as those with rank order, as well as categorical variables.Furthermore, regression trees can deal with missing values in the explanatory variables using 'surrogate splitting' which involves nding a surrogate variable that best approximates the original split.Observations with missing values are assigned according to the split on the surrogate variable rather than on the original splitting variable.For many instances, the result is a fully grown tree, T , with many terminal regions that may lead to over tting and unnecessary model complexity.This complexity can be controlled by using cost-complexity pruning to trim the fully grown tree T .From equations (2) and (3), de ne the loss in region Rm by For any subtree T ⊂ T , denote the number of terminal regions in this subtree by |T|.To control the number of terminal regions, we introduce the tuning parameter α ≥ to the loss function by de ning the new cost function as Clearly according to this cost function, the tuning parameter penalizes large number of regions.The idea then is to nd the subtrees Tα ⊂ T for each α, and choose the subtree that minimizes Cα(T).Furthermore, the tuning parameter α governs the tradeo between size of the tree and its goodness of t to the data similar to the regularization parameter in a penalized regression.Large values of α result in smaller trees and as the notation suggests, α = leads to the fully grown tree T .The estimation of this tuning parameter α is done using K-fold cross-validation.
Algorithm 1 summarizes the details of implementing the CART procedure using the R-package rpart.See [17] and [36].

Algorithm 1: CART R-package: rpart
Input: Training dataset x, y, K Output: Best subtree Tα 1 Grow a full tree T on a training dataset using recursive binary splitting.Use the stopping criterion minsplit which is the minimum number of observations in a region for a split to be attempted; 2 Prune the full tree T to subtrees Tα using cost-complexity pruning; 3 Divide the training dataset into K folds to determine the optimal tuning parameter α; 4 for k = , . . ., K do 5 Repeat steps 1 and 2 on all except for the k-th fold;

6
Compute the mean squared prediction error on the hold out k-th fold using Tα; 7 end 8 Average the results for each value of α and pick α that minimizes the average prediction error; 9 Return the best subtree Tα; . Random forests Random forest regression was rst developed by [1] and it refers to an ensemble of unpruned regression trees {T(x, Θ b ), b = , , ..., B} which are generated based on a bootstrap sampling from the original training dataset.As a result, we can think of {Θ b } as independent and identically distributed random vectors.With B as the total number of bootstrap samples, we de ne the model for the response variable as the average of all the regression trees in the random forest: By the Strong Law of Large Numbers, as B → ∞, we have Random forests produce a limiting value of the generalization error and this explains why they do not over t as more trees are added.
For each bootstrap sample, the regression trees are produced using a random subset of explanatory variables in order to decrease the similarity among the trees.The average prediction of multiple regression trees is expected to have a lower variance than that of individual regression trees.While larger random set of explanatory variables can improve the predictive capability of individual trees, it can also increase the similarity among the trees and could therefore void any gains from averaging multiple predictions.The size of the random subset of explanatory variables can be optimally chosen through cross validation; some use rule of thumbs such as the square root of the total number of explanatory variables, √ p, in the training dataset.
The bootstrap resampling of the data for training each tree also increases the variation between the trees.The accuracy of a random forest depends on the strength of the individual tree and a measure of the similarity between them.
For implementing random forests in R, see [20].The procedure to produce random forests is summarized in Algorithm 2. .

Gradient boosted regression trees
Developed by [9], gradient boosting algorithm involves building several regression trees sequentially.For each new iteration in the sequential process, a regression tree is grown using information from previously grown regression trees.In other words, each subsequent regression tree focuses on learning from the residuals obtained from previous trees.The result is a set of S regression trees Ts(x; Θs), for s = , . . ., S. The gradient boosted regression tree model is expressed as the sum of such trees: where Θs = {Rms , cms} Ms m= and fs(x|Θs) are the correponding models produced by the trees Ts(x; Θs).Here, S is also referred to as the number iterations in the process.For each step s, we nd the optimal Θs by solving the problem: We note that where cms = arg min Under the sum of squared errors (SSE) loss function, it simpli es to the regression tree that best predicts the current residuals y i − F s− (x i | Θ), and cms is the mean of these residuals in each corresponding region.
For other di erentiable loss functions, the solution to equation ( 4) can be obtained by numerical optimization via gradient boosting as described in [9].The regression trees Ts(x; Θs) produced at each step are analogous to the components of the negative gradient: for i = , . . ., N.
Therefore, solving equation ( 4) is equivalent to solving the following: Gradient boosting regression models can be implemented using the R package gbm.See [29] and [30].The process is summarized in in Algorithm 3. Update Fs( where shrinkage λ is used to reduce the impact of each additional tted base-learner, regression tree, Ts(x; Θs). 10 end

Extensions to multivariate decision trees
Decision trees discussed in the previous section are based on a single-valued response variable.In this section, we extend the concept of decision trees in the case where we have a multiple-valued, or multivariate, response variable.This extension has a large potential for actuarial and insurance applications where we commonly encounter multivariate responses.See, for example, [8] and [7].To x ideas, we assume to have a dataset of N observations with p explanatory variables and q response variables.That is, for i = , . . ., N, we have (x i , y i ) where x i = (x i , x i . . ., x ip ) and y i = (y i , y i . . ., y iq ).Multivariate decision trees are naturally extended from the univariate trees by expressing the loss function that is based on the multivariate nature of the responses in order to capture the possible association of the multiple responses.Here we discuss three possible extensions: multivariate regression trees (MRT), multivariate random forests, and multivariate tree boosting.The steps in Algorithms 1-3 are very similar to that of multivariate decision trees; the only di erence lies in the splitting criteria which is based on multivariate loss function.

. Multivariate regression trees
The idea of multivariate regression trees originated in ecology by [3] where the author analyzed relationships between multiple species and the environment.Similar to the univariate case, we produce regression tree by partitioning the space of the explanatory variables into M disjoint regions R , R , . . ., R M and then assigning constant cm k for each region Rm, m = , , . . ., M in the k-th response for k = , , . . ., q. De ne the vector of k constants as cm = (cm , cm , . . ., cm q ).Given the regression tree, the multivariate response variable for each observation is then modeled as where Θ = {Rm , cm} M m= denotes the partition with the assigned constants.In the multivariate regression trees (MRT) developed by [3], one method to nd the constants cm is to minimize a multivariate sum of squared error loss function as: where y i is the predicted value of the multivariate response based on equation (5).For the conventional sums of squared deviations, each terminal nodes can be represented by the multivariate mean of response variables, the number of observations at the terminal node, and the explanatory variable that de nes the segmentation.This is similar to the univariate case where in the multivariate extension, we can demonstrate that ĉm k is the average of the k-th response y ik in the region Rm.
Although [3] primarily focused on the sums of squared deviations, he discussed extensions of the concept of multivariate regression trees by using two other multivariate loss functions.One of these loss functions is based on the multivariate sums of absolute deviations about the median as de ned below: where ỹk is the median for the response variable k.This measure is generally more robust to outliers when compared to the conventional sums of squared deviations.The other extension is using a distance-based measure by considering the sums of squared dissimilarities within decision nodes.Therefore in this case, because we want regions to be as dissimilar as possible, the splitting criterion is to maximize the reduction of within-node sums of squared distances at each split.
As with univariate regression trees, this extension to multivariate regression trees also inherits some of its advantages such as improved predictions, especially when the data includes missing values, lack of balance, nonlinearity, and high order interactions.Furthermore, multivariate regression trees provide bene ts of grouping e ects on the response variables.In other words, we can consider this procedure to be characterized as a constrained clustering.There are a few ways that we can interpret this.First, we can visualize the contributions of the relative importance of each explanatory variable in the split of the tree model.Second, we can display the multivariate group means in the terminal nodes, as well as clustering information, through the use of tree biplots constructed with dimension reduction by principal components.This is further discussed in the section on data estimation.We use the R package mvpart for calibrating the multivariate regression trees to data.
In his survey paper, [21] provided a short discussion about similar work on regression trees for longitudinal and multiresponse variables.It is worth mentioning that the unbiased recursive partitioning framework for building trees proposed in [16] can be extended to multivariate regression trees.

. Multivariate random forests
Recall that in the univariate case, we describe the concept of random forest where we generate an ensemble of regression trees using bootstrapping resampling.This technique has the advantage of producing better predictions by avoiding over tting, an aspect of random forest regression that is often underappreciated.To similarly improve predictions with multiple responses, [31] extended the idea of multivariate random forests for applying regression trees when there may be anticipated dependencies among the several responses.First, in this technique, recursive binary splitting for partitioning the space of explanatory variables by minimizing a 'covariance' weighted loss function as de ned by where V = Cov(y) represents the covariance matrix of the multivariate response variable.The dependence structure is characterized by this covariance that can be described to capture di erent patterns (e.g., compound symmetry, unstructured, autoregressive, spatial power).Similar to univariate random forests, we generate an ensemble of such multivariate decision trees using bootstrap resampling.
In [40], multivariate random forests was applied in the area of genetics to examine and analyze "transcriptional regulation networks" that are said to be important for understanding the physiological processes in yeast under various stress conditions.In ecology, [31] applied multivariate random forests to identify various environmental characteristics and understand their e ects on the habitat of several spider species.In particular, the results revealed a "dynamic relationship between environment and species".To be able to compare the results between multivariate regression trees and multivariate random forests, [31] used the same dataset as that used by [3].The multivariate random forests did not produce reasonable results for our dataset and therefore, we did not present the results from this technique.However, this technique has advantages that may work for other datasets and therefore we provide this discussion.

. Multivariate tree boosting
In order to nd and interpret structure in datasets with multiple response variables and several explanatory variables (even possibly exceeding the sample size), [24] introduced the idea of extending the gradient boosted regression models to the multivariate case.This gives rise to the term multivariate tree boosting.In their abstract, [24] said that this extension "is a method for nonparametric regression that is useful for identifying important predictors, detecting predictors with nonlinear e ects and interactions without speci cation of such e ects" where the multivariate response variable represents several outcomes that are correlated.
The concept of multivariate tree boosting has similarities to the multivariate regression trees discussed in section 3.1.Regression trees are also constructed based on the multivariate response variable where each tree is modeled in terms of the optimal partitions of the space of explanatory variables into regions.The constants assigned in the regions are identi ed according to the solution of miminimizing the multivariate sum of squared error (SSE) loss function as de ned in equation (7).However, unlike the multivariate regression trees, the multivariate tree boosting constructs several trees, as in the univariate tree boosting, done sequentially where at each iteration, a regression tree is grown through learning and using information from previously grown trees.Learning is accomplished by an examination of the residuals; those with smaller residuals (observations with good predictions) are assigned smaller weights than those with higher residuals (observations with bad predictions) in the next iteration.During the learning process, splits are limited for each iteration without having to produce a fully grown tree which may result in over tting.In essence, the model performance is "boosted" by correcting bad predictions.Multivariate tree boosting can be implemented using the R package mvtboost that is based on [24].
In [24], multivariate tree boosting was applied in the area of psychology to understand the impact of demographic characteristics as well as physical and health states on the psychological well-being of aging adults.In the paper, psychological well-being is the multivariate response variable with six scales: "autonomy, environmental mastery, personal growth, positive relationships with others, purpose in life, and selfacceptance."Interestingly, [28] applied the concept of gradient multivariate tree boosting for longitudinal data that is based on understanding the health status of patients over time after lung transplantation.
To describe the algorithms for the multivariate extension, we use the multivariate response with the corresponding multivariate loss function in Algortihms 1, 2, and 3. To ease use of the R packages, we also summarize the hyperparameters that required tuning for the various decision tree methods in Table 6 in the appendix.

The LGPIF Data
For the empirical section of this paper, we use the dataset with information about the insurance coverage for buildings, vehicles, and equipments of local government units in Wisconsin.These units include, for example, cities, towns, villages, counties, school districts, re departments, and other miscellaneous entities in the state.Funds to cover property and casualty insurance coverages for these local government units are established by the Wisconsin Local Government Property Insurance Fund (LGPIF).This dataset was drawn from a project of the actuarial research group at the University of Wisconsin and additional details about this project can be found at the website.This dataset has been extensively studied and analyzed in [6], [7], and [32].We do not replicate the GLM approach used on this same dataset in earlier works.Instead, to analyze the multivariate structure of the data, these previous works used mainly parametric models; this paper emphasizes the bene ts of using tree-based models without having to specify probability distributions.For our purpose, we used observations for years 2006-2010 as training set and year 2011 as validation set.
In the LGPIF data, there are six types of insurance coverage: Building and Contents (BC), Contractor's Equipment (IM), Comprehensive New (PN), Comprehensive Old (PO), Collision New (CN), and Collision Old (CO).BC provides insurance for buildings and their contents, IM provides insurance for equipments mainly belonging to contractors, and the rests provide comprehensive and collision coverages for moving vehicles.For more description of these types of coverages, please refer to Table 3 of [6].For our purpose of tting tree-based models, we examine a multivariate response variable with six di erent components, with each component associated to each of the six types of insurance coverages.We describe these six variables in Table 1.We note that these variables are transformed with a logarithmic scale where we added one to each average claim size, per year, to accommodate the zero claims.Zero claims indicate either there was no claims made or simply no coverage provided for the year.From hereon, we will simply describe these as the logarithm of the claim size.; for all other types of coverage, the proportion of positive claims ranges between 4% and 6%.The means of the logarithm of the claim sizes for all types are in the range of 7.5 to 9.0, or in terms of the original dollar scale, this range is between 1800 and 6500, with BC giving the largest mean claim size.The table also indicates that the largest claim size comes from the BC type of coverage.
Figure 3 provides density plots of the logarithm of the claim sizes where we excluded the zeroes and, for ease of comparison, we used the same scale for all types of coverage.The top portion gives the frequency by count while the bottom portion gives the frequency by proportion.We can deduce some interesting observations from these plots.First, as also indicated in Table 2, the BC coverage clearly shows most frequent positive claims and the top portion of the gure shows that this is also true for all ranges of claim sizes.Second, as shown in both the top and bottom portion, the BC coverage has the largest variability among all types and also the most positively skewed distribution.Finally, we also observe a variety of distribution shapes for the di erent types of coverage.If we consider the large point mass at zero, this presents a challenging task of specifying marginal parametric distributions for these claims.This is one reason we use distribution-free tree-based models to t these claim sizes.We also provide Figure 4 which also shows how frequent claims are for the BC coverage because of the dominance of the shade for this coverage.This gure is a stacked density graph which includes the zero claim sizes.However, at zero claim size, BC does not dominate; as shown in Table 2, BC has the smallest proportion of zeroes.
Our primary interest is to build tree-based models that account for the dependence among the components of a multivariate response variable.Hence, we also need to examine evidence of the presence of correlation between the components.In this case, we present Figure 5 to show the strength of dependence between the components, with a darker shade and larger circle indicating stronger correlation.To illustrate, we nd that coverage PN and PO has the largest correlation of 0.51.Interestingly we also note that all pairs of coverage have positive correlations.
In the Appendix, we de ne the explanatory variables we consider in constructing the regression trees.In general, the continuous explanatory variables are logarithms of the coverage amount for each type and notice that for coverage type BC and IM, we have deductible amounts.No deductible amounts are available for the other types.Similarly, for the categorical explanatory variables, we have indicators for entity type (e.g., city, town, school district) as well as indicators for "no claim credit" for each type of coverage.The latter indicators provide information about the presence or absence of claims in the prior year.See Appendix for more details including statistics.
As done in the two-part model of [7], we used log-transformed coverage amounts as explanatory variables.In other approaches especially for the severity component, coverage amounts may be alternatively used as exposures in the model.However, treating them as explanatory variables is the correct approach for us since we do not separate the frequency and severity components.The zero part of the claim will lose coverage information when treated as exposure.

Model calibration
In this section, we calibrate and build regression tree-based models to the LGPIF data described in Section 4. For the purpose of this section, we use the data for years 2006 to 2010 as our training sample.A cruical part of successfully training a tree-based model is to control model complexity in order to maintain a good balance between bias and variance that leads to high prediction accuracy.More speci cally, a simple treebased model may cause under tting (low variance) with high bias; on the other hand, a complex tree-based model may cause over tting (low bias) with high variance.See [15].To address these issues, for training treebased models, we use k-fold cross-validation on the training sample in tuning the parameters.In applying cross-validation to nd the optimal set of tuning hyperparameters, the grid search is the most commonly  used algorithm that employs an exhaustive search process on speci ed subset of hyperparameter space.We choose the nal model with the lowest cross-validation prediction error.
R codes can be readily provided upon request and at a future date, these codes will be posted on our research grant website.

. Fitting univariate decision tree models . . The univariate regression trees based on CART
In tting the CART procedure to our training dataset, without loss of generality, we only consider the component BC.We rst grow a full tree using recursive binary splitting starting with a minimum number of ve observations in a region for a split to be attempted.The stopping criterion of a minimum number of ve observations is the result of tuning this parameter; the default for the rpart package is 20.We next prune this full grown tree using cost-complexity pruning with 10-fold cross validation to determine the optimal number of splits, which in our case is 8.The result of the pruning process can be visualized in Figure 6 that shows the cost-complexity parameter, or equivalently the number of splits, in relation to the cross-validation relative error.This cross validation relative error is simply the percentage change in the mean squared prediction error.By choosing the cost-complexity parameter that corresponds to the smallest cross validation relative error, the gure shows that the optimal number of splits is 14.However, it is recommended to adjust the choice of this optimal number of splits by considering one standard error above this minimum relative error as indicated with a horizontal line.This adjustment is necessary to avoid over tting.Hence, this gives parsimonious model to choose 8 as the optimal number of splits.See [2].The red dot corresponds to the smallest cross validation relative error while the orange dot corresponds to the one standard error above this cross validation relative error.In Figure 6, the green dots refer to the training relative error with each successive split reducing the mean squared prediction error.

Cost−complexity parameter
After accounting for the optimal cost-complexity, we produce the optimal univariate regression tree in Figure 7 using [23].The nodes at the bottom provide the information on prediction values.The gure shows the set of explanatory variables and split points used to form the nodes.The explanatory variable with the most signi cant e ect is selected for the rst split and each following split is conditioned on all previously selected explanatory variables.For example, the rst split uses CoverageBC which has the most signi cant e ect and divides into two child nodes.Hence, the decision tree model can capture interactions between explanatory variables.

. . Random forest regression
The recursive partitioning algorithm used in constructing decision trees often lead to the problems of over tting.It is well established that such over tting can lead to unreliable prediction of future or new observations.In other words, decision tree models can drastically change from sample to sample, which leads to low prediction accuracy.A remedy for over tting is the use of random forests.Because of the multiplicy of regression trees necessary in random forests, we focus our discussion on two important aspect of tting random forest regression.
An important phase is determining the number of trees to generate that will eventually be used for prediction.Figure 8 shows the relationship between the number of trees and the out-of-bag (OOB) error rate.The performance of the estimation is measured based on the out-of-bag samples that were not used during the learning stage.The prediction error computed from this estimation is called the OOB error.As to be expected, the fewer the trees, the larger this OOB error.However, due to large sample theory, the OOB error rate will eventually level o and deciding on the number of trees for building the random forests is based on this leveling o , as demonstrated in Figure 8.In deciding on the number of trees, we need to address the balance of using many trees in order to get more stable prediction and using fewer trees in order to achieve e ciency.Based on the gure, we decid that the optimal number of trees to use is 200.Another important phase is the measurement of the variable importance produced by the random forests using the selected optimal number of trees.In this case, there are two popular variable important measures used.The rst one incorporates a weighted mean of the improvement of the individual trees based on the splitting criterion produced by each explanatory variable; [9].The "IncNodePurity" measures the degree of impurities of a split at each node of a tree based on the loss function.This is accomplished by adding up all the decreases in the loss function for each explanatory variable over all the trees in the random forests.A higher value of the "IncNodePurity" represents a higher variable importance.Figure 9 shows CoverageBC has highest IncNodePurity with 28414 and followed by lnDeductBC with 4623.
The second commonly used measurement is the permutation accuracy importance."%IncMSE" calculates the deterioration of the prediction accuracy of the random forests when permuting the values of each explanatory variable of the test set in order to break the association with the response variable.For random forest regression, it is the average increase in squared residuals of the test set when the explanatory variable is permuted.A higher %IncMSE value represents a higher variable importance.Figure 9 shows CoverageBC has highest %IncMSE with 7.95%, followed by TypeSchool with 1.26% and lnDeductBC with 1.21%.
It is typically understood that %IncMSE provides the more accurate measure for variable selection because the IncNodePurity may have a bias in the variable selection inherited by the regression tree.To illustrate such a bias, potential explanatory variables may vary in their scale of measurement or their number of categories.This explains the variable importance may vary signi cantly between the two measures for some datasets, however, our dataset does not present this issue.Random forests variable importance measures may still be a sensible means for variable selection in many applications, but may not be too reliable in other situations.See [33].

. . Gradient boosting
Both the random forests and the gradient boosting are major improvements to the CART algorithm in terms of prediction accuracy.While both create ensemble of trees from weak learners, gradient boosting is performed iteratively to develop a strong learner.For details of the procedure for gradient boosting, please see section 2.3.In tting gradient boosted trees to our dataset, we mainly used the gbm R package.The procedure includs tuning for the number of trees to grow and the size of the weak learner tree.Based on our dataset, we use 10-fold cross validation to determine these tuning parameters and we nd that the optimal number of trees is 1000, the interaction depth is 5, the minimum number of observations to terminate the splitting process is 5, and the shrinkage parameter is 0.005.The result of creating gradient boosted regression trees is best summarized in terms of the relative in uence of the explanatory variables.In this case, the relative in uence is calculated by the number of times an explanatory variable is selected for splitting, weighted by the improvement in the sum of squared errors to the model as a result of each split, and then, averaged over all the trees.See [11].The relative in uence of each variable is further standardized so that the sum adds up to 100%, with the larger value indicating a stronger in uence on the response variable.Figure 10 displays the relative in uence of the explanatory variables with the highest impact in predicting yAvgBC.According to this gure, it is not surprising to nd that CoverageBC is the most dominant explanatory variable with 60.28%; this is followed by lnDeductBC with 13.56%.

. Fitting multivariate decision tree models . . Multivariate regression trees
In this and the subsequent subsection, we discuss the results of tting multivariate tree-based models.Similar to the univariate case, we select the complexity parameter to determine the optimal size of the tree and we can accomplish this by examining Figure 11, which displays the relationship between the complexity parameter and the cross-validation relative error of the prediction.It can be inferred from this gure to select 0.0064 as the cost-complexity parameter.See Figure 6 for additional details.
After the cross-validation, we produce the nal multivariate regression trees using our training dataset and this is displayed in Figure 12.This is the smallest tree within one standard error of the minimum relative error.This tree has eight terminal nodes with an estimated predictive error that explains (1-0.739)*100% of total variance.During each recursive binary splitting, it also shows the cyclical shadings across the bar plots that indicate the six response variables di erentiated from left to right.At each terminal node, the height of each bar gives the mean of the respective response variables and n indicates the number of observations within that node.Recall that the split is determined according to the multivariate squared error loss function that captures the dependence structure of the multivariate response.Colored circles help to identify each node.The length of the branch is strictly proportional to the reduced amount of the variance (SSE) by the explanatory variable used in the split.The length of each branch is proportional to the percentage of explained variance and the splits are in descending order of importance.Note from Figure 11 that TypeCounty variable explains most of the variance; in particular, immediately following the rst split, the nal tree explains (1-0.841)% of the variance, while the rest of the splits together account for (0.841-0.739)% of the variance.This is in contrast from Figure 7 where CoverageBC was chosen as the rst split.

Cost−complexity parameter
It is di cult to compare the di erences of the level of claims for each component of the response vector: BC, IM, PN, PO, CN, and CO.This is because the mean response is multivariate for each terminal node.However, multivariate regression trees can be viewed as constrained clustering.The terminal nodes are similar to clusters with respect to a measure of dissimilarity, e.g., squared error loss function, with each cluster de ned by the region {R} M m= .See [14].Still this makes it di cult to visualize but we reduced the dimensionality using the rst two principal components in a Principal Component Analysis (PCA); the details of this dimension reduction are not necessary for our purpose.We refer the reader to [18].
In Figure 13, we show the biplots representing the tree structure in Figure 12.See [12] and [35].In the tree biplot, the large colored circles, consistent with the terminal nodes, represent the response vector means in the terminal nodes.The small colored points, again consistent with each node, represent PCA-projected individual observations.The label for each response variable in the gure is indeed the corresponding weighted mean calculated from the means of all the terminal nodes.From the gure, we deduce that the overall average claim for BC is very far apart from those of the lines of coverage.We can also see that the overall average claims for CN, CO, PN, and PO are relatively close; these lines of coverage all refer to moving vehicles.In addition, this helps explain the relatively weak dependence between BC and other lines; see also [7].

. . Multivariate tree boosting
In this subsection, we improve the multivariate regression trees with gradient boosting where we sequentially grow trees by learning the information from previously grown trees.The rst important step is to determine the optimal size of the tree as well as other tuning parameters.Figure 14 displays the mean squared error in  From this gure, we conclude that the optimal size of the tree should be 1495.We note that this gure was based on a shrinkage of 0.005; this shrinkage parameter was determined on a trial-and-error basis through several iterations.Similar to the univariate gradient tree boosting, it is important to determine the relative in uence of each explanatory variable because this reveals their degrees of variable importance.Excluding some irrelevant explanatory variables from the models is favorable for prediction accuracy.To understand the variable importance of all the explanatory variables with respect to all response variables, we display this in the form of a heat map Figure 15 which lists the explanatory variables on the x-axis and the six response variables on the y-axis.To illustrate, consider the BC coverage where we nd that these three explanatory variables have the following relative variable importance: CoverageBC with 59.2%, CoverageIM with 19.0%, and lnD-eductBC with 10.7%.Furthermore, it is apparent from the gure that the variable TypeCounty is important across the di erent coverage.We recall that the explanatroy variable TypeCounty is used at the rst split in the MRT, see the gure 12.For each coverage in the gure, we can examine the relative variable importance horizontally.In the gure, the numbers displayed, as well as the darkness of the shade, provide a sense of the degree of variable importance.The variable importance here account for the dependence structure between the variables.
Another way to explain the e ect of each explanatory variable to the dependence structure is to measure the contribution of each variable to the changes in the covariance.[24] de ned the covariance discrepancy, D, to measure the di erence between sample covariance matrices, Σ, of multiple response variables at each gradient descent step.At each gradient descent step b, the explanatory variable chosen by the algorithm, explains the covariance discrepancy, D b,k , after the k-th response variable tted where Summing overall the covariance discrepancies at each step measures the contribution to the covariance explained by each speci c explanatory variable.Figure 16 shows a heatmap of the covariance discrepancies for any pair of response variables (y-axis) as explained by each explanatory variable (x-axis).We can deduce from the gure that TypeCounty, CoverageCN, and CoverageCO explain covariance discrepancies across a wide range of pairwise response variables.In essence, these explanatory variables are useful for detecting the presence of dependency for the multivariate response.On the other hand, coverageBC explains most of the covariance discrepancies only for pairs that contain BC, and coverageIM explains most of the covariance discrepancies only for pairs that contain BC or IM.
Additional information that provides the e ect of each explanatory variable for each response variable is provided in the appendix.

Model validation and comparison
This section provides details about the result of comparing the performance of our ve di erent models.For this purpose, we use the calibrated models discussed in the previous sections to make predictions based on the validation (or test) dataset.In addition, we compare the various models only for BC; we extracted the prediction for yAvgBC from the marginals for multivariate tree-based models.This is done in order not to overwhelm the reader, and at the same time, BC coverage has the most number of observations.Similar comparisons can be made for other coverages.For summary statistics for this test dataset, we refer the reader to the appendix.
There are several prediction accuracy measures but there is no unique perfect measure that can be used to judge prediction accuracy under all circumstances.Each measure has its own focus, which also leads to its shortcomings.To make a fair comparison between di erent models, we utilize a few popular measures.These measures are: coe cient of determination R , Gini index, mean error (ME), mean percentage error (MPE), mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).These validation measures are each de ned in the appendix.
The results of comparing the performance measures of the ve tree-based models we calibrated are summarized in Table 3.The comparative values from these tables are straightforward to interpret, however, it is better to make model comparisons based on these values graphically.Figure 17 provides a heatmap comparing the performance of the various models according to the validation measures.For ease of comparison, this heatmap has been organized by rescaling all the measures so that for each measure, a value of 100 is the best and a value of 0 is the worst.For R and the Gini index, we know that the higher the better; for these measures, we nd the highest value in each column and scale it to 100.For all the other measures, since the smaller the better, we multiplicatively invert (take the reciprocal) the original value and then apply the rescale.The gure   is also color coded so that dark blue represents the best and the dark red represents the worst.Anything that performs in between is relatively measured according to their degree of closeness to each of these colors.Overall, we can say that multivariate tree-based models generally outperform univariate tree-based models.From gure 17, although multivariate tree boosting is not always the best predictive model, it is clear that, overall, multivariate tree boosting generally outperforms all the other predictive models.Gradient boosting and random forests, on the other hand, provide values of validation measures that are fairly close to those from multivariate tree boosting.Multivariate tree boosting has the unique additional feature that captures the dependency structure of the response variables.This helps the prediction accuracy only slightly better because we do not have very strong presence of dependencies of BC with the other coverages.Not surprisingly, both the univariate and the multivariate regression trees underperformed in comparison to the ensemble models.This is because as already pointed out, ensemble models, such as gradient boosting and random forsts, make weak learners into strong learners.

Concluding remarks
In this paper, we explore decision trees and its extensions as predictive models for insurance claims.Decsion tree, or tree-based, models have increased in popularity in various disciplines (e.g., psychology, ecology, and biology) and in recent years because of its advantages as an alternative predictive tool.In particular, we extend the usefulness of tree-based predictive models to the case where we have a multivariate response vector.
For the empirical part of our investigation, we analyze the LGPIF data that contains claims information about insurance coverage for properties owned by local government units in Wisconsin.We compare the predictive performance of various univariate tree-based models (regression trees, random forests, boosting) against multivariate tree-based models (multivariate regression trees and multivariate tree boosting).Broadly speaking, the multivariate tree-based models generally outperform the univariate tree-based models using a set of di erent validation measures.In particular, multivariate tree boosting provides the best model according to several validation measures.These results explain the importance of building predictive models that account for dependencies.In actuarial science, insurance, and nance, we have witnessed in recent years the importance of capturing the dependency structure of a multivariate response when developing tools for prediction.
For future work, we plan to explore the use of multivariate decision trees where the loss function is di erent from the sum of squares; an even more interesting extension is the use of a multivariate loss function tailored for a zero-in ated claim structure.Another interesting theoretical and practical problem is related to dealing with zero claims in decision trees.

Figure 1 :Figure 2 :
Figure 1: Insurance claims: segmentation of the explanatory variables Insurance claims

Figure 3 :Figure 4 : 1 −
Figure 3: Density plots of the logarithm of the positive claim size by type of coverage, 2006-2010 (training dataset)

Figure 5 :
Figure 5: Correlations between components of the multivariate response variable

Figure 6 :
Figure 6: Choosing the optimal number of splits

Figure 7 :
Figure 7: The optimal univariate regression tree

Figure 8 :Figure 9 :
Figure 8: Random forests: the number of trees vs the out-of-bag (OOB) error rate

Figure 10 :
Figure 10: Relative influence of the explanatory variables in gradient boosting

Figure 11 :
Figure 11: Choosing the optimal number of splits for the multivariate regression trees

Figure 13 :Figure 14 :
Figure 13: Multivariate regression tree biplots using PCA to reduce into two dimensions

Figure 15 :
Figure 15: A heatmap of variable importance using multivariate tree boosting

Figure 16 :
Figure 16: A heatmap of covariance discrepancies for pairs of response variables

2 :
Summary statistics of the variables in the 2011 validation dataset.

Figure 18 :Figure 19 :
Figure 18: Partial dependence plots of yAvgBC with explanatory variables that are directly associated with the BC coverage

Table 1 :
Description of the six components of the multivariate response variable.Table 2 provides summary statistics for the logarithm of the claim sizes for the training dataset, which consists of observations for the period years 2006-2010.The proportion of positive claims for BC is about 30%, the highest among all types of coverage

Table 2 :
Summary statistics of the six components of the multivariate response variable, 2006-2010 (training dataset).

Table 3 :
Comparison of model validation measures.