Published by De Gruyter, August 10, 2022

Robust statistical boosting with quantile-based adaptive loss functions

  • Jan Speller, Christian Staerk and Andreas Mayr

Abstract

We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter has to be specified that controls the robustness. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of residuals, based on a fixed quantile level. We compared the performance of our approach to classical M-regression, boosting with standard loss functions or the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to a better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a similar performance to boosting with the efficient L 2 loss or the lasso. Also in the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI-60 cell line panel), boosting with the new adaptive loss functions performed favourably compared to standard loss functions or competing robust approaches regarding prediction accuracy and resulted in very sparse models.

1 Introduction

Statistical tools for data analysis and regression are increasingly confronted with large and complex data situations [1]. The availability of data with a rising number of measured variables and observations, as well as the generally increasing number of data sources, provides many new research possibilities but also challenges. This is particularly true in (bio-)medical research, where, for example, genome-wide association studies and large cohort data lead to advances in genetic epidemiology. Such observational data often contain outliers or might be corrupted in some way, which is in general not easy to detect.

High-dimensional data situations, where the number of covariates exceeds the number of observations (p > n or even p ≫ n), amplify this difficulty of data analysis in the context of regression. Appropriate tools for these situations have to address the two methodological issues of robustness against potentially corrupted data (e.g., via robust loss functions) and of handling the high-dimensionality of the data (e.g., via regularization and variable selection).

Classical least squares regression, for example, is highly sensitive to outliers and even single observations can have a major impact on final model estimates [2, 3]. Many robust regression approaches have been proposed for low-dimensional settings with a relatively small number of covariates (p < n), including M-estimation, S-estimation and MM-estimation [4]. However, these classical inference approaches tend to become unfeasible for high-dimensional settings (p > n), where additionally some form of data-driven variable selection or regularization needs to be imposed.

In the general context of high-dimensional statistical modelling, pre-selection or screening approaches are often used to reduce the dimensionality of the data before the analysis (for an overview see [5]). Another way to address large numbers of explanatory variables is to employ a regularization approach, such as ridge regression (e.g., [6]) or lasso regression (e.g., [7]). The L 1 penalization of the coefficient estimates via the lasso is particularly beneficial in this context, as it incorporates variable selection – leading also to a better interpretability of the sparse model.

Another computational concept to fit such sparse and interpretable models is statistical boosting [8], which also provides the basis for our approach. Statistical boosting builds on a concept from machine learning and iteratively updates the estimated regression coefficients using component-wise gradient descent (for a non-technical introduction of boosting see [9]). An empirical risk function is minimized and the algorithm is stopped after finitely many iterations. The main tuning parameter for statistical boosting is the stopping iteration, which is typically specified via resampling to optimize the predictive performance [10]. Early stopping of the algorithm leads to automatic variable selection and shrinkage of effect estimates, avoiding overfitting and identifying small subsets of covariates (e.g., genetic variants) for closer examination. The concept of statistical boosting has been extended towards different types of statistical models, including generalized additive models [8, 11], linear mixed models [12], geoadditive regression [13], distributional regression [14, 15], categorical regression [16] and quantile regression [17, 18].

Lately, there has also been much research activity on developing, enhancing and improving modern robust statistical approaches for high-dimensional linear regression (see [19] for a recent overview). In general, the term "robust" has multiple meanings, since it depends strongly on the context and on which parts of the data analysis should be considered "robust". For example, special focus has been given to heterogeneity arising from heavy-tailed error distributions in regularized expectile regression [20] and in statistical learning problems [21]. Robust versions of screening approaches (e.g., [22, 23]) and of penalized approaches like ridge regression [24] and the lasso [25, 26] have been proposed; furthermore, new approaches for M-estimators motivated by wavelet thresholding [27] have also been considered. Lutz et al. [28] addressed high leverage points in the explanatory variables (also known as horizontal outliers) with several ways of using robustified (linear) base-learners in component-wise L2 boosting. Additionally, their robustification for the outcome relies on an auxiliary scale estimator that links the current residual sizes to the robustness parameter of the Huber loss. Ju and Salibián-Barrera [29] very recently proposed a two-step approach for robust linear regression, first applying L1 regression trees as an initialization step for the non-parametric and robust estimation of the residual scale (S-type boosting), followed by gradient boosting (M-type boosting).

In this work we take advantage of the modular structure of the boosting algorithm and incorporate different robust loss functions to deal with outlier-corrupted data in the outcome (vertical outliers). For continuous outcome variables we propose a combination of robust linear regression using composite loss functions and a data-adapted variant of statistical gradient boosting, combining their advantages of robustness and sparsity while preserving the interpretability of classical robust regression models. In contrast to the common L2 and L1 losses, several robust loss functions such as the Huber and Bisquare loss are composed of two parts which are separated based on a threshold parameter k: all arguments with absolute value smaller than or equal to k are evaluated by the "inner" part, while absolute values larger than k fall into the "outer" part of the composite loss. We adjust these loss functions by a quantile-based approach, which adapts the threshold parameter in each iteration of the boosting algorithm to the current fit. In contrast to previous robust boosting approaches [28, 29], the proposed quantile-based approach does not require the estimation of an auxiliary scale parameter for the composite loss functions during the boosting process or in an additional initialisation step. Instead, our approach is based on the choice of a fixed quantile level regarding the separation of observations into a share evaluated by the inner (non-robust) part and a share evaluated by the outer (robust) part of the composite loss functions. This results in an adaptive robustness threshold $k^{[m]}$ based on the sizes of the current residuals after each boosting iteration m.

The remainder of this article is structured as follows: after giving a short overview of robust regression and boosting in Sections 2.1 and 2.2, the new approach with quantile-based adaptive loss functions is presented in Section 2.3. Details of the implementation are briefly discussed in Section 2.4. In Section 3, our approach is illustrated and evaluated in a simulation study, considering different types and amounts of corruption (non-systematic and systematic) in low- and high-dimensional settings. The analysis of skewed antibody array data (NCI-60 panel) with n = 59 cancer cell lines is presented in Section 4, where for p = 14,951 gene expression predictors the robust quantile-based adaptive loss functions were applied within the boosting framework to model the KRT19 protein expression outcome. A comparison to the lasso and a robust version of the lasso is provided. The corresponding R code implementing our approach, as well as the code to reproduce our analyses, is available on GitHub (https://github.com/JSpBonn/Robust-losses-Statistical-Boosting).

2 Methods

For n observations $(x_1, y_1), \ldots, (x_n, y_n)$ of p explanatory variables with $x_i \in \mathbb{R}^p$ and a continuous outcome $y_i \in \mathbb{R}$, we consider a linear regression model

(1) $y_i = \beta_0 + x_i^\top \beta + \varepsilon_i, \quad i = 1, \ldots, n,$

with intercept $\beta_0$, regression coefficient vector $\beta = (\beta_1, \ldots, \beta_p)^\top$ and independent errors $\varepsilon_i$.

The estimation of $(\beta_0, \beta)$ can be facilitated by the choice of an appropriate loss function $\rho$, leading to the empirical risk function $R = n^{-1} \sum_{i=1}^{n} \rho\left(y_i - f(x_i)\right)$ with linear predictor $f(x_i) = \beta_0 + x_i^\top \beta$, which has to be minimized over $(\beta_0, \beta) \in \mathbb{R}^{p+1}$. For the special case of the squared error loss $\rho\left(y_i - f(x_i)\right) = \left(y_i - f(x_i)\right)^2$ one obtains the ordinary least squares estimator (LSE) for classical mean regression. The LSE is highly sensitive to outliers, which led to the development of alternative loss functions for more robust regression estimators.

2.1 Robust Regression

We focus on methods for robust regression in data situations with potentially (highly) corrupted values $y_i$ of the outcome variable and uncorrupted values $x_i$ of the explanatory variables. To reduce the influence of potential outliers on model fitting, robust loss functions $\rho$ have been proposed (cf. [27] and the references therein). A standard loss function that leads to robust models is the L1 loss ρ(z) = |z|, corresponding to median regression and yielding the maximum likelihood estimator under Laplacian-distributed errors. Compared to the L2 loss, the L1 loss can result in inefficient estimators in uncorrupted data situations [3]. Additionally, the L1 loss function cannot be adapted to different amounts, types or magnitudes of corruption in the data.

In order to address these issues, composite loss functions consist of two segments, an "inner" and an "outer" part separated by a robustness threshold k ≥ 0, where the outer part is used for response observations that deviate strongly from the model fit (large residuals). An important example is the so-called Huber loss [30], which forms a smooth and convex mixture of the L1 and L2 losses, given by

(2) $\rho_H(z) = \begin{cases} z^2/2, & |z| \leq k \\ k\left(|z| - k/2\right), & |z| > k. \end{cases}$

The choice of the robustness threshold k is essential, as it specifies which observations are evaluated by the outer, robust L1 part or the inner, non-robust L2 part, respectively. It is important to note that for composite loss functions like the Huber loss, the robustness threshold k depends on the scale of the residuals $z_i = y_i - f(x_i)$. Typically, robust approaches like the [normalized] median absolute deviation (MAD[N]) are used for estimating the standard deviation $\hat{\sigma}$ of the residuals $z_i$ to standardize the argument of the loss function [3]. This leads to the following optimization problem:

(3) $\min_{\beta_0, \beta} \; n^{-1} \sum_{i=1}^{n} \rho\left(\frac{y_i - f(x_i)}{\hat{\sigma}}\right)$

The solution of (3) is a so-called "M-estimator", a generalization of the maximum likelihood estimator [3].

The typically used robustness threshold k = 1.345 for the Huber loss is derived theoretically to achieve 95% asymptotic efficiency compared to the classical estimator based on the L2 loss for the case of non-corrupted data with Gaussian errors [30]. In addition to the idealized Gaussian error, Huber [30] also considered the weaker assumption of a composition of a Gaussian center and a corrupted outer part of the distribution, where the underlying cumulative distribution function (cdf) of the standardized residuals is assumed to be F = (1 − π)Φ + πH, with Φ representing the standard Gaussian cdf and H an unspecified symmetric cdf. The proportion π ∈ [0, 1] reflects the amount of corruption in the distribution. Under these model assumptions, Huber [30] showed that the maximal asymptotic variance among all translation-invariant estimators is minimized by the Huber loss, when the robustness threshold k is connected to the amount of corruption π via

(4) $(1 - \pi)^{-1} = 1 - 2\Phi(-k) + 2\varphi(k)/k$, where $\varphi$ denotes the density of the standard Gaussian distribution.

In this particular case, there exists a unique asymptotically "least informative" or "least favorable" distribution F with Gaussian center and double exponential tails. Equation (4) implies that the smaller the robustness threshold k is chosen for the Huber loss, the larger the implicitly assumed corrupted part πH of the distribution F. Huber recommended a robustness threshold value of k ∈ (1, 2) for standardized residuals, as this choice is not very sensitive to the underlying model [30].

The typically used robustness threshold k = 1.345, corresponding to an asymptotic efficiency of 95% under Gaussian errors, can be translated via Eq. (4) to a proportion π = 5.8% of corrupted outcome data. The lower boundary k = 1 corresponds to π = 14.3% corrupted response observations and an asymptotic efficiency of around 90.3% for Gaussian errors, the upper boundary k = 2 corresponds to around π = 0.8% corrupted data with 99% efficiency for Gaussian errors [30]. The strength of the Huber loss is its flexibility to induce robust and efficient estimates in different data situations. The classical L 1 loss, for comparison, leads to an efficiency of only 63.7% under Gaussian errors (cf. [3], p.21).
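To make this relation concrete, the following short R sketch (our own illustration, not part of the original analysis code; the helper name corruption_from_k is hypothetical) evaluates Eq. (4) for the thresholds discussed above and returns the implied corruption proportion π:

## Relation (4) between the Huber threshold k and the implied corruption pi,
## assuming (1 - pi)^(-1) = 1 - 2 * Phi(-k) + 2 * phi(k) / k under the Gaussian model
corruption_from_k <- function(k) {
  1 - 1 / (1 - 2 * pnorm(-k) + 2 * dnorm(k) / k)
}
round(corruption_from_k(c(1, 1.345, 2)), 3)
## 0.143 0.058 0.008, i.e. approximately 14.3%, 5.8% and 0.8% corruption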

An alternative composite loss function for robust regression is Tukey’s Bisquare loss [27], which is a non-convex and smooth loss function defined by

(5) $\rho_B(z) = \begin{cases} 1 - \left(1 - (z/k)^2\right)^3, & |z| \leq k \\ 1, & |z| > k. \end{cases}$

The gradient of $\rho_B$ vanishes in the outer part of the distribution of z (leading to a so-called redescending estimator). This highlights a clear distinction between non-corrupted observations and potential outliers. Taking scaling into account, the 95% asymptotic efficiency under Gaussian-distributed errors is reached for the Bisquare loss with the robustness threshold k = 4.685 [31]. A larger robustness threshold k implicitly leads to fewer observations being considered as outliers, while a smaller robustness threshold k leads to a larger number of observations being weighted by the outer part of the loss function. Figure 1 illustrates the composite Huber and Bisquare loss functions in comparison to the standard L1 and L2 losses.

Figure 1: The L1, L2, Huber and Bisquare loss functions. The composite robust loss functions Huber and Bisquare are depicted for the robustness threshold k = 1.
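As a minimal illustration (our own sketch; the function names huber_loss and bisquare_loss are chosen here for illustration and are not part of mboost), the composite losses in Eqs. (2) and (5) can be coded directly in R and used to reproduce the shape of the curves in Figure 1:

## Huber loss, Eq. (2): quadratic inner part, linear outer part
huber_loss <- function(z, k = 1) {
  ifelse(abs(z) <= k, z^2 / 2, k * (abs(z) - k / 2))
}
## Bisquare loss, Eq. (5): bounded outer part with constant value 1
bisquare_loss <- function(z, k = 1) {
  ifelse(abs(z) <= k, 1 - (1 - (z / k)^2)^3, 1)
}
curve(huber_loss(x, k = 1), from = -3, to = 3, ylab = "loss")
curve(bisquare_loss(x, k = 1), add = TRUE, lty = 2)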

For a given robustness threshold k, the task is then to solve the minimization problem (3) in order to estimate the regression coefficients $(\beta_0, \beta)$. For the low-dimensional case (n > p), one can, for example, use the iteratively reweighted least squares method (IRLS). For high- and ultra-high-dimensional cases, where fewer observations than explanatory variables are available (n < p or even n ≪ p), this is no longer feasible.

2.2 Statistical Boosting

Statistical boosting adapts a concept from machine learning to estimate statistical models for high-dimensional data, incorporating implicit variable selection and shrinkage of effect estimates through an iterative fitting process [32].

The empirical risk function $n^{-1} \sum_{i=1}^{n} \rho\left(y_i - f(x_i)\right)$ with loss function $\rho$ is effectively minimized using a component-wise gradient descent procedure. The boosting algorithm starts with an offset value $\hat{f}^{[0]}(x_i) = \hat{\beta}_0^{[0]}$, which in classical L2 boosting is usually set to the empirical mean of the response. Other location parameters, e.g., the median or weighted variants of the mean, can also be used to obtain less sensitive offset values. The estimate $\hat{f}^{[m]}$ is then iteratively updated for m = 1, …, m_stop. In each boosting iteration m, the negative gradient of $\rho$ is evaluated at the current residuals:

(6) $u^{[m]} = \left( -\left.\frac{\partial \rho(y_i - f)}{\partial f}\right|_{f = \hat{f}^{[m-1]}(x_i)} \right)_{i = 1, \ldots, n}$

Afterwards, a pre-specified set of base-learners, typically one base-learner $\hat{\delta}_j^{[m]}$ per covariate $j = 1, \ldots, p$, is fitted component-wise, i.e. separately one after another, to the gradient vector $u^{[m]}$. In case of linear base-learners $\hat{\delta}_j^{[m]}(x_i) = \hat{\beta}_0^{[m]} + \hat{\beta}_j^{[m]} x_{i,j}$, this boils down to the separate fitting of p simple linear models by ordinary least squares. In each iteration, only the best-performing base-learner $\hat{\delta}_{j^*}^{[m]}$ is selected, namely the one leading to the smallest residual sum of squares with respect to $u^{[m]}$.

Finally, the current model is updated with this best-performing base-learner fit via $\hat{f}^{[m]} = \hat{f}^{[m-1]} + \nu \cdot \hat{\delta}_{j^*}^{[m]}$ with a fixed learning rate 0 < ν ≤ 1, until the stopping iteration m_stop is reached. Thereby, m_stop is the main tuning parameter of statistical boosting, which is usually optimized with respect to prediction performance via cross-validation or resampling approaches. Early stopping of the algorithm leads to sparse models via automatic variable selection: since the algorithm starts only with an offset, variables whose base-learners are never selected as $\hat{\delta}_{j^*}^{[m]}$ for any m ∈ {1, …, m_stop} retain a coefficient estimate of 0 and are effectively excluded from the model. Furthermore, early stopping induces implicit regularization resulting in shrinkage of effect estimates towards zero, which avoids overfitting and can lead to higher prediction accuracy [10]. The L2 loss function corresponds to classical mean regression [33], where each iteration essentially refits the residuals from the previous iteration.
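For concreteness, a minimal sketch of component-wise L2 boosting with linear base-learners is given below (our own illustration, assuming centered covariates and no separate intercept in the base-learners; the actual analyses rely on the mboost implementation):

## Component-wise L2 boosting with simple linear base-learners (illustrative sketch)
l2_boost <- function(X, y, mstop = 100, nu = 0.1) {
  n <- nrow(X); p <- ncol(X)
  beta  <- rep(0, p)
  f_hat <- rep(mean(y), n)                 # offset: empirical mean of the response
  for (m in seq_len(mstop)) {
    u <- y - f_hat                         # negative gradient of the L2 loss = current residuals
    coef_j <- apply(X, 2, function(x) sum(x * u) / sum(x^2))   # OLS fit per covariate
    rss    <- sapply(seq_len(p), function(j) sum((u - X[, j] * coef_j[j])^2))
    j_star <- which.min(rss)               # best-performing base-learner
    beta[j_star] <- beta[j_star] + nu * coef_j[j_star]
    f_hat <- f_hat + nu * X[, j_star] * coef_j[j_star]         # update the current fit
  }
  list(offset = mean(y), beta = beta)      # variables never selected keep a coefficient of 0
}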

In general, statistical boosting can be seen as an alternative to penalized regression methods such as the lasso, and similar performance and models can be expected even though the underlying concepts are very different (for a detailed comparison of both approaches, see Hepp et al. [34]). The lasso directly penalizes the absolute values of the coefficient estimates, leading to shrinkage towards zero and sparse models. Boosting, in contrast, imposes regularization indirectly via early stopping of the algorithm. For the lasso, the main tuning parameter is the shrinkage parameter λ: larger values of λ lead to more strongly regularized effect estimates and sparser models. In comparison to boosting, the lasso is in general faster with respect to computing time. On the other hand, one main advantage of boosting is its modular structure and the resulting flexibility, e.g., in combining different sets of base-learners and loss functions [34], which we also take advantage of in our work. The lasso and a non-sparse robust version of it are used as comparisons in the simulation study (see Section 3) and for the analysis of the NCI-60 panel (see Section 4).

2.3 Adaptive robust loss functions for boosting

We propose a new approach which combines the composite robust loss functions discussed in Section 2.1 with statistical boosting from Section 2.2 in order to obtain robust coefficient estimates and variable selection for high-dimensional regression models. The key to our approach is the development of adaptive loss functions, where the robustness threshold iteratively depends on the current residuals and keeps the proportion of observations evaluated by the inner and outer parts of the loss function constant.

Due to the modular nature of the boosting algorithm it is rather straightforward to extend the approach to new loss functions, as they can essentially be plugged into the existing framework (see the appendix of [32] for an example). However, in case of composite robust loss functions like the Huber and Bisquare losses (see Figure 1), the choice of the robustness threshold k requires special attention. As a result of the iterative fitting process in statistical boosting, the sizes of the residuals vary considerably over different iterations m of the algorithm. If one simply uses the composite robust loss functions with a constant threshold k for all boosting iterations, there remains the challenge of determining the robustness threshold k by scaling the outcome, as discussed in Section 2.1. For example, if k is initially chosen too small for the Bisquare loss, an uncontrolled number of observations is considered as outliers, which means that a lot of information is lost. On the other hand, when the threshold k is chosen too large, almost all residuals are evaluated only by the inner part of the loss function, resulting in non-robust estimates. The challenge of adequately choosing k at the beginning of the algorithm recurs similarly at larger iterations m, when the residuals decrease in size. This is due to the fact that the component-wise gradient descent algorithm leads to a decreasing empirical risk function, which tends to result in overall smaller residuals. Hence, the proportion of observations falling into the outer part of the composite loss decreases (see Figure 2). To avoid such undesirable behaviour of gradient boosting, Friedman [35] recommended an adaptive approach for boosted classification trees by robustly re-estimating the standard deviation of the residuals in each iteration of the algorithm. This approach was also followed and adapted in further works [28, 29].

Figure 2: Visualization of residual paths of boosting using the Huber loss for a simulated data example with 0% (left) and 15% non-systematic corruption (right, blue) for the low-dimensional setting from Section 3 (p = 50 and n = 200). In contrast to a fixed robustness threshold k (green), the adaptive robustness τ-quantiles (red) with default quantile level τ ≈ 0.82 are adjusted based on the residual sizes (y-axis) along the boosting iterations m (x-axis).

Here, we propose to use adaptive robustness τ-quantiles with a fixed quantile level τ to maintain the initial balance between the two parts of the composite loss function throughout all boosting iterations. Therefore, as illustrated in Figure 2, the loss function with its robustness threshold k is automatically adapted to the sizes of the residuals $z_i^{[m-1]} = y_i - \hat{f}^{[m-1]}(x_i)$ of the current fit via

(7) $k^{[m]} = \operatorname{quantile}_{\tau}\left\{ \left|z_i^{[m-1]}\right|, \; i = 1, \ldots, n \right\},$

where $k^{[m]}$ is the τ-quantile of the absolute values of the current residuals $z_i^{[m-1]}$ in the m-th boosting iteration. During the boosting procedure, the residuals are implicitly reordered based on their absolute values, so that individual observations can effectively be evaluated by different parts of the composite loss in different iterations of the algorithm (see Figure 2, residual paths crossing the red line). An important consequence of this adaptive approach is that no additional estimation of a scale parameter is needed during (or before) the fitting process. For larger iterations m, the residual sizes stabilize, and so does the threshold $k^{[m]}$, which depends on them (Figure S1 in the supplement shows an additional simulation of the evolution of $k^{[m]}$ over 500 iterations).
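A minimal sketch (our own illustration, not the family code from the GitHub repository) of how the negative gradient of the adaptive Huber loss can be evaluated with the quantile-based threshold of Eq. (7) in a given boosting iteration:

## Adaptive Huber negative gradient with quantile-based threshold k^[m], Eq. (7);
## the default tau = 2 * pnorm(1.345) - 1 (approx. 0.821) corresponds to k = 1.345 (see below)
adaptive_huber_ngradient <- function(y, f_hat, tau = 2 * pnorm(1.345) - 1) {
  z <- y - f_hat                        # current residuals z_i^[m-1]
  k <- quantile(abs(z), probs = tau)    # adaptive threshold k^[m]
  ifelse(abs(z) <= k, z, k * sign(z))   # inner (L2-type) part vs. outer (L1-type) part
}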

In order to choose a sensible default robustness quantile level τ for statistical boosting with our approach, one may consider concepts from classical M-estimation under non-corrupted Gaussian errors (see Section 2.1). The quantile level τ can hence be determined as the probability mass of the interval [−k, k] under the standard Gaussian cdf Φ, i.e. τ = 2Φ(k) − 1. In particular, for the adaptive Huber loss, the popular robustness threshold k = 1.345 used in classical M-estimation (based on 95% efficiency) leads to the quantile level τ ≈ 0.82. For the Bisquare loss, the classical robustness threshold of k = 4.685 (for the same efficiency under Gaussian-distributed errors [31]) would be reflected by a quantile level τ ≈ 1. Note that, by construction (see Figure 1), the Bisquare loss assigns a constant value to all residuals larger than the robustness threshold k. However, for robustness quantile levels τ very close to one, no observations would fall into the outer robust part of the loss. Thus, in our simulations and the gene expression data application, we only consider robustness quantile levels with τ ≤ 0.99 (e.g., τ ∈ {0.9, 0.95, 0.99}), which guarantees a robust behaviour of the fitting process in the sense that at least 1% of the observations are always evaluated by the robust part of the loss function.

In practical biomedical applications it is rarely known how many observations might be corrupted in some way. However, in situations where a prior estimate for the amount of corruption π is available, this could also be considered when choosing a sensible robustness quantile level τ. Note that in our approach the quantile level τ describes the proportion of observations evaluated by the inner part of a loss function (independently of scale), while 1 − τ reflects the proportion evaluated by the outer part of the loss function. In case of the Bisquare loss, one could consider a direct and interpretable translation τ = 1 − π between the robustness quantile level τ and the assumed amount of corruption π. For the Huber loss, specifying the robustness quantile level τ based on an assumed amount of corruption π is slightly more difficult. In contrast to the Bisquare loss, the current residuals also enter the robust outer part of the Huber loss with their actual sizes rather than with a fixed value. Considering a direct translation like τ = 1 − π might therefore not yield the intended robustness. Applying Eq. (4) based on the minimax result of Huber [30], the popular robustness threshold k = 1.345 corresponds to an amount of corruption π of around 5.8%.

In the simulation study (Section 3) and the high-dimensional cancer cell lines application (Section 4) we investigated different choices of quantile levels τ for the adaptive loss functions, considering also different types and amounts of assumed corruption (Section 3). Overall, in the considered settings the results of our approach were not very sensitive regarding the choice of the quantile level τ.

2.4 Implementation

The new adaptive robust loss functions for statistical boosting are implemented as corresponding family objects for the R add-on package mboost [36] (see [32] for a tutorial). This way, our approach builds on a well-tested implementation, which also includes flexible functionality for tuning, prediction and graphical representations of resulting models and effects. The new loss functions can be specified via the argument family = AdaptHuber() or family = AdaptBisquare() inside the fitting functions of mboost and are available on GitHub (https://github.com/JSpBonn/Robust-losses-Statistical-Boosting), as is the code for reproducing the results of Sections 3 and 4.

Inside these family functions, the user can specify one of the three different arguments tau, k, pi for the robustness parameter of the composite loss functions, whose relation is described in Section 2.3. The specified argument will always be transferred to a robustness quantile level τ ∈ (0, 1). If no argument is specified, the default robustness quantile level τ is around 0.821 for the adaptive Huber loss, which corresponds to the robustness threshold k = 1.345. For the adaptive Bisquare loss, the default robustness quantile level is set to τ = 0.99, which can be interpreted as an approximation related to the often-used robustness threshold of k = 4.685 (when assuming non-corrupted Gaussian responses). The offset value for these adaptive robust loss functions is set to the median of the response, which is more robust to outliers than the empirical mean.

As an illustrative example, the linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ can be fitted with statistical boosting in combination with the adaptive Huber loss using the robustness quantile level τ = 0.8 via:

mod_h <- glmboost(y ~ x1 + x2, data = data.frame(y, x1, x2), family = AdaptHuber(tau = 0.8))

The fitting function glmboost() automatically applies linear base-learners, although alternative implementations for non-linear spline base-learners are also available [32]. Furthermore, different tuning methods for the stopping iteration m stop can be generically combined with the adaptive robust loss functions.
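Building on this example, a typical workflow for tuning could look as follows (a sketch assuming the family objects from the GitHub repository have been loaded; the file name in the comment is only a placeholder). It uses the standard mboost functions cvrisk(), cv() and mstop() to tune the stopping iteration via a 25-fold bootstrap, as done in Sections 3 and 4:

library(mboost)
# source("AdaptHuber.R")   # placeholder: load the adaptive families from the GitHub repository

mod_h <- glmboost(y ~ x1 + x2, data = data.frame(y, x1, x2),
                  family = AdaptHuber(tau = 0.8),
                  control = boost_control(mstop = 500, nu = 0.1))

## tune the stopping iteration via a 25-fold bootstrap of the training data
cvr <- cvrisk(mod_h, folds = cv(model.weights(mod_h), type = "bootstrap", B = 25))
mstop(mod_h) <- mstop(cvr)   # set the model to the optimal stopping iteration
coef(mod_h)                  # selected base-learners and coefficient estimates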

3 Simulations

The aim of our simulation study was to investigate the predictive performance of statistical boosting in combination with the proposed adaptive robust loss functions, also in comparison to classical estimation methods for robust regression, boosting with the classical L1 and L2 losses, and lasso approaches. In particular, we examined the potential benefits of using adaptive robust loss functions when parts of the data are corrupted, as well as potential limitations when they are applied in uncorrupted data situations. Additionally, we investigated the variable selection properties of the considered boosting and lasso approaches.

3.1 Settings

We considered a linear regression setting with p ∈ {50, 500, 15000} explanatory variables, of which only a small subset of 6 variables was truly informative. The true underlying coefficient vector was set to

$\beta = (1.5, 1.0, 0.5, 0.5, 1.0, 1.5, 0, \ldots, 0)^\top \in \mathbb{R}^p.$

To generate training data of size n, we simulated the covariates $x_i \sim \mathcal{N}_p(0, \Sigma)$ from a multivariate normal distribution with covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ consisting of diagonal entries $\sigma_{i,i} = 1$ and off-diagonal entries $\sigma_{i,j} = 0.5$ for i ≠ j. A sensitivity analysis with a Toeplitz covariance structure with correlation ρ = 0.8 for neighbouring covariates can be found in the online supplement. The outcome variable $y_i = x_i^\top \beta + \varepsilon_i$ was generated using different types of error terms $\varepsilon_i$ to reflect the presence of outliers. The uncorrupted observations were simulated using independent standard normally distributed errors $\varepsilon_i \sim N(0, 1)$. To generate data situations with varying numbers of corrupted observations, π ∈ {0%, 7.5%, 15%} of the considered outcome values were either (i) corrupted by an additional non-systematic error following N(0, 4) with a larger variance or (ii) corrupted by a systematic error via randomly adding or subtracting 4 ⋅ sd(y) from the original $y_i$, corresponding to so-called heavy-tailed distributions [20, 21]. Figure 3 illustrates the two conceptually different corruption mechanisms; note that the location of the corrupted outliers inside the sample is irrelevant. Our corruption mechanism reflects a data-generating process consisting of a mixture of two components, where the mixture proportion is fixed but obviously not known in practice. We also performed a sensitivity analysis with non-symmetric corruption (outliers where the corruption is a one-sided bias), which can also be found in the online supplement.
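A minimal sketch (our own illustration, not the exact simulation code from the repository; interpreting N(0, 4) as having variance 4 is an assumption here) of how such training data can be generated for the low-dimensional setting:

## Simulated data with correlated Gaussian covariates and two types of outcome corruption
library(MASS)

p <- 50; n <- 200; pi_corr <- 0.15
beta  <- c(1.5, 1.0, 0.5, 0.5, 1.0, 1.5, rep(0, p - 6))
Sigma <- matrix(0.5, nrow = p, ncol = p); diag(Sigma) <- 1

X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
y <- as.vector(X %*% beta) + rnorm(n)              # uncorrupted outcome

idx <- seq_len(ceiling(pi_corr * n))               # corrupted observations (their location is irrelevant)
## (i) non-systematic corruption: additional error with larger variance (assumed variance 4)
y_nonsys <- y; y_nonsys[idx] <- y_nonsys[idx] + rnorm(length(idx), sd = 2)
## (ii) systematic corruption: randomly add or subtract 4 * sd(y)
y_sys <- y; y_sys[idx] <- y_sys[idx] + sample(c(-1, 1), length(idx), replace = TRUE) * 4 * sd(y)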

Figure 3: Simulated response observations y_i for n = 200 samples with an amount of 15% non-systematic (blue squares, left) and systematic (red triangles, right) corruption for sample indices i = 1, …, 30. Non-corrupted response observations y_i for sample indices i = 31, …, 200 are displayed as black dots. The location of the corrupted observations inside the sample (here the ones with the smallest index) is irrelevant; they could also be spread randomly across the sample.

To compare the prediction performance of the different methods, for each simulation replicate we generated n_test = 1,000 uncorrupted test observations by the same process to evaluate the mean absolute error of prediction (MAEP), calculated as $n_{\text{test}}^{-1} \sum_{i=1}^{n_{\text{test}}} \left|y_i - \hat{f}(x_i)\right|$ for each estimated final model $\hat{f}$. Furthermore, for each method we examined the estimated coefficients and selected variables. Each simulation setting was fitted and evaluated B = 1,000 times. To illustrate the performance for various data situations, we considered a low-dimensional setting with p = 50 and n = 200, a high-dimensional (p > n) setting with p = 500 and n = 400 as well as an ultra-high-dimensional (p ≫ n) setting with p = 15,000 and n = 200.

To investigate the effect of the robustness quantile level τ for the adaptive Huber loss, we specified quantile levels corresponding to the efficiency criterion limits for the robustness thresholds k = 1 and k = 2 (see Eq. (4)). In addition, we also examined the popular and established threshold value of k = 1.345 [31]. The considered thresholds k correspond to robustness quantile levels τ = 0.683 (k = 1), τ = 0.821 (k = 1.345) and τ = 0.954 (k = 2). For the Bisquare loss, besides the robustness quantile level τ = 0.99 that approximately corresponds to the established threshold k = 4.685, we also considered the robustness quantile levels τ = 0.95 and τ = 0.90. In practice, this can be interpreted as an assumed amount of corruption π of 1%, 5% and 10% of the outcome data, respectively, as discussed in Section 2.3. For all settings, we fitted gradient boosting models with the adaptive robust loss functions as well as with the L2 and L1 losses, using a 25-fold bootstrap approach to tune the stopping iteration m_stop for each boosting model on the training data. Thereby, the training data are sampled with replacement, the observations inside each bootstrap sample are used to fit the model on a grid of iterations up to a given maximal iteration, and the remaining observations that have not been sampled are used for evaluating the predictive risk. The best stopping iteration is chosen as the iteration that minimizes the mean predictive risk over the 25 bootstrap samples [32]. To put the results into perspective, for the low-dimensional settings we also computed the classical least squares estimator as well as the classical robust M-estimators for the Huber and Bisquare losses. As comparison methods in all simulations, the lasso was computed using cv.glmnet() (package glmnet) and a robust, non-sparse version of the lasso was fitted via rq(model, method = "lasso") (package quantreg). For component-wise gradient boosting we used the implementation provided in the R add-on package mboost, while for the classical estimation approaches we applied the lm() and rlm() (package MASS) functions.
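For orientation, the comparison fits named above could be set up as follows (a sketch with illustrative, default-like settings, reusing the simulated X and y from the sketch in Section 3.1; the exact tuning used in the study is described in the text and in the repository code):

library(glmnet); library(quantreg); library(MASS); library(mboost)

fit_lasso_l2 <- cv.glmnet(X, y)                              # lasso with L2 loss
fit_lasso_l1 <- rq(y ~ X, method = "lasso")                  # robust, non-sparse L1 lasso variant
fit_boost_l2 <- glmboost(X, y, family = Gaussian())          # classic L2 boosting
fit_boost_l1 <- glmboost(X, y, family = Laplace())           # classic L1 boosting
fit_ls       <- lm(y ~ X)                                    # least squares (low-dimensional only)
fit_m_huber  <- rlm(y ~ X, psi = psi.huber, k = 1.345)       # classical M-estimation, Huber loss
fit_m_bisq   <- rlm(y ~ X, psi = psi.bisquare, c = 4.685)    # classical M-estimation, Bisquare loss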

3.2 Prediction accuracy

The average prediction performance of the considered methods in terms of MAEP are provided in Table 1 for all settings. Boxplots of MAEPs for the specific scenario with a corruption rate of 15% are shown in Figure 4.

Table 1:

Average mean absolute error of prediction (MAEP) and standard deviation (SD) for the different methods based on B = 1,000 simulation runs for different types and amounts of corruption for the low-dimensional (p = 50 and n = 200), high-dimensional (p = 500 and n = 400) and ultra-high-dimensional (p = 15,000 and n = 200) settings, where p refers to the number of explanatory variables and n to the number of training observations, evaluated on 1,000 non-corrupted test data points. For each setting, the three lowest MAEP values are printed in bold.

Type of corruption Non-systematic Systematic
Amount of corruption 0% 7.5% 15% 7.5% 15%
Low-dimensional LS/M-estimation:
L 2 loss 0.926 (0.034) 1.055 (0.074) 1.172 (0.092) 1.427 (0.119) 1.795 (0.173)
Huber loss 0.935 (0.037) 0.965 (0.043) 1.014 (0.054) 0.992 (0.052) 1.118 (0.093)
Bisquare loss 0.940 (0.039) 0.960 (0.042) 1.003 (0.053) 0.942 (0.039) 0.959 (0.042)
Lasso:
L 2 loss 0.843 (0.027) 0.893 (0.045) 0.941 (0.060) 1.042 (0.080) 1.194 (0.124)
L 1 loss 0.859 (0.032) 0.869 (0.036) 0.881 (0.038) 0.872 (0.037) 0.891 (0.041)
Classic boosting:
L 2 loss 0.841 (0.026) 0.890 (0.045) 0.937 (0.059) 1.036 (0.075) 1.190 (0.124)
L 1 loss 0.866 (0.033) 0.878 (0.039) 0.893 (0.043) 0.883 (0.041) 0.905 (0.046)
Boosting Huber:
τ = 0.954 (k = 2) 0.842 (0.027) 0.860 (0.032) 0.903 (0.048) 1.022 (0.071) 1.188 (0.121)
τ = 0.821 (k = 1.345, def.) 0.845 (0.027) 0.855 (0.030) 0.871 (0.034) 0.862 (0.033) 0.907 (0.050)
τ = 0.683 (k = 1) 0.850 (0.028) 0.860 (0.031) 0.873 (0.033) 0.867 (0.034) 0.888 (0.038)
Boosting Bisquare:
τ = 0.99 (def.) 0.864 (0.033) 0.854 (0.030) 0.873 (0.036) 0.860 (0.034) 0.914 (0.058)
τ = 0.95 0.890 (0.040) 0.875 (0.038) 0.869 (0.035) 0.850 (0.029) 0.877 (0.041)
τ = 0.90 0.924 (0.047) 0.912 (0.048) 0.898 (0.044) 0.884 (0.040) 0.861 (0.032)
High-dimensional Lasso:
L 2 loss 0.839 (0.023) 0.887 (0.035) 0.930 (0.046) 1.024 (0.059) 1.161 (0.087)
L 1 loss 0.861 (0.026) 0.871 (0.028) 0.881 (0.030) 0.875 (0.029) 0.892 (0.033)
Classic boosting:
L 2 loss 0.837 (0.023) 0.883 (0.034) 0.925 (0.044) 1.036 (0.075) 1.190 (0.124)
L 1 loss 0.860 (0.027) 0.871 (0.030) 0.881 (0.033) 0.883 (0.041) 0.905 (0.046)
Boosting Huber:
τ = 0.954 (k = 2) 0.839 (0.023) 0.857 (0.027) 0.896 (0.036) 1.022 (0.071) 1.188 (0.121)
τ = 0.821 (k = 1.345, def.) 0.843 (0.024) 0.857 (0.027) 0.870 (0.029) 0.862 (0.033) 0.907 (0.050)
τ = 0.683 (k = 1) 0.877 (0.041) 0.895 (0.043) 0.902 (0.040) 0.867 (0.034) 0.888 (0.038)
Boosting Bisquare:
τ = 0.99 (def.) 0.857 (0.026) 0.850 (0.026) 0.865 (0.029) 0.860 (0.034) 0.914 (0.058)
τ = 0.95 0.885 (0.032) 0.869 (0.030) 0.860 (0.029) 0.850 (0.029) 0.877 (0.041)
τ = 0.90 0.922 (0.039) 0.906 (0.037) 0.889 (0.035) 0.884 (0.040) 0.861 (0.032)
Ultra-high-dimensional Lasso:
L 2 loss 1.020 (0.057) 1.184 (0.098) 1.311 (0.121) 1.550 (0.125) 1.701 (0.110)
L 1 loss 1.089 (0.073) 1.134 (0.083) 1.192 (0.099) 1.169 (0.096) 1.289 (0.136)
Classic boosting:
L 2 loss 1.022 (0.058) 1.177 (0.096) 1.294 (0.112) 1.505 (0.103) 1.654 (0.084)
L 1 loss 1.194 (0.091) 1.247 (0.103) 1.306 (0.114) 1.291 (0.116) 1.399 (0.136)
Boosting Huber:
τ = 0.954 (k = 2) 1.028 (0.060) 1.105 (0.077) 1.220 (0.103) 1.467 (0.099) 1.650 (0.082)
τ = 0.821 (k = 1.345, def.) 1.041 (0.062) 1.092 (0.075) 1.159 (0.092) 1.150 (0.095) 1.394 (0.156)
τ = 0.683 (k = 1) 1.058 (0.066) 1.105 (0.077) 1.161 (0.092) 1.147 (0.092) 1.278 (0.132)
Boosting Bisquare:
τ = 0.99 (def.) 1.116 (0.074) 1.097 (0.079) 1.167 (0.094) 1.164 (0.110) 1.421 (0.157)
τ = 0.95 1.258 (0.097) 1.223 (0.112) 1.218 (0.120) 1.076 (0.079) 1.271 (0.147)
τ = 0.90 1.431 (0.122) 1.407 (0.129) 1.397 (0.143) 1.220 (0.100) 1.157 (0.110)
Figure 4: Prediction accuracy of the different methods in terms of MAEP based on B = 1,000 simulation runs for 15% non-systematic corruption (left) and 15% systematic corruption (right) in the low-dimensional (p = 50, n = 200; top), high-dimensional (p = 500, n = 400; middle) and ultra-high-dimensional (p = 15,000, n = 200; bottom) settings, where p refers to the number of explanatory variables and n to the number of observations. MAEP refers to mean absolute error of prediction based on 1,000 uncorrupted test data observations per run.

For the low-dimensional setting (p = 50 and n = 200) with non-systematic corruption, the boosting models and the lasso approaches tended to yield a better prediction accuracy than the classical LS- or M-estimation approaches. As expected, boosting and the lasso behaved very similarly for both the L1 and L2 losses. For the low-dimensional setting with systematic corruption, only the Bisquare loss with classical M-estimation provided a prediction accuracy comparable to the robust boosting approaches. Even though the adaptive Huber loss with robustness quantile level τ = 0.954 (corresponding to the robustness threshold k = 2) led to the highest MAEP among the adaptive robust boosting losses for systematic corruption, its prediction accuracy for non-corrupted data is comparable to the L2 boosting approach and even better for non-systematic corruption. Except for 15% systematic corruption, boosting with the adaptive Huber loss using the default robustness quantile level τ = 0.821 performed best or close to the best methods in each setting. In situations with corrupted data, at least one choice of the robustness quantile level for the adaptive Bisquare loss yielded the best performance of all methods; e.g., the Bisquare loss with robustness quantile level τ = 0.95 performed best for 15% non-systematic and 7.5% systematic corruption.

Similar results were obtained in the high-dimensional settings (with p = 500 and n = 400). The Huber loss with default robustness quantile level performed at least third best regarding prediction performance for all high-dimensional corrupted settings except for the case of 15% systematic corruption, where the Bisquare loss with robustness quantile level τ = 0.90 yielded the lowest average MAEP. The L 1 loss for boosting and the lasso resulted in slightly worse predictions than the best approaches of each setting. As expected, the original L 2 boosting approach performed best for non-corrupted data.

For the ultra-high-dimensional setting (p = 15,000 and n = 200), all methods reacted to the different amounts and kinds of corruption in the same manner as in the other settings, while yielding in general a lower prediction accuracy than in the scenarios with smaller p.

In general, we observed a competitive prediction performance for all boosting loss functions except for the L2 loss and the Huber loss with k = 2 (τ = 0.954) in cases of larger amounts of corrupted observations, particularly with systematic corruption. The lasso led to a similar prediction accuracy as boosting in all settings for both considered loss functions. Also the sensitivity analyses with Toeplitz correlation structure and non-symmetric corruption led to overall similar results (cf. Tables S3 and S7 in the online supplement).

3.3 Variable selection properties

To investigate the variable selection properties of boosting in combination with adaptive robust loss functions, we evaluated the true positive rate (sensitivity) and true negative rate (specificity) for all settings (summarized in Table 2) for the 6 informative variables and p − 6 non-informative variables.

Table 2:

Mean true positive rate/mean true negative rate (also known as sensitivity/specificity) of the different methods for all simulation settings based on B = 1,000 simulation runs. Low-dimensional refers to p = 50 explanatory variables with 6 informative variables and n = 200 observations; high-dimensional refers to p = 500 with 6 informative variables and n = 400; ultra-high-dimensional refers to p = 15,000 with 6 informative variables and n = 200. Among all methods with a mean true positive rate of 1, the three results with the highest mean true negative rate are printed in bold. For the ultra-high-dimensional setting, the three highest mean true positive rates among all methods with a true negative rate of 1 are printed in bold.

Type of corruption Non-systematic Systematic
Amount of corruption 0% 7.5% 15% 7.5% 15%
Low-dimensional Lasso:
L 2 loss 1.00/0.72 0.99/0.73 0.97/0.73 0.91/0.75 0.82/0.77
L 1 loss 1.00/0.00 1.00/0.00 1.00/0.00 1.00/0.00 1.00/0.00
Classic boosting:
L 2 loss 1.00/0.70 0.99/0.74 0.97/0.76 0.91/0.79 0.79/0.84
L 1 loss 1.00/0.67 1.00/0.68 0.99/0.68 1.00/0.68 0.99/0.69
Boosting Huber:
τ = 0.954 (k = 2) 1.00/0.49 1.00/0.55 0.99/0.60 0.93/0.64 0.80/0.72
τ = 0.821 (k = 1.345, def.) 1.00/0.66 1.00/0.65 1.00/0.66 1.00/0.64 0.99/0.74
τ = 0.683 (k = 1) 1.00/0.59 1.00/0.59 1.00/0.59 1.00/0.56 1.00/0.61
Boosting Bisquare:
τ = 0.99 (def.) 1.00/0.67 1.00/0.69 1.00/0.72 1.00/0.78 0.98/0.86
τ = 0.95 1.00/0.60 1.00/0.66 1.00/0.67 1.00/0.68 1.00/0.79
τ = 0.90 0.99/0.49 0.99/0.55 1.00/0.60 1.00/0.64 1.00/0.72
High-dimensional Lasso:
L 2 loss 1.00/0.94 1.00/0.94 0.98/0.94 0.91/0.95 0.80/0.96
L 1 loss 1.00/0.00 1.00/0.00 1.00/0.00 1.00/0.00 1.00/0.00
Classic boosting:
L 2 loss 1.00/0.94 1.00/0.95 0.98/0.95 0.90/0.96 0.78/0.97
L 1 loss 1.00/0.92 1.00/0.92 1.00/0.92 1.00/0.92 1.00/0.92
Boosting Huber:
τ = 0.954 (k = 2) 1.00/0.82 1.00/0.86 1.00/0.89 0.94/0.90 0.79/0.95
τ = 0.821 (k = 1.345, def.) 1.00/0.90 1.00/0.89 1.00/0.89 1.00/0.88 1.00/0.93
τ = 0.683 (k = 1) 1.00/0.78 1.00/0.77 1.00/0.79 1.00/0.75 1.00/0.84
Boosting Bisquare:
τ = 0.99 (def.) 1.00/0.92 1.00/0.94 1.00/0.95 1.00/0.96 0.99/0.98
τ = 0.95 1.00/0.88 1.00/0.91 1.00/0.93 1.00/0.94 1.00/0.97
τ = 0.90 1.00/0.82 1.00/0.86 1.00/0.89 1.00/0.90 1.00/0.95
Ultra-high-dimensional Lasso:
L 2 loss 0.94/1.00 0.79/1.00 0.69/1.00 0.45/1.00 0.16/1.00
L 1 loss 1.00/0.00 1.00/0.00 1.00/0.00 1.00/0.00 1.00/0.00
Classic boosting:
L 2 loss 0.95/1.00 0.80/1.00 0.71/1.00 0.52/1.00 0.22/1.00
L 1 loss 0.88/0.99 0.84/0.99 0.79/0.99 0.80/1.00 0.70/0.99
Boosting Huber:
τ = 0.954 (k = 2) 0.94/1.00 0.86/1.00 0.77/1.00 0.57/1.00 0.23/1.00
τ = 0.821 (k = 1.345, def.) 0.94/1.00 0.88/1.00 0.82/1.00 0.83/1.00 0.62/1.00
τ = 0.683 (k = 1) 0.93/0.99 0.89/0.99 0.82/0.99 0.85/0.99 0.74/1.00
Boosting Bisquare:
τ = 0.99 (def.) 0.91/0.99 0.88/1.00 0.80/1.00 0.79/1.00 0.54/1.00
τ = 0.95 0.83/0.99 0.84/0.99 0.80/0.99 0.89/1.00 0.68/1.00
τ = 0.90 0.72/0.98 0.74/0.99 0.71/0.99 0.84/0.99 0.80/1.00

The mean true positive rates of all robust boosting approaches in the high-dimensional setting were 1.00, except for the Bisquare loss with τ = 0.99 for 15% systematically corrupted data (0.99) and the Huber loss with τ = 0.954 for systematic corruption, which performed similarly to the L2 loss for boosting and the lasso. The same was true for the low-dimensional setting, where all other robust approaches had a true positive rate of at least 0.99. Boosting with the L2 loss also started with 1.00 for non-corrupted data, but with rising amounts of corruption dropped to 0.97 for non-systematic and to 0.79 for 15% systematically corrupted data.

The true negative rate generally differs between the low- and high-dimensional settings, since a single selected variable in the low-dimensional setting has a bigger impact on the proportion of (falsely) selected variables, resulting in generally lower rates for the low-dimensional setting. For 0% corruption, boosting with the L2 loss led to a mean true negative rate of 0.70, while the mean true negative rates for the robust approaches ranged between 0.49 and 0.67, with higher rates for the default Huber (0.66) and Bisquare (0.67) losses. For corrupted data situations, the L2 loss still yielded at least the second highest rate, with higher rates for higher amounts of corruption and for systematic data corruption. This also applies to most robust loss functions, where boosting with the Bisquare loss using the robustness quantile level τ = 0.99 yielded a mean true negative rate of 0.86 for 15% systematic corruption, which is the highest for this setting.

The high-dimensional settings showed generally similar results regarding the true negative rate. The standard L2 boosting approach yielded a mean true negative rate of at least 0.94, which is close to the highest in all settings. Besides the non-adaptive L1 approach, the adaptive Bisquare loss with the default robustness quantile level as well as with quantile level τ = 0.95 also resulted in quite high true negative rates, with means ranging from 0.92 to 0.98 and from 0.88 to 0.97, respectively. Using lower robustness quantile levels resulted in a higher variability across settings with a minimum mean true negative rate of 0.75 (Huber loss with k = 1).

Due to the large number of parameters p = 15,000, the mean true negative rates in the ultra-high-dimensional setting are generally close to 1. For the non-robust L2 loss, the true positive rate of the lasso and of boosting decreased drastically from 0.94 and 0.95 for 0% corruption, over the non-systematic corruption scenarios, down to 0.16 and 0.22, respectively, for 15% systematic corruption. The least robust Huber loss (k = 2) still performed comparably well for non-systematic corruption, but also ended with a mean true positive rate of only 0.23 for 15% systematic corruption. In contrast, the more robust Huber losses with the default quantile level or τ = 0.683 performed well for both kinds of corruption, as did the Bisquare loss, especially for systematic corruption.

In comparison to the other losses, boosting with the L2 loss resulted in a higher mean true negative rate for the same simulation settings. This is due to a relatively early stopping of the algorithm in case of corrupted data situations, as the tuning of m_stop via resampling on the corrupted training data detects overfitting earlier. The classical lasso with L2 loss again behaved very similarly to the L2 boosting approach for each kind and amount of corruption and for the low-, high- and ultra-high-dimensional settings, respectively. Regarding variable selection, the L1 lasso was not comparable due to its non-sparse model fits. The positive and negative predictive values for the simulations can be found in Table S1 in the supplement, where also the results of the sensitivity analyses (Tables S4, S5, S8 and S9) are reported; these led to results that were overall comparable to the original scenarios.

3.4 Interpretation of simulation results

Our simulations indicate that the LS/M-estimators tend to overfit the training data even in the relatively low-dimensional setting with p = 50 and n = 200, as no regularization or variable selection is incorporated. As expected, this can already be observed in settings without any corrupted observations.

This is not the case for the considered boosting approaches, where particularly the standard L2 loss resulted in a high mean true negative rate for boosting and the lasso. Even though this is a major advantage for non-corrupted data situations, it also led to lower mean true positive rates, particularly in systematically corrupted settings, in comparison to all robust boosting approaches. Additionally, the quadratic influence of the L2 loss and its sensitivity to (systematic) corruption led to higher MAEPs, whereas for larger amounts of corrupted observations and both types of corruption, the Bisquare loss still worked quite well. A plausible reason for this is the bounded structure of the loss function, which drastically reduces the influence of large outliers.

For such cases with corrupted data, the performance of the adaptive robust loss functions (Huber and Bisquare) within the boosting approach is very promising. Competitive results were obtained with the default robustness quantile levels of the loss functions, which performed robustly under all considered scenarios. However, prior knowledge about the true amount of corrupted observations can be advantageous, as the robustness quantile level may be specified accordingly: one may choose a lower robustness quantile level τ for the composite robust losses when the amount of potentially corrupted observations is high. In particular, the adaptive Huber loss function performed best for non-systematic corruption, but could lead to a higher MAEP in settings where the chosen robustness quantile level τ deviates too much from the underlying amount of systematic corruption (e.g., Huber with k = 2, τ = 0.954 and 15% systematic corruption), so that even the Huber loss turned out to be not robust enough anymore.

The adaptive Bisquare loss performed well for different types of corruption, particularly for systematic corruption and higher amounts of corrupted observations. Using the (default) robustness quantile level (τ = 0.99) is a reasonable choice if one has no information on outliers or the amount of corruption. If the proportion of corrupted observations was high, adjusting the robustness quantile level τ (lowering it to 0.95 or even 0.90) still led to a very competitive prediction performance and increased the true positive rate while slightly reducing the true negative rate.

However, for the adaptive Huber loss, in practical situations without information on how many observations could be outliers, we recommend to keep the default robustness quantile level τ – even if that means underestimating the amount of corruption in some situations. When using the adaptive Bisquare loss in boosting, the advantages are particularly evident for lower quantile levels (e.g., τ = 0.95), although for uncorrupted data this could lead to a lower prediction accuracy compared to the L 2 loss.

All considered sensitivity analyses with different correlation structures or non-symmetric corruption (see online supplement) led to results that were overall similar to the scenarios presented in the main manuscript.

In practical prediction modelling, robust loss functions could at least serve for sensitivity analyses even when the data are not assumed to be contaminated. Particularly in our considered high-dimensional and ultra-high-dimensional data situations, boosting in combination with the adaptive robust loss functions always yielded a reasonable performance regarding prediction accuracy and the selection of informative variables – no matter if or how much the outcome was corrupted.

4 Protein expression data of the NCI-60 cancer cell panel

As a result of computational advances and the establishment of large cohort studies, many (ultra-)high-dimensional data applications can nowadays be found in the areas of genomics and proteomics [37, 38]. In this context, sparser and robust models can enhance the interpretability and can point to a subset of variants, genes or proteins for further investigation.

To illustrate our approach, we considered an NCI-60 data set with 59 human cancer cell lines (cf. [39]), which is publicly available at https://discover.nci.nih.gov/cellminer/. Data based on this cancer cell panel has already been analyzed using various alternative robust approaches [40, 41], machine learning algorithms [42, 43] and variable selection methods [44, 45]. As potential predictors we chose the gene expression measurements from the Affymetrix HG-U133A-B chip, which are normalized using the guanine cytosine robust multi-array analysis.

In case of multiple measurements of the same gene for one sample, we considered the median gene expression measurement, leading to p = 14,951 explanatory variables. As outcome variable we considered the original KRT19 protein antibody array data of the "Protein: Antibody Array DTB" set [39]. KRT19 has the largest kurtosis in comparison to the other available protein expression data [39], which makes it a particularly interesting data example for robust regression approaches.

As a measure of predictive accuracy, we computed the mean absolute error of prediction (MAEP) via leave-one-out cross-validation for the n = 59 observations, successively using 58 observations as training data and the omitted observation as test data. This results in fitting 59 models, each of which predicts the single left-out observation, generating one absolute error. Additionally, we compared the selection frequencies of genes across different settings for the boosting families presented in Section 2.3 using different robustness quantile levels τ. As base-learners we incorporated separate linear models for each of the p = 14,951 candidate genes. Tuning of the stopping iteration was performed in the same way as in Section 3 via 25-fold bootstrapping. More details and reproducible code can be found on GitHub (https://github.com/JSpBonn/Robust-losses-Statistical-Boosting). Besides the comparison to the same lasso approaches as in Section 3, we used sparseLTS() (sparse least trimmed squares regression, [40]) and rgrplars() (robust groupwise least angle regression, [46]) as additional robust methods for comparison. These functions were applied with the default settings available in the R package robustHD [47].
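A sketch of this leave-one-out scheme (our own illustration with hypothetical object names genes for the 59 × 14,951 predictor matrix and krt19 for the outcome vector; the full reproducible code is provided in the GitHub repository):

library(mboost)   # plus the adaptive families from the GitHub repository

abs_errors <- sapply(seq_len(nrow(genes)), function(i) {
  mod <- glmboost(x = genes[-i, ], y = krt19[-i],
                  family = AdaptHuber(),          # default robustness quantile level, tau approx. 0.821
                  control = boost_control(mstop = 500, nu = 0.1))
  cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "bootstrap", B = 25))
  mstop(mod) <- mstop(cvr)                        # tuned stopping iteration
  abs(krt19[i] - predict(mod, newdata = genes[i, , drop = FALSE]))
})
mean(abs_errors)   # MAEP over the 59 left-out observations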

Given the large number of predictor variables, the component-wise gradient boosting algorithms selected only rather small subsets of genes (Figure 5), resulting in sparse models. Comparing the average model sizes for the different loss functions in Table 3, classical boosting with the L 2 and L 1 losses already resulted in sparse models containing on average 6.83 and 8.93 genes, respectively. Even sparser models resulted from the quantile-based adaptive Huber and Bisquare loss functions, with a maximum average of 2.95 selected variables for the Bisquare loss with robustness quantile level τ = 0.90. Here, a larger robustness quantile level τ led to sparser and more interpretable models.

Figure 5: Barplots of selection frequencies of genes in a leave-one-out cross-validation scheme (n = 59) on the NCI-60 cancer cell-line data for different loss functions in combination with component-wise gradient boosting and linear base-learners. The outcome was the KRT19 protein; potential explanatory variables were the (normalized) gene expression measurements of p = 14,951 genes. Only genes that were selected at least five times for at least one loss function are displayed; the ordering on the x-axis is the same for all loss functions.
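The selection frequencies summarised in Figure 5 could be tallied in a similar fashion; the sketch below assumes that the tuned models from the leave-one-out loop above were stored in a list loo_fits (a hypothetical modification of the earlier sketch, which only kept the absolute errors).

```r
## genes with a non-zero coefficient in each tuned leave-one-out model
sel_list <- lapply(loo_fits, function(fit) setdiff(names(coef(fit)), "(Intercept)"))

## selection frequency of each gene across the 59 fits
sel_freq <- sort(table(unlist(sel_list)), decreasing = TRUE)
head(sel_freq)  # most frequently selected genes (cf. Figure 5)
```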

Table 3:

Mean absolute error (MAE), mean number of selected variables and mean computational time in seconds per run, with the respective standard deviations (SD), for leave-one-out cross-validation on the NCI-60 data using boosting with different loss functions and two lasso approaches. The competitor rgrplars refers to robust groupwise least angle regression and sparseLTS to sparse least trimmed squares regression [47]. The three lowest MAEs are printed in bold.

Method MAE (SD) Mean Size (SD) Mean Time (SD)
Lasso:
L 2 loss 33.82 (51.91) 2.36 (2.88) 1.03 (0.05)
L 1 loss 35.99 (47.81) 14,951.00 (0.00) 8509.11 (1475.78)
Classic boosting:
L 2 loss 36.32 (35.13) 6.83 (5.12) 20.14 (0.90)
L 1 loss 27.93 (51.38) 8.93 (2.84) 20.17 (1.14)
Boosting Huber:
τ = 0.954 (k = 2) 28.34 (50.00) 1.69 (0.84) 23.83 (3.20)
τ = 0.821 (k = 1.345, def.) 26.98 (48.61) 1.88 (1.20) 21.50 (0.72)
τ = 0.683 (k = 1) 27.69 (49.92) 2.46 (3.54) 39.76 (1.73)
Boosting Bisquare:
τ = 0.99 (def.) 28.61 (50.72) 0.85 (0.69) 22.20 (2.46)
τ = 0.95 28.17 (51.32) 1.47 (0.60) 21.77 (2.13)
τ = 0.90 28.45 (51.47) 2.95 (1.88) 21.77 (1.85)
Robust competitors:
rgrplars() 28.08 (51.56) 3.78 (1.49) 237.6 (209.79)
sparseLTS() 28.27 (48.88) 42.34 (0.96) 13410.24 (120.29)

While there were differences between the sets of selected variables for different robustness quantile levels τ, there was also an overlap within each adaptive robust loss function when comparing different robustness quantile levels. For example, the Huber loss frequently selected the KRT19 gene expression variable as well as EPS8L2 for all chosen robustness quantile levels τ, which was also the case for the Bisquare loss with τ = 0.99. Additionally, ARHGEF5 was often selected in the, on average, slightly larger Huber models with the default robustness threshold and with k = 1. For the Bisquare losses with robustness quantile levels τ ∈ {0.95, 0.90}, the MORC4 gene had the highest overlap.

Considering the ultra-high-dimensionality of this application, the variable selection results appeared to be rather insensitive to the choice of the robustness quantile level. Comparable findings could be observed regarding the prediction performance. The adaptive robust loss functions led to a similar range for the MAE, where the Huber loss with the default robustness quantile level τ led to the lowest prediction error (MAE = 26.98). In this example, the L 2 loss was outperformed by all other considered loss functions, as were the lasso approaches (the lasso with the L 1 loss did not yield a sparse model and might thus overfit the training data). The predictions of the competing robust penalized regression approaches (sparse least trimmed squares regression, groupwise least angle regression) were within the range of the robust boosting approaches. The default robustness quantile level for the adaptive Huber loss also performed well for this skewed outcome, but in such cases with unknown or even non-symmetric error distributions it might also be beneficial to choose a lower robustness quantile level τ. The running times of the different boosting approaches were comparable, while the classical L 2 lasso was faster. The L 1 lasso required considerably more time due to the high number of covariates, as did the sparseLTS() function.

In summary, the different adaptive robust loss functions tended to select different variables for their prediction models; however, the final models were comparably sparse and, within each loss function, largely consistent across the considered robustness quantile levels τ.

5 Discussion

This article proposed adaptive robust loss functions for fitting statistical models with component-wise gradient boosting, based on a fixed quantile level that separates the two parts of composite losses. We reported promising results regarding the robustness against outliers in the response variable, providing evidence that our approach allows selecting and estimating robust prediction models even in the context of high-dimensional (p > n) data. In a simulation study with low-, high- and ultra-high-dimensional settings, we were able to show that the adaptive Huber and Bisquare losses are advantageous for regression situations where the outcome contains outliers or is affected by specific types of corruption. Additionally, for non-corrupted observations, they led to a model performance similar to boosting with the efficient L 2 loss. For non-systematic corruption (see Figure 3), the adaptive Huber loss showed the best prediction accuracy, whereas the adaptive Bisquare loss was favorable in cases with systematic corruption. The ability of boosting to select sparse and interpretable models for high-dimensional data was also preserved.

In the analysis of cancer cell lines in Section 4, representing an ultra-high-dimensional setting with a skewed response, the algorithm selected very sparse models with the proposed robust losses, while leading to more accurate predictions than boosting with standard loss functions and than the lasso.

An earlier gradient boosting approach for the Huber loss by Lutz et al. [28] focused on robustified (linear) base-learners, effectively correcting for potential high-leverage points in the explanatory variables, while iteratively applying a standardisation rule (scaling) to the residuals. Very recent work [29] also considered robust gradient boosting for linear regression. In contrast to our approach, a two-step procedure was followed: an initial non-parametric robust scale estimate based on regression trees is employed in each iteration before an M-type estimate is updated.

In contrast to these previous approaches, the quantile level τ presented here does not rely on an additional scale estimation for the residuals or on auxiliary scale tuning in every iteration. Adapting k [m] over the iterations 1, …, m stop while fixing the quantile level τ seems to be a simple but natural approach, since for a given application the amount of corruption in the response does not change during the modelling process. Hence, in our approach, the proportion of observations falling into the inner or outer part of the composite robust losses also stays the same. A minimal sketch of this idea is given below.
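The following R sketch defines a hypothetical family constructor AdaptiveHuberFamily() via mboost's generic Family() interface, in which the threshold is re-computed in every iteration as the τ-quantile of the current absolute residuals. It is only meant to convey the quantile-based adaptation and is not the authors' implementation (which is available in the GitHub repository referenced in Section 4).

```r
library(mboost)

## Hypothetical adaptive Huber family with fixed quantile level tau (sketch):
## in each iteration, the threshold is the tau-quantile of the current
## absolute residuals, so a fixed proportion of observations falls into the
## quadratic (inner) part of the loss. Observation weights are ignored here
## for simplicity.
AdaptiveHuberFamily <- function(tau = 0.821) {
  Family(
    ngradient = function(y, f, w = 1) {
      res <- y - f
      k_m <- quantile(abs(res), probs = tau)          # adapted threshold k^[m]
      ifelse(abs(res) <= k_m, res, k_m * sign(res))   # negative gradient of the Huber loss
    },
    loss = function(y, f) {
      res <- y - f
      k_m <- quantile(abs(res), probs = tau)
      ifelse(abs(res) <= k_m, res^2 / 2, k_m * (abs(res) - k_m / 2))
    },
    offset = function(y, w) median(y),                # robust offset as starting value
    name = "Adaptive Huber loss (fixed quantile level tau)"
  )
}
```

A family constructed in this way could then be passed to glmboost() with linear base-learners in the same way as the standard L 2 or L 1 families.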

Currently, we have only focused on the combination of adaptive robust loss functions with linear base-learners, leading to robust linear regression models. However, due to the modular structure of the algorithm [9], these new loss functions could also easily be combined with any other type of existing base-learner (cf. [32]). Furthermore, we only addressed robustness with respect to contaminated outcomes (vertical corruption): in future work, our approach could be further extended to contaminated covariates (e.g., high leverage points/horizontal outliers) in a similar manner as in [28]. In general, exploiting the modular structure of the boosting algorithm, base-learners with additional penalization (not only ordinary least squares estimates) can be combined with the robust adaptive loss functions to take vertical and horizontal outliers into account within one method. This flexibility regarding extensions is one major advantage of the algorithm in comparison to, e.g., lasso-based methods, which, on the other hand, tend to yield better running times while achieving a similar predictive performance for classical loss functions.

An inherited limitation of the gradient boosting approach is the lack of standard errors for effect estimates, making the classical construction of confidence intervals or hypothesis tests problematic. Potential solutions to this issue are computational workarounds like permutation tests [48, 49] or resampling-based confidence intervals [50], which were also investigated for parts of our simulation study (see online supplement Table S11). Another limitation, this time inherited from the field of robust regression, is that the composite robust loss functions we investigated here require an initial choice of the robustness parameter. In our case, this is reflected by the robustness quantile level τ, based on which the threshold implicitly adapts to the current residuals throughout the boosting iterations. As the quantile level τ controls the robustness of the final model, it could be specified based on information or assumptions about the amount of outliers or corrupted observations in the response. As this is usually not known in practice, from a computational perspective one could also treat the quantile level τ as an additional tuning parameter and optimize it via resampling or cross-validation, refitting a final model based on the optimized τ value (see the sketch below). Furthermore, it might also be of interest to investigate the influence of the adaptive k [m] on the optimal stopping iteration or the effective learning rate, respectively, and its effect on the resulting model size and regularization.
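A possible sketch of such a tuning scheme is given below. It reuses the hypothetical AdaptiveHuberFamily() constructor and the hypothetical objects X and protein from the earlier sketches, compares candidate τ values via cross-validated mean absolute error (which, unlike the loss-specific risks, is comparable across loss functions), and keeps the stopping iteration fixed for brevity.

```r
library(mboost)

tau_grid <- c(0.683, 0.821, 0.954)                       # candidate quantile levels
folds <- sample(rep(1:5, length.out = nrow(X)))          # simple 5-fold split
cv_mae <- sapply(tau_grid, function(tau) {
  mean(sapply(1:5, function(k) {
    fit <- glmboost(X[folds != k, , drop = FALSE], protein[folds != k],
                    family = AdaptiveHuberFamily(tau),   # hypothetical family (see above)
                    control = boost_control(mstop = 500, nu = 0.1))
    pred <- drop(predict(fit, newdata = X[folds == k, , drop = FALSE]))
    mean(abs(protein[folds == k] - pred))
  }))
})
tau_opt <- tau_grid[which.min(cv_mae)]                   # refit a final model with tau_opt
```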

However, the results of our simulations and the protein-expression application suggest that all considered τ values for the adaptive Huber and Bisquare losses worked reasonably well regarding prediction performance and variable selection. We hence do not propose an additional tuning of τ but recommend using the default fixed quantile levels for the Huber (τ = 0.821) and Bisquare (τ = 0.99) loss in the absence of knowledge about outliers or corrupted observations.

Future research perspectives in this context include the combination with automated outlier detection methods to specify the robustness quantile level τ, e.g., for the Bisquare loss. Another potential extension of our approach is to combine component-wise gradient boosting with M-quantile regression, an extension of quantile and expectile regression that uses robust M-estimation to generalize median and mean regression, respectively. Here, the Huber loss function can be employed to estimate robust models for specific quantiles of the response distribution, leading to non-symmetric loss functions (cf. [51]). An extension of robust boosting methods towards other types of regression models, e.g., distributional regression [52, 53], might also be worth investigating (cf. [54]).

In terms of practical prediction modelling for biomedical applications, it always makes sense to check various modelling options, including robust approaches such as the ones proposed in this work, and to compare their results with respect to the objectives of the analysis.

Particularly for high-dimensional data with potentially corrupted outcomes, statistical boosting with quantile-based adaptive losses is able to estimate and select interpretable prediction models that are both robust and sparse while yielding a competitive prediction accuracy.


Corresponding author: Jan Speller, Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany, E-mail:

Acknowledgement

The authors gratefully acknowledge fruitful discussions with Benjamin Hofner on the methodological concepts of this article.

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: None declared.

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

References

1. Barrios, EB. Robustness, data analysis, and statistical modeling: the first 50 years and beyond. Commun Stat Appl Methods 2015;22:543–56. https://doi.org/10.5351/csam.2015.22.6.543.

2. Huber, PJ. Robust statistics. New York: John Wiley & Sons; 1981. https://doi.org/10.1002/0471725250.

3. Maronna, RA, Martin, RD, Yohai, VJ, Salibián-Barrera, M. Robust statistics: theory and methods (with R), 2nd ed. New York: John Wiley & Sons; 2019. https://doi.org/10.1002/9781119214656.

4. Susanti, Y, Pratiwi, H, Sulistijowati, S, Liana, T. M estimation, S estimation, and MM estimation in robust regression. Int J Pure Appl Math 2014;91:349–60. https://doi.org/10.12732/ijpam.v91i3.7.

5. Fan, J, Lv, J. A selective overview of variable selection in high dimensional feature space. Stat Sin 2010;20:101–48.

6. Hoerl, AE, Kennard, RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970;12:55–67. https://doi.org/10.1080/00401706.1970.10488634.

7. Tibshirani, R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B 1996;58:267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.

8. Bühlmann, P, Hothorn, T. Boosting algorithms: regularization, prediction and model fitting. Stat Sci 2007;22:477–505. https://doi.org/10.1214/07-sts242.

9. Mayr, A, Hofner, B. Boosting for statistical modelling: a non-technical introduction. Stat Model Int J 2018;18:365–84. https://doi.org/10.1177/1471082x17748086.

10. Mayr, A, Hofner, B, Schmid, M. The importance of knowing when to stop. Methods Inf Med 2012;51:178–86. https://doi.org/10.3414/ME11-02-0030.

11. Schmid, M, Hothorn, T. Boosting additive models using component-wise P-splines. Comput Stat Data Anal 2008;53:298–311. https://doi.org/10.1016/j.csda.2008.09.009.

12. Griesbach, C, Säfken, B, Waldmann, E. Gradient boosting for linear mixed models. Int J Biostat 2021;17:317–29. https://doi.org/10.1515/ijb-2020-0136.

13. Kneib, T, Hothorn, T, Tutz, G. Variable selection and model choice in geoadditive regression models. Biometrics 2009;65:626–34. https://doi.org/10.1111/j.1541-0420.2008.01112.x.

14. Hepp, T, Zierk, J, Rauh, M, Metzler, M, Mayr, A. Latent class distributional regression for the estimation of non-linear reference limits from contaminated data sources. BMC Bioinf 2020;21:524. https://doi.org/10.1186/s12859-020-03853-3.

15. Thomas, J, Mayr, A, Bischl, B, Schmid, M, Smith, A, Hofner, B. Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates. Stat Comput 2018;28. https://doi.org/10.1007/s11222-017-9754-6.

16. Wang, Z. Robust boosting with truncated loss functions. Electron J Stat 2018;12:599–650. https://doi.org/10.1214/18-ejs1404.

17. Fenske, N, Fahrmeir, L, Hothorn, T, Rzehak, P, Höhle, M. Boosting structured additive quantile regression for longitudinal childhood obesity data. Int J Biostat 2013;9:1–18. https://doi.org/10.1515/ijb-2012-0035.

18. Fenske, N, Kneib, T, Hothorn, T. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. J Am Stat Assoc 2011;106:494–510. https://doi.org/10.1198/jasa.2011.ap09272.

19. Filzmoser, P, Nordhausen, K. Robust linear regression for high-dimensional data: an overview. WIREs Comput Stat 2021;13:e1524. https://doi.org/10.1002/wics.1524.

20. Zhao, J, Yan, G, Zhang, Y. Robust estimation and shrinkage in ultrahigh dimensional expectile regression with heavy tails and variance heterogeneity. Stat Pap 2022;63:1–28. https://doi.org/10.1007/s00362-021-01227-2.

21. Lugosi, G, Mendelson, S. Mean estimation and regression under heavy-tailed distributions: a survey. Found Comput Math 2019;19:1145–90. https://doi.org/10.1007/s10208-019-09427-x.

22. Fan, J, Lv, J. Sure independence screening for ultrahigh dimensional feature space. J Roy Stat Soc B 2008;70:849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x.

23. Wang, H. Forward regression for ultra-high dimensional variable screening. J Am Stat Assoc 2009;104:1512–24. https://doi.org/10.1198/jasa.2008.tm08516.

24. Maronna, RA. Robust ridge regression for high-dimensional data. Technometrics 2011;53:44–53. https://doi.org/10.1198/tech.2010.09114.

25. Chang, L, Roberts, S, Welsh, A. Robust lasso regression using Tukey's biweight criterion. Technometrics 2018;60:36–47. https://doi.org/10.1080/00401706.2017.1305299.

26. Deutelmoser, H, Scherer, D, Brenner, H, Waldenberger, M, Suhre, K, INTERVAL study, et al. Robust Huber-lasso for improved prediction of protein, metabolite and gene expression levels relying on individual genotype data. Briefings Bioinf 2020;22. https://doi.org/10.1093/bib/bbaa230.

27. Amato, U, Antoniadis, A, De Feis, I, Gijbels, I. Penalised robust estimators for sparse and high-dimensional linear models. Stat Methods Appl 2021;30:1–48. https://doi.org/10.1007/s10260-020-00511-z.

28. Lutz, RW, Kalisch, M, Bühlmann, P. Robustified L2 boosting. Comput Stat Data Anal 2008;52:3331–41. https://doi.org/10.1016/j.csda.2007.11.006.

29. Ju, X, Salibián-Barrera, M. Robust boosting for regression problems. Comput Stat Data Anal 2021;153:107065. https://doi.org/10.1016/j.csda.2020.107065.

30. Huber, PJ. Robust estimation of a location parameter. Ann Math Stat 1964;35:73–101. https://doi.org/10.1214/aoms/1177703732.

31. Yu, C, Yao, W. Robust linear regression: a review and comparison. Commun Stat Simulat Comput 2017;46:6261–82. https://doi.org/10.1080/03610918.2016.1202271.

32. Hofner, B, Mayr, A, Robinzonov, N, Schmid, M. Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 2014;29:3–35. https://doi.org/10.1007/s00180-012-0382-5.

33. Bühlmann, P. Boosting for high-dimensional linear models. Ann Stat 2006;34:559–83. https://doi.org/10.1214/009053606000000092.

34. Hepp, T, Schmid, M, Gefeller, O, Waldmann, E, Mayr, A. Approaches to regularized regression – a comparison between gradient boosting and the lasso. Methods Inf Med 2016;55:422–30. https://doi.org/10.3414/ME16-01-0033.

35. Friedman, JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;29:1189–232. https://doi.org/10.1214/aos/1013203451.

36. Hothorn, T, Bühlmann, P, Kneib, T, Schmid, M, Hofner, B. mboost: model-based boosting. R package version 2.9-5; 2021.

37. Loos, RJF. 15 years of genome-wide association studies and no signs of slowing down. Nat Commun 2020;11:5900. https://doi.org/10.1038/s41467-020-19653-5.

38. MacArthur, J, Bowler, E, Cerezo, M, Gil, L, Hall, P, Hastings, E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res 2016;45:896–901. https://doi.org/10.1093/nar/gkw1133.

39. Sun, Q, Zhou, W, Fan, J. Adaptive Huber regression. J Am Stat Assoc 2020;115:254–65. https://doi.org/10.1080/01621459.2018.1543124.

40. Alfons, A, Croux, C, Gelper, S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 2013;7:226–48. https://doi.org/10.1214/12-aoas575.

41. Park, H. Outlier-resistant high-dimensional regression modelling based on distribution-free outlier detection and tuning parameter selection. J Stat Comput Simulat 2017;87:1799–812. https://doi.org/10.1080/00949655.2017.1287186.

42. Archetti, F, Giordani, I, Vanneschi, L. Genetic programming for anticancer therapeutic response prediction using the NCI-60 dataset. Comput Oper Res 2010;37:1395–405. https://doi.org/10.1016/j.cor.2009.02.015.

43. Hemphill, E, Lindsay, J, Lee, C, Măndoiu, II, Nelson, CE. Feature selection and classifier performance on diverse biological datasets. BMC Bioinf 2014;15:S4. https://doi.org/10.1186/1471-2105-15-S13-S4.

44. Min, EJ, Safo, SE, Long, Q. Penalized co-inertia analysis with applications to -omics data. Bioinformatics 2018;35:1018–25. https://doi.org/10.1093/bioinformatics/bty726.

45. Wu, C, Zhou, F, Ren, J, Li, X, Jiang, Y, Ma, S. A selective review of multi-level omics data integration using variable selection. High-Throughput 2019;8:4. https://doi.org/10.3390/ht8010004.

46. Alfons, A, Croux, C, Gelper, S. Robust groupwise least angle regression. Comput Stat Data Anal 2016;93:421–35. https://doi.org/10.1016/j.csda.2015.02.007.

47. Alfons, A. robustHD: an R package for robust regression with high-dimensional data. J Open Source Software 2021;6:3786. https://doi.org/10.21105/joss.03786.

48. Hepp, T, Schmid, M, Mayr, A. Significance tests for boosted location and scale models with linear base-learners. Int J Biostat 2019;15. https://doi.org/10.1515/ijb-2018-0110.

49. Mayr, A, Schmid, M, Pfahlberg, A, Uter, W, Gefeller, O. A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models. Stat Methods Med Res 2017;26:1443–60. https://doi.org/10.1177/0962280215581855.

50. Hofner, B, Kneib, T, Hothorn, T. A unified framework of constrained regression. Stat Comput 2016;26:1–14. https://doi.org/10.1007/s11222-014-9520-y.

51. Chambers, R, Dreassi, E, Salvati, N. Disease mapping via negative binomial regression M-quantiles. Stat Med 2014;33:4805–24. https://doi.org/10.1002/sim.6256.

52. Mayr, A, Fenske, N, Hofner, B, Kneib, T, Schmid, M. Generalized additive models for location, scale and shape for high dimensional data – a flexible approach based on boosting. J Roy Stat Soc: Ser C (Appl Stat) 2012;61:403–27. https://doi.org/10.1111/j.1467-9876.2011.01033.x.

53. Rigby, RA, Stasinopoulos, DM. Generalized additive models for location, scale and shape. J Roy Stat Soc: Ser C (Appl Stat) 2005;54:507–54. https://doi.org/10.1111/j.1467-9876.2005.00510.x.

54. Aeberhard, WH, Cantoni, E, Marra, G, Radice, R. Robust fitting for generalized additive models for location, scale and shape. Stat Comput 2021;31:11. https://doi.org/10.1007/s11222-020-09979-x.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/ijb-2021-0127).


Received: 2021-12-01
Revised: 2022-06-09
Accepted: 2022-06-20
Published Online: 2022-08-10

© 2022 Walter de Gruyter GmbH, Berlin/Boston
