Abstract
We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter has to be specified that controls the robustness. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of the residuals, based on a fixed quantile level. We compared the performance of our approach to classical M-regression, boosting with standard loss functions and the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to a better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a similar performance to boosting with the efficient L_2 loss or the lasso. Also in the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI60 cell line panel), boosting with the new adaptive loss functions performed favourably compared to standard loss functions and competing robust approaches regarding prediction accuracy, and resulted in very sparse models.
1 Introduction
Statistical tools for data analysis and regression are increasingly confronted with large and complex data situations [1]. The availability of data with a rising number of measured variables and observations, as well as the generally increasing number of data sources, provides many new research possibilities but also challenges. This is particularly true in (bio)medical research, where, for example, genome-wide association studies and large cohort data lead to advances in genetic epidemiology. Such observational data often contain outliers or might be corrupted in some way, which is in general not easy to detect.
High-dimensional data situations, where the number of covariates exceeds the number of observations (p > n or even p ≫ n), amplify this difficulty of data analysis in the context of regression. Appropriate tools for these situations have to address two methodological issues: robustness against potentially corrupted data (e.g., via robust loss functions) and handling the high-dimensionality of the data (e.g., via regularization and variable selection).
Classical least squares regression, for example, is highly sensitive to outliers, and even single observations can have a major impact on the final model estimates [2, 3]. Many robust regression approaches have been proposed for low-dimensional settings with a relatively small number of covariates (p < n), including M-estimation, S-estimation and MM-estimation [4]. However, these classical inference approaches tend to become infeasible for high-dimensional settings (p > n), where additionally some form of data-driven variable selection or regularization needs to be imposed.
In the general context of high-dimensional statistical modelling, pre-selection or screening approaches are often used to reduce the dimensionality of the data before the analysis (for an overview see [5]). Another way to address large numbers of explanatory variables is to employ a regularization approach, such as ridge regression (e.g., [6]) or lasso regression (e.g., [7]). The L_1 penalization of the coefficient estimates via the lasso is particularly beneficial in this context, as it incorporates variable selection, leading also to a better interpretability of the sparse model.
Another computational concept to fit such sparse and interpretable models is statistical boosting [8], which also provides the basis for our approach. Statistical boosting builds on a concept from machine learning and iteratively updates the estimated regression coefficients using componentwise gradient descent (for a non-technical introduction to boosting see [9]). An empirical risk function is minimized and the algorithm is stopped after finitely many iterations. The main tuning parameter for statistical boosting is the stopping iteration, which is typically specified via resampling to optimize the predictive performance [10]. Early stopping of the algorithm leads to automatic variable selection and shrinkage of effect estimates, avoiding overfitting and identifying small subsets of covariates (e.g., genetic variants) for closer examination. The concept of statistical boosting has been extended to different types of statistical models, including generalized additive models [8, 11], linear mixed models [12], geoadditive regression [13], distributional regression [14, 15], categorical regression [16] and quantile regression [17, 18].
Lately, there has also been much research activity on developing, enhancing and improving modern robust statistical approaches for high-dimensional linear regression (see [19] for a recent overview). In general, the term “robust” has multiple meanings, since it depends strongly on the context and on which parts of the data analysis should be considered “robust”. For example, special attention has been given to heterogeneity and heavy-tailed error distributions in regularized expectile regression [20] and in statistical learning problems [21]. Robust versions of screening approaches (e.g., [22, 23]) and of penalized approaches like ridge regression [24] and the lasso [25, 26] have been proposed; furthermore, new approaches for M-estimators motivated by wavelet thresholding [27] have also been considered. Lutz et al. [28] addressed high leverage points in the explanatory variables (also known as horizontal outliers) with several variants of robustified (linear) base-learners in componentwise L_2 boosting. Additionally, their robustification for the outcome relies on an auxiliary scale estimator to impose a dependency between the current residual sizes and the robustness parameter of the Huber loss. Ju and Salibián-Barrera [29] very recently proposed a two-step approach for robust linear regression, first applying L_1 regression trees as an initialization step for the non-parametric and robust estimation of the residual scale (S-type boosting), followed by gradient boosting (M-type boosting).
In this work we take advantage of the modular structure of the boosting algorithm and incorporate different robust loss functions to handle outlier-corrupted data in the outcome (vertical outliers). For continuous outcome variables we propose a combination of robust linear regression using composite loss functions and a data-adapted variant of statistical gradient boosting, combining their advantages of robustness and sparsity while preserving the interpretability of classical robust regression models. In contrast to the common L_2 and L_1 losses, several robust loss functions such as the Huber and Bisquare losses are composed of two parts, separated by a threshold parameter k. To distinguish the two parts of composite loss functions, all arguments with absolute value smaller than or equal to k are considered to be located within the “inner” part, while arguments with absolute values larger than k fall into the “outer” part of the composite loss. We adjust these loss functions by a quantile-based approach, which adapts the threshold parameter in each iteration of the boosting algorithm to the current fit. In contrast to previous robust boosting approaches [28, 29], the proposed quantile-based approach does not require the estimation of an auxiliary scale parameter for the composite loss functions, neither during the boosting process nor in an additional initialisation step. Instead, our approach is based on the choice of a fixed quantile level that separates the observations into a proportion evaluated by the inner (non-robust) part and a proportion evaluated by the outer (robust) part of the composite loss functions. This results in an adaptive robustness threshold k^[m] based on the sizes of the current residuals after each boosting iteration m.
The remainder of this article is structured as follows: after a short overview of robust regression and boosting in Sections 2.1 and 2.2, the new approach with quantile-based adaptive loss functions is presented in Section 2.3. Details of the implementation are briefly discussed in Section 2.4. In Section 3, our approach is illustrated and evaluated in a simulation study, considering different types and amounts of corruption (non-systematic and systematic) in low- and high-dimensional settings. The analysis of skewed antibody array data (NCI60 panel) with n = 59 cancer cell lines is presented in Section 4, where the robust quantile-based adaptive loss functions were applied within the boosting framework to model the KRT19 protein expression outcome based on p = 14,951 gene expression predictors. A comparison to the lasso and a robust version of the lasso is provided. The R code implementing our approach, as well as the code to reproduce our analysis, is available on GitHub (https://github.com/JSpBonn/RobustlossesStatisticalBoosting).
2 Methods
For n observations (x_1, y_1), …, (x_n, y_n) of p explanatory variables with x_i = (x_i1, …, x_ip)^T ∈ R^p, we consider the linear regression model

y_i = β_0 + x_i^T β + ε_i,   i = 1, …, n,   (1)

with intercept β_0, regression coefficient vector β = (β_1, …, β_p)^T and error terms ε_i. The estimation of (β_0, β) can be facilitated by the choice of an appropriate loss function ρ, leading to the empirical risk function

R(β_0, β) = (1/n) Σ_{i=1}^{n} ρ(y_i − f(x_i)),   with f(x_i) = β_0 + x_i^T β,   (2)

which is minimized with respect to the candidate coefficients,

(β̂_0, β̂) = argmin_{(β_0, β)} (1/n) Σ_{i=1}^{n} ρ(y_i − β_0 − x_i^T β).   (3)
2.1 Robust Regression
We focus on methods for robust regression in data situations with potentially (highly) corrupted values y_i of the outcome variable and uncorrupted values x_i of the explanatory variables. To reduce the influence of potential outliers on model fitting, robust loss functions ρ have been proposed (cf. [27] and the references therein). A standard loss function that leads to robust models is the L_1 loss ρ(z) = |z|, corresponding to median regression and yielding the maximum likelihood estimator under Laplacian-distributed errors. Compared to the L_2 loss, however, the L_1 loss can result in inefficient estimators in uncorrupted data situations [3]. Additionally, the L_1 loss function cannot be adapted to different amounts, types or magnitudes of corruption in the data.
In order to address these issues, composite loss functions consist of two segments, an “inner” and an “outer” part separated by a robustness threshold k ≥ 0, with the outer part applying to response observations that deviate strongly from the model fit (large residuals). An important example is the so-called Huber loss [30], which forms a smooth and convex mixture of the L_1 and L_2 losses, given by

ρ_H(z) = z²/2 for |z| ≤ k,   and   ρ_H(z) = k|z| − k²/2 for |z| > k.
The choice of the robustness threshold k is essential, as it specifies which observations are evaluated by the outer, robust L_1 part or by the inner, non-robust L_2 part, respectively. It is important to note that for composite loss functions like the Huber loss, the robustness threshold k depends on the scale of the residuals z_i = y_i − f(x_i). Typically, robust approaches like the (normalized) median absolute deviation (MAD[N]) are used for estimating the standard deviation of the residuals, so that k can be specified on a standardized scale.
The solution of (3) is a so-called “M-estimator”, a generalization of the maximum likelihood estimator [3].
The typically used robustness threshold k = 1.345 for the Huber loss is derived theoretically to achieve 95% asymptotic efficiency compared to the classical estimator based on the L_2 loss in the case of non-corrupted data with Gaussian errors [30]. In addition to the idealized Gaussian error, Huber [30] also considered the weaker assumption of a composition of a Gaussian center and a corrupted outer part of the distribution, where the underlying cumulative distribution function (cdf) for the standardized residuals is assumed to be F = (1 − π)Φ + πH, with Φ representing the standard Gaussian cdf and H an unspecified symmetric cdf. The proportion π ∈ [0, 1] reflects the amount of corruption in the distribution. Under these model assumptions, Huber [30] showed that the maximal asymptotic variance among all translation invariant estimators is minimized by the Huber loss, when the robustness threshold k is connected to the amount of corruption π via

π/(1 − π) = 2φ(k)/k − 2Φ(−k),   (4)

where φ denotes the standard Gaussian density.
In this particular case, there exists a unique asymptotically “least informative” or “least favorable” distribution F with a Gaussian center and double-exponential tails. Equation (4) implies that the smaller the robustness threshold k is chosen for the Huber loss, the larger we implicitly assume the corrupted part πH of the distribution F to be. Huber recommended a robustness threshold of k ∈ (1, 2) for standardized residuals, as this choice is not very sensitive to the underlying model [30].
The typically used robustness threshold k = 1.345, corresponding to an asymptotic efficiency of 95% under Gaussian errors, can be translated via Eq. (4) to a proportion π = 5.8% of corrupted outcome data. The lower boundary k = 1 corresponds to π = 14.3% corrupted response observations and an asymptotic efficiency of around 90.3% under Gaussian errors; the upper boundary k = 2 corresponds to around π = 0.8% corrupted data with 99% efficiency under Gaussian errors [30]. The strength of the Huber loss is its flexibility to induce robust and efficient estimates in different data situations. The classical L_1 loss, for comparison, leads to an efficiency of only 63.7% under Gaussian errors (cf. [3], p. 21).
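As a minimal numerical sketch of this piecewise structure (an illustration in Python; the helper name huber_loss is ours, not part of any package):

```python
import numpy as np

def huber_loss(z, k=1.345):
    """Composite Huber loss: quadratic 'inner' part for |z| <= k,
    linear 'outer' part for |z| > k (illustrative helper)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= k, 0.5 * z**2, k * np.abs(z) - 0.5 * k**2)
```

At |z| = k both branches take the value k²/2, so the loss is continuous; the linear outer part is what caps the influence of large residuals.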
An alternative composite loss function for robust regression is Tukey’s Bisquare loss [27], a non-convex and smooth loss function defined by

ρ_B(z) = (k²/6)·(1 − (1 − (z/k)²)³) for |z| ≤ k,   and   ρ_B(z) = k²/6 for |z| > k.
The gradient of ρ_B vanishes in the outer part of the distribution of z (leading to a so-called redescending estimator). This yields a clear distinction between non-corrupted observations and potential outliers. Taking scaling into account, the 95% asymptotic efficiency under Gaussian-distributed errors is reached for the Bisquare loss with the robustness threshold k = 4.685 [31]. A larger robustness threshold k implicitly leads to fewer observations being considered as outliers, while a smaller robustness threshold k leads to a larger amount of observations being weighted by the outer part of the loss function. Figure 1 illustrates the composite Huber and Bisquare loss functions in comparison to the standard L_1 and L_2 losses.
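The bounded loss and its vanishing gradient can likewise be sketched numerically (Python illustration; the helper names are ours):

```python
import numpy as np

def bisquare_loss(z, k=4.685):
    """Tukey's Bisquare loss: bounded by k**2/6 in the outer part (illustrative)."""
    z = np.asarray(z, dtype=float)
    inner = (k**2 / 6.0) * (1.0 - (1.0 - (z / k)**2)**3)
    return np.where(np.abs(z) <= k, inner, k**2 / 6.0)

def bisquare_psi(z, k=4.685):
    """Derivative of the Bisquare loss; it is exactly zero for |z| > k,
    so gross outliers no longer influence the fit (redescending behaviour)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= k, z * (1.0 - (z / k)**2)**2, 0.0)
```

In contrast to the Huber loss, whose derivative stays at the constant level ±k beyond the threshold, bisquare_psi redescends to zero, which is what makes the estimator discard gross outliers entirely.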
For a given robustness threshold k, the task is then to solve the minimization problem (3) in order to estimate the regression coefficients (β_0, β). For the low-dimensional case (n > p), one can, for example, use the iteratively reweighted least squares (IRLS) method. For high- and ultra-high-dimensional cases, where fewer observations than explanatory variables are available (n < p or even n ≪ p), this is no longer feasible.
2.2 Statistical Boosting
Statistical boosting adapts a concept from machine learning to estimate statistical models for high-dimensional data, incorporating implicit variable selection and shrinkage of effect estimates through an iterative fitting process [32].
The empirical risk function is minimized iteratively by componentwise gradient descent. In each boosting iteration m, the negative gradient of the loss function is first evaluated at the current model fit, yielding the gradient vector with entries u_i^[m] = −∂ρ(y_i − f)/∂f evaluated at f = f^[m−1](x_i), for i = 1, …, n. Afterwards, pre-specified sets of base-learners, one for each covariate, are fitted componentwise one-by-one to the gradient vector. Typically, each base-learner is a simple (e.g., univariate linear) regression function fitted by least squares, and the base-learner that best fits the gradient vector is selected. Finally, the current model is updated with this best-performing base-learner fit,

f^[m](·) = f^[m−1](·) + ν · ĥ^[m](·),

where ĥ^[m] denotes the selected base-learner fit and ν ∈ (0, 1] a small step length (e.g., ν = 0.1).
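The iterative procedure above can be condensed into a minimal sketch for the L_2 loss with univariate linear base-learners (an illustration in Python, not the mboost implementation; for simplicity we assume the columns of X are centred):

```python
import numpy as np

def componentwise_boost(X, y, n_iter=500, nu=0.1):
    """Componentwise gradient boosting with linear base-learners and the L2 loss
    (a minimal sketch of the generic algorithm). Assumes centred columns of X."""
    n, p = X.shape
    beta = np.zeros(p)
    offset = y.mean()                      # initial fit (offset)
    f = np.full(n, offset)
    col_ss = (X**2).sum(axis=0)
    for _ in range(n_iter):
        u = y - f                          # negative gradient of the L2 loss
        coefs = X.T @ u / col_ss           # univariate least-squares fits
        rss = (u**2).sum() - coefs**2 * col_ss   # residual sum of squares per base-learner
        j = int(np.argmin(rss))            # best-performing base-learner
        beta[j] += nu * coefs[j]           # update only the selected component
        f += nu * coefs[j] * X[:, j]
    return offset, beta
```

With a robust loss, only the computation of the pseudo-residuals u changes; the base-learner fitting and the update step remain identical, which is the modularity exploited in Section 2.3.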
In general, statistical boosting can be seen as an alternative to penalized regression methods such as the lasso; similar performance and similar models are to be expected, even though the underlying concepts are very different (for a detailed comparison of both approaches, see Hepp et al. [34]). The lasso directly penalizes the absolute values of the coefficient estimates, leading to shrinkage towards zero and sparse models. Boosting, in contrast, indirectly imposes regularization via early stopping of the algorithm. For the lasso, the main tuning parameter is the shrinkage parameter λ: larger values of λ lead to more strongly regularized effect estimates and sparser models. In comparison to boosting, the lasso is in general faster with respect to computing time. On the other hand, one main advantage of boosting is its modular structure and the resulting flexibility, e.g., in combining different sets of base-learners and loss functions [34], which we also take advantage of in our work. The lasso and a non-sparse robust version are used as a comparison in the simulation study (see Section 3) and for the analysis of the NCI60 panel (see Section 4).
2.3 Adaptive robust loss functions for boosting
We propose a new approach which combines the composite robust loss functions discussed in Section 2.1 with statistical boosting from Section 2.2 in order to obtain robust coefficient estimates and variable selection for high-dimensional regression models. The key to our approach is the development of adaptive loss functions, where the robustness threshold iteratively depends on the current residuals and keeps the proportion of observations in the inner and outer part of the loss function constant.
Due to the modular nature of the boosting algorithm, it is rather straightforward to extend the approach to new loss functions, as they can essentially be plugged into the existing framework (see the appendix of [32] for an example). However, in the case of composite robust loss functions like the Huber and Bisquare losses (see Figure 1), the choice of the robustness threshold k requires special attention. As a result of the iterative fitting process in statistical boosting, the sizes of the residuals vary considerably over the iterations m of the algorithm. If one simply uses the composite robust loss functions with a constant threshold k for all boosting iterations, one faces the challenge of determining the robustness threshold k by scaling the outcome, as discussed in Section 2.1. If, for example, k is initially chosen too small for the Bisquare loss, an uncontrolled amount of observations is considered as outliers, which also means that much information is lost. On the other hand, when the threshold k is chosen too large, almost all residuals are evaluated only by the inner part of the loss function, resulting in non-robust estimates. The same challenge as at the beginning of the algorithm occurs similarly for larger iterations m, when the residuals decrease in size. This is due to the fact that the componentwise gradient descent algorithm leads to a decreasing empirical risk, which tends to result in overall smaller residuals. Hence, the proportion of observations falling into the outer part of the composite loss decreases (see Figure 2). To avoid such undesirable behaviour of gradient boosting, Friedman [35] recommended an adaptive approach for boosted classification trees, robustly re-estimating the standard deviation of the residuals in each iteration of the algorithm. This approach was also followed and adapted in further works [28, 29].
Here, we propose to use adaptive robustness τ-quantiles with a fixed quantile level τ to maintain the initial balance between the two parts of the composite loss function throughout all boosting iterations. Therefore, as illustrated in Figure 2, the loss function with its robustness threshold k is automatically adapted to the sizes of the current residuals via

k^[m] = Q_τ(|r_1^[m]|, …, |r_n^[m]|),   with r_i^[m] = y_i − f^[m](x_i),

where k^[m] denotes the empirical τ-quantile of the absolute values of the current residuals in the m-th boosting iteration.
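A minimal sketch of this quantile-based adaption for the adaptive Huber loss (Python illustration; this is not the internal code of the AdaptHuber() family):

```python
import numpy as np

def adaptive_huber_gradient(y, f, tau=0.821):
    """Negative gradient of the Huber loss with an adaptive threshold k^[m],
    set to the tau-quantile of the current absolute residuals."""
    r = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    k = float(np.quantile(np.abs(r), tau))   # adaptive robustness threshold
    u = np.where(np.abs(r) <= k, r, k * np.sign(r))
    return u, k
```

Since k^[m] is recomputed in every iteration, the fraction of observations treated by the outer, robust part stays fixed at roughly 1 − τ, regardless of how small the residuals become.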
In order to choose a sensible default robustness quantile level τ for statistical boosting with our approach, one may consider concepts from classical M-estimation under non-corrupted Gaussian errors (see Section 2.1). The quantile level τ can hence be determined as the probability mass of the interval [−k, k] under the standard Gaussian cdf Φ. In particular, for the adaptive Huber loss, the popular robustness threshold k = 1.345 used in classical M-estimation (based on 95% efficiency) leads to the quantile level τ ≈ 0.82. For the Bisquare loss, the classical robustness threshold of k = 4.685 (for the same efficiency under Gaussian-distributed errors [31]) would be reflected by a quantile level τ ≈ 1. Note that, by construction (see Figure 1), the Bisquare loss assigns a constant value to all residuals larger than the robustness threshold k. However, for robustness quantile levels τ very close to one, no observations would fall into the outer, robust part of the loss. Thus, in our simulations and the gene expression data application, we only consider robustness quantile levels with τ ≤ 0.99 (e.g., τ ∈ {0.9, 0.95, 0.99}), which guarantees a robust behaviour of the fitting process in the sense that always at least 1% of the observations are evaluated by the robust part of the loss function.
In practical biomedical applications it is rarely known how many observations might be corrupted in some way. However, in situations where a prior estimate of the amount of corruption π is available, this can be considered when choosing a sensible robustness quantile level τ. Note that in our approach the quantile level τ describes the proportion of observations evaluated by the inner part of a loss function (scale-independently), while 1 − τ reflects the proportion evaluated by the outer part of the loss function. In the case of the Bisquare loss, one can consider a direct and interpretable translation τ = 1 − π between the robustness quantile level τ and the assumed amount of corruption π. For the Huber loss, specifying the robustness quantile level τ based on an assumed amount of corruption π is slightly more difficult: in contrast to the Bisquare loss, the current residuals enter the robust outer part of the Huber loss with their actual sizes, not with a fixed value, so a direct translation like τ = 1 − π might not yield the intended robustness. Applying Eq. (4) based on the minimax result of Huber [30], the popular robustness threshold k = 1.345 corresponds to an amount of corruption π of around 5.8%.
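These translations between k, τ and π can be checked numerically under the assumption of standard Gaussian residuals (Python sketch; the function names are ours):

```python
from math import erf, exp, pi as PI, sqrt

def Phi(x):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tau_from_k(k):
    """Quantile level tau = P(|Z| <= k) for standard Gaussian residuals."""
    return 2.0 * Phi(k) - 1.0

def pi_from_k(k):
    """Corruption proportion pi implied by threshold k via Huber's
    minimax relation (Eq. (4)): pi/(1-pi) = 2*phi(k)/k - 2*Phi(-k)."""
    phi_k = exp(-k**2 / 2.0) / sqrt(2.0 * PI)   # standard Gaussian density
    ratio = 2.0 * phi_k / k - 2.0 * Phi(-k)
    return ratio / (1.0 + ratio)
```

For k = 1.345 this yields τ ≈ 0.821 and π ≈ 5.8%, matching the values discussed above.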
In the simulation study (Section 3) and the high-dimensional cancer cell lines application (Section 4) we investigated different choices of the quantile level τ for the adaptive loss functions, considering also different types and amounts of assumed corruption (Section 3). Overall, in the considered settings the results of our approach were not very sensitive to the choice of the quantile level τ.
2.4 Implementation
The new adaptive robust loss functions for statistical boosting are implemented as corresponding family objects for the R add-on package mboost [36]; see [32] for a tutorial. This way, our approach builds on a well-tested implementation, which also includes flexible functionality regarding tuning, prediction and graphical representations of the resulting models and effects. The new loss functions can be specified via the argument family = AdaptHuber() or family = AdaptBisquare() inside the fitting functions of mboost and are available on GitHub (https://github.com/JSpBonn/RobustlossesStatisticalBoosting), as is the code for reproducing the results of Sections 3 and 4.
Inside these family functions, the user can specify one of the three different arguments tau, k or pi for the robustness parameter of the composite loss functions, whose relations are described in Section 2.3. The specified argument is always translated into a robustness quantile level τ ∈ (0, 1). If no argument is specified, the default robustness quantile level τ is around 0.821 for the adaptive Huber loss, which corresponds to the robustness threshold k = 1.345. For the adaptive Bisquare loss, the default robustness quantile level is set to τ = 0.99, which can be interpreted as an approximation to the often-used robustness threshold of k = 4.685 (when assuming non-corrupted Gaussian responses). The offset value for these adaptive robust loss functions is set to the median of the response, which is more robust to outliers than the empirical mean.
As an illustrative example, the linear regression model y = β _{0} + β _{1} x _{1} + β _{2} x _{2} can be fitted with statistical boosting in combination with the adaptive Huber loss using the robustness quantile level τ = 0.8 via:
mod_h <- glmboost(y ~ x1 + x2, data = data.frame(y, x1, x2), family = AdaptHuber(tau = 0.8))
The fitting function glmboost() automatically applies linear base-learners; alternative implementations for non-linear spline base-learners are also available [32]. Furthermore, different tuning methods for the stopping iteration m_stop can be generically combined with the adaptive robust loss functions.
3 Simulations
The aim of our simulation study was to investigate the predictive performance of statistical boosting in combination with the proposed adaptive robust loss functions, also in comparison to classical estimation methods for robust regression, boosting with the classical L_1 and L_2 losses and lasso approaches. In particular, we examined the potential benefits of using adaptive robust loss functions when parts of the data are corrupted, as well as potential limitations when they are applied in uncorrupted data situations. Additionally, we investigated the variable selection properties of the considered boosting and lasso approaches.
3.1 Settings
We considered a linear regression setting with p ∈ {50, 500, 15000} explanatory variables, of which only a small subset of 6 variables was truly informative; the true coefficients of all remaining variables were set to zero. To generate training data of size n, we simulated the covariates and computed the outcome according to the linear model, with corruption added depending on the setting. To compare the prediction performance of the different methods, for each simulation replicate we generated n_test = 1,000 uncorrupted test observations by the same process and evaluated the mean absolute error of prediction (MAEP),

MAEP = (1/n_test) Σ_{i=1}^{n_test} |y_i − ŷ_i|.
To investigate the effect of the robustness quantile level τ for the adaptive Huber loss, we specified quantile levels corresponding to the efficiency criterion limits for the robustness thresholds k = 1 and k = 2 (see Eq. (4)). In addition, we also examined the popular and established threshold value of k = 1.345 [31]. The considered thresholds correspond to the robustness quantile levels τ = 0.683 (k = 1), τ = 0.821 (k = 1.345) and τ = 0.954 (k = 2). For the Bisquare loss, besides the robustness quantile level τ = 0.99, which approximately corresponds to the established threshold k = 4.685, we also considered the robustness quantile levels τ = 0.95 and τ = 0.90. In practice, this can be interpreted as an assumed amount of corruption π of 1%, 5% and 10% of the outcome data, respectively, as discussed in Section 2.3. For all settings, we fitted gradient boosting models with the adaptive robust loss functions as well as with the L_2 and L_1 losses, using a 25-fold bootstrap approach to tune the stopping iteration m_stop for each boosting model on the training data. Thereby, the training data are sampled with replacement; the observations inside each bootstrap sample are used to fit the model on a grid of iterations up to a given maximal iteration, and the remaining observations that have not been sampled are used for evaluating the predictive risk. The best stopping iteration is chosen as the iteration that minimizes the mean predictive risk over the 25 bootstrap samples [32]. To put the results into perspective, for the low-dimensional settings we also computed the classical least squares estimator, as well as the classical robust M-estimators for the Huber and Bisquare losses. As a comparison, in all simulations the lasso was computed using cv.glmnet() (package glmnet), as well as a robust non-sparse version of the lasso specified via rq(model, method = "lasso") (package quantreg).
For componentwise gradient boosting we used the implementation provided in the R add-on package mboost, while for the classical estimation approaches we applied the lm() and rlm() (package MASS) functions.
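The described bootstrap tuning of m_stop (sampling with replacement, evaluating the predictive risk on the unsampled observations) can be sketched generically (Python illustration; the fit_path interface is our assumption for the sketch and does not correspond to the mboost API):

```python
import numpy as np

def tune_mstop(X, y, fit_path, n_boot=25, m_max=100, seed=1):
    """Bootstrap tuning of the stopping iteration: fit on each bootstrap sample,
    evaluate the predictive risk on the out-of-bootstrap observations and return
    the iteration minimising the mean risk. fit_path(X_train, y_train, m_max) is
    a hypothetical interface returning a predictor (X_new, m) -> y_hat."""
    rng = np.random.default_rng(seed)
    n = len(y)
    risk = np.zeros(m_max + 1)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # observations not drawn
        predict = fit_path(X[idx], y[idx], m_max)
        for m in range(m_max + 1):
            risk[m] += np.mean((y[oob] - predict(X[oob], m))**2)
    return int(np.argmin(risk))
```

In mboost, the analogous functionality is provided by its built-in resampling tools; the sketch only illustrates the selection criterion.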
3.2 Prediction accuracy
The average prediction performance of the considered methods in terms of the MAEP is provided in Table 1 for all settings. Boxplots of the MAEPs for the specific scenario with a corruption rate of 15% are shown in Figure 4.
Type of corruption                –              Non-systematic                 Systematic
Amount of corruption              0%             7.5%           15%            7.5%           15%

Low-dimensional
LS/M-estimation:
  L_2 loss                        0.926 (0.034)  1.055 (0.074)  1.172 (0.092)  1.427 (0.119)  1.795 (0.173)
  Huber loss                      0.935 (0.037)  0.965 (0.043)  1.014 (0.054)  0.992 (0.052)  1.118 (0.093)
  Bisquare loss                   0.940 (0.039)  0.960 (0.042)  1.003 (0.053)  0.942 (0.039)  0.959 (0.042)
Lasso:
  L_2 loss                        0.843 (0.027)  0.893 (0.045)  0.941 (0.060)  1.042 (0.080)  1.194 (0.124)
  L_1 loss                        0.859 (0.032)  0.869 (0.036)  0.881 (0.038)  0.872 (0.037)  0.891 (0.041)
Classic boosting:
  L_2 loss                        0.841 (0.026)  0.890 (0.045)  0.937 (0.059)  1.036 (0.075)  1.190 (0.124)
  L_1 loss                        0.866 (0.033)  0.878 (0.039)  0.893 (0.043)  0.883 (0.041)  0.905 (0.046)
Boosting Huber:
  τ = 0.954 (k = 2)               0.842 (0.027)  0.860 (0.032)  0.903 (0.048)  1.022 (0.071)  1.188 (0.121)
  τ = 0.821 (k = 1.345, def.)     0.845 (0.027)  0.855 (0.030)  0.871 (0.034)  0.862 (0.033)  0.907 (0.050)
  τ = 0.683 (k = 1)               0.850 (0.028)  0.860 (0.031)  0.873 (0.033)  0.867 (0.034)  0.888 (0.038)
Boosting Bisquare:
  τ = 0.99 (def.)                 0.864 (0.033)  0.854 (0.030)  0.873 (0.036)  0.860 (0.034)  0.914 (0.058)
  τ = 0.95                        0.890 (0.040)  0.875 (0.038)  0.869 (0.035)  0.850 (0.029)  0.877 (0.041)
  τ = 0.90                        0.924 (0.047)  0.912 (0.048)  0.898 (0.044)  0.884 (0.040)  0.861 (0.032)

High-dimensional
Lasso:
  L_2 loss                        0.839 (0.023)  0.887 (0.035)  0.930 (0.046)  1.024 (0.059)  1.161 (0.087)
  L_1 loss                        0.861 (0.026)  0.871 (0.028)  0.881 (0.030)  0.875 (0.029)  0.892 (0.033)
Classic boosting:
  L_2 loss                        0.837 (0.023)  0.883 (0.034)  0.925 (0.044)  1.036 (0.075)  1.190 (0.124)
  L_1 loss                        0.860 (0.027)  0.871 (0.030)  0.881 (0.033)  0.883 (0.041)  0.905 (0.046)
Boosting Huber:
  τ = 0.954 (k = 2)               0.839 (0.023)  0.857 (0.027)  0.896 (0.036)  1.022 (0.071)  1.188 (0.121)
  τ = 0.821 (k = 1.345, def.)     0.843 (0.024)  0.857 (0.027)  0.870 (0.029)  0.862 (0.033)  0.907 (0.050)
  τ = 0.683 (k = 1)               0.877 (0.041)  0.895 (0.043)  0.902 (0.040)  0.867 (0.034)  0.888 (0.038)
Boosting Bisquare:
  τ = 0.99 (def.)                 0.857 (0.026)  0.850 (0.026)  0.865 (0.029)  0.860 (0.034)  0.914 (0.058)
  τ = 0.95                        0.885 (0.032)  0.869 (0.030)  0.860 (0.029)  0.850 (0.029)  0.877 (0.041)
  τ = 0.90                        0.922 (0.039)  0.906 (0.037)  0.889 (0.035)  0.884 (0.040)  0.861 (0.032)

Ultra-high-dimensional
Lasso:
  L_2 loss                        1.020 (0.057)  1.184 (0.098)  1.311 (0.121)  1.550 (0.125)  1.701 (0.110)
  L_1 loss                        1.089 (0.073)  1.134 (0.083)  1.192 (0.099)  1.169 (0.096)  1.289 (0.136)
Classic boosting:
  L_2 loss                        1.022 (0.058)  1.177 (0.096)  1.294 (0.112)  1.505 (0.103)  1.654 (0.084)
  L_1 loss                        1.194 (0.091)  1.247 (0.103)  1.306 (0.114)  1.291 (0.116)  1.399 (0.136)
Boosting Huber:
  τ = 0.954 (k = 2)               1.028 (0.060)  1.105 (0.077)  1.220 (0.103)  1.467 (0.099)  1.650 (0.082)
  τ = 0.821 (k = 1.345, def.)     1.041 (0.062)  1.092 (0.075)  1.159 (0.092)  1.150 (0.095)  1.394 (0.156)
  τ = 0.683 (k = 1)               1.058 (0.066)  1.105 (0.077)  1.161 (0.092)  1.147 (0.092)  1.278 (0.132)
Boosting Bisquare:
  τ = 0.99 (def.)                 1.116 (0.074)  1.097 (0.079)  1.167 (0.094)  1.164 (0.110)  1.421 (0.157)
  τ = 0.95                        1.258 (0.097)  1.223 (0.112)  1.218 (0.120)  1.076 (0.079)  1.271 (0.147)
  τ = 0.90                        1.431 (0.122)  1.407 (0.129)  1.397 (0.143)  1.220 (0.100)  1.157 (0.110)
For the low-dimensional setting (with p = 50 and n = 200) with non-systematic corruption, the boosting models and the lasso approaches tended to yield a better prediction accuracy than the classical LS or M-estimation approaches. As expected, boosting and the lasso behaved very similarly for both the L_1 and the L_2 loss. For the low-dimensional setting with systematic corruption, only the Bisquare loss with classical M-estimation provided a prediction accuracy comparable to the robust boosting approaches. Even though the adaptive Huber loss with robustness quantile level τ = 0.954 (corresponding to the robustness threshold k = 2) led to the highest MAEP among the adaptive robust boosting losses for systematic corruption, its prediction accuracy for non-corrupted data was comparable to the L_2 boosting approach and even better for non-systematic corruption. Except for 15% systematic corruption, boosting with the adaptive Huber loss using the default robustness quantile level τ = 0.821 performed best or close to the best methods in each setting. In situations with corrupted data, at least one choice of the robustness quantile level for the adaptive Bisquare loss yielded the best performance of all methods; e.g., the Bisquare loss with robustness quantile level τ = 0.95 performed best for 15% non-systematic and 7.5% systematic corruption.
Similar results were obtained in the high-dimensional settings (with p = 500 and n = 400). The Huber loss with the default robustness quantile level performed at least third best regarding prediction performance in all high-dimensional corrupted settings except for 15% systematic corruption, where the Bisquare loss with robustness quantile level τ = 0.90 yielded the lowest average MAEP. The L _{1} loss for boosting and the lasso resulted in slightly worse predictions than the best approaches of each setting. As expected, the original L _{2} boosting approach performed best for non-corrupted data.
In the ultrahigh-dimensional setting (with p = 15,000 and n = 200), all methods responded to the different amounts and kinds of corruption in the same manner as in the other settings, while generally yielding a lower overall prediction accuracy than in the scenarios with smaller p.
In general, we observed a competitive prediction performance for all boosting loss functions except for the L _{2} loss and the Huber loss with k = 2 (τ = 0.954) in cases with larger amounts of corrupted observations, particularly under systematic corruption. The lasso led to a similar prediction accuracy as boosting in all settings for both considered loss functions. The sensitivity analysis with a Toeplitz correlation structure and non-symmetric corruption also led to overall similar results (cf. Online supplement, Tables S3 and S7).
3.3 Variable selection properties
To investigate the variable selection properties of boosting in combination with adaptive robust loss functions, we evaluated the true positive rate (sensitivity) for the 6 informative variables and the true negative rate (specificity) for the p − 6 non-informative variables in all settings (summarized in Table 2).
Type of corruption  –  Nonsystematic  Systematic  

Amount of corruption  0%  7.5%  15%  7.5%  15%  
Low-dimensional  Lasso:  
L _{2} loss  1.00/0.72  0.99/0.73  0.97/0.73  0.91/0.75  0.82/0.77  
L _{1} loss  1.00/0.00  1.00/0.00  1.00/0.00  1.00/0.00  1.00/0.00  
Classic boosting:  
L _{2} loss  1.00/0.70  0.99/0.74  0.97/0.76  0.91/0.79  0.79/0.84  
L _{1} loss  1.00/0.67  1.00/0.68  0.99/0.68  1.00/0.68  0.99/0.69  
Boosting Huber:  
τ = 0.954 (k = 2)  1.00/0.49  1.00/0.55  0.99/0.60  0.93/0.64  0.80/0.72  
τ = 0.821 (k = 1.345, def.)  1.00/0.66  1.00/0.65  1.00/0.66  1.00/0.64  0.99/0.74  
τ = 0.683 (k = 1)  1.00/0.59  1.00/0.59  1.00/0.59  1.00/0.56  1.00/0.61  
Boosting Bisquare:  
τ = 0.99 (def.)  1.00/0.67  1.00/0.69  1.00/0.72  1.00/0.78  0.98/0.86  
τ = 0.95  1.00/0.60  1.00/0.66  1.00/0.67  1.00/0.68  1.00/0.79  
τ = 0.90  0.99/0.49  0.99/0.55  1.00/0.60  1.00/0.64  1.00/0.72  
High-dimensional  Lasso:  
L _{2} loss  1.00/0.94  1.00/0.94  0.98/0.94  0.91/0.95  0.80/0.96  
L _{1} loss  1.00/0.00  1.00/0.00  1.00/0.00  1.00/0.00  1.00/0.00  
Classic boosting:  
L _{2} loss  1.00/0.94  1.00/0.95  0.98/0.95  0.90/0.96  0.78/0.97  
L _{1} loss  1.00/0.92  1.00/0.92  1.00/0.92  1.00/0.92  1.00/0.92  
Boosting Huber:  
τ = 0.954 (k = 2)  1.00/0.82  1.00/0.86  1.00/0.89  0.94/0.90  0.79/0.95  
τ = 0.821 (k = 1.345, def.)  1.00/0.90  1.00/0.89  1.00/0.89  1.00/0.88  1.00/0.93  
τ = 0.683 (k = 1)  1.00/0.78  1.00/0.77  1.00/0.79  1.00/0.75  1.00/0.84  
Boosting Bisquare:  
τ = 0.99 (def.)  1.00/0.92  1.00/0.94  1.00/0.95  1.00/0.96  0.99/0.98  
τ = 0.95  1.00/0.88  1.00/0.91  1.00/0.93  1.00/0.94  1.00/0.97  
τ = 0.90  1.00/0.82  1.00/0.86  1.00/0.89  1.00/0.90  1.00/0.95  
Ultrahigh-dimensional  Lasso:  
L _{2} loss  0.94/1.00  0.79/1.00  0.69/1.00  0.45/1.00  0.16/1.00  
L _{1} loss  1.00/0.00  1.00/0.00  1.00/0.00  1.00/0.00  1.00/0.00  
Classic boosting:  
L _{2} loss  0.95/1.00  0.80/1.00  0.71/1.00  0.52/1.00  0.22/1.00  
L _{1} loss  0.88/0.99  0.84/0.99  0.79/0.99  0.80/1.00  0.70/0.99  
Boosting Huber:  
τ = 0.954 (k = 2)  0.94/1.00  0.86/1.00  0.77/1.00  0.57/1.00  0.23/1.00  
τ = 0.821 (k = 1.345, def.)  0.94/1.00  0.88/1.00  0.82/1.00  0.83/1.00  0.62/1.00  
τ = 0.683 (k = 1)  0.93/0.99  0.89/0.99  0.82/0.99  0.85/0.99  0.74/1.00  
Boosting Bisquare:  
τ = 0.99 (def.)  0.91/0.99  0.88/1.00  0.80/1.00  0.79/1.00  0.54/1.00  
τ = 0.95  0.83/0.99  0.84/0.99  0.80/0.99  0.89/1.00  0.68/1.00  
τ = 0.90  0.72/0.98  0.74/0.99  0.71/0.99  0.84/0.99  0.80/1.00 
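As an illustration of how the rates reported in Table 2 are obtained, the following is a minimal sketch (illustrative Python with a hypothetical helper name, not the original simulation code, which was written in R):

```python
# Sketch: sensitivity (TPR) and specificity (TNR) of variable selection,
# computed from the set of selected and the set of truly informative variables.

def selection_rates(selected, informative, p):
    """selected, informative: sets of variable indices; p: total number of variables."""
    tp = len(selected & informative)   # informative variables that were selected
    fp = len(selected - informative)   # noise variables that were (falsely) selected
    n_noise = p - len(informative)     # e.g. the p - 6 non-informative variables
    tpr = tp / len(informative)        # true positive rate (sensitivity)
    tnr = (n_noise - fp) / n_noise     # true negative rate (specificity)
    return tpr, tnr

# Example: 6 informative variables among p = 50; the model selects
# 5 of them plus 10 noise variables.
tpr, tnr = selection_rates(set(range(5)) | set(range(6, 16)), set(range(6)), 50)
print(round(tpr, 2), round(tnr, 2))  # 0.83 0.77
```

In the simulation study these rates were averaged over the replications of each setting.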
The mean true positive rates of all robust boosting approaches in the high-dimensional setting were 1.00, except for the Bisquare loss with τ = 0.99 under 15% systematic corruption (0.99) and the Huber loss with τ = 0.954 under systematic corruption, which performed similarly to the L _{2} loss for boosting and the lasso. The same held for the low-dimensional setting, where all other robust approaches reached a true positive rate of at least 0.99. Boosting with the L _{2} loss also started at 1.00 for non-corrupted data, but with rising amounts of corruption dropped to 0.97 for non-systematically and to 0.79 for 15% systematically corrupted data.
The true negative rate generally differs between the low- and high-dimensional settings, since each selected variable in the low-dimensional setting has a larger impact on the mean proportion of (falsely) selected variables, resulting in generally lower rates for the low-dimensional setting. For 0% corruption, boosting with the L _{2} loss led to a mean true negative rate of 0.70, while the mean true negative rates of the robust approaches ranged between 0.49 and 0.67, with higher rates for the default Huber (0.66) and Bisquare (0.67) losses. For corrupted data, the L _{2} loss still yielded at least the second highest rate, with higher rates for larger amounts of corruption and for systematic corruption. The same held for most robust loss functions; boosting with the Bisquare loss using the robustness quantile level τ = 0.99 yielded a mean true negative rate of 0.86 for 15% systematic corruption, the highest in this setting.
The high-dimensional settings showed generally similar results regarding the true negative rate. The standard L _{2} boosting approach yielded a mean true negative rate of at least 0.94, which was close to the highest in all settings. Besides the non-adaptive L _{1} approach, the adaptive Bisquare loss with the default robustness quantile level as well as with quantile level τ = 0.95 also resulted in quite high true negative rates, with means ranging from 0.92 to 0.98 and from 0.88 to 0.97, respectively. Lower robustness quantile levels resulted in a higher variability across settings, with a minimum mean true negative rate of 0.75 (Huber loss with k = 1).
Due to the large number of parameters (p = 15,000) in the ultrahigh-dimensional setting, the mean true negative rates were rather close to 1. For the non-robust L _{2} loss, the true positive rate decreased drastically with increasing corruption, from 0.94 for non-corrupted data to 0.16 for 15% systematic corruption for the lasso, and from 0.95 to 0.22 for boosting. The least robust Huber loss (k = 2) still performed comparably well for non-systematic corruption, but also ended with a mean true positive rate of only 0.23 for 15% systematic corruption. In contrast, the more robust Huber loss with the default quantile level or τ = 0.683 performed well for both kinds of corruption, as did the Bisquare loss, especially for systematic corruption.
In comparison to the other losses, boosting with the L _{2} loss resulted in a higher mean true negative rate for the same simulation settings. This is due to a relatively early stopping of the algorithm in corrupted data situations, as the tuning of m _{stop} via resampling on the corrupted training data detects overfitting earlier. The classical lasso with the L _{2} loss again behaved very similarly to the L _{2} boosting approach for each kind and amount of corruption in the low-, high- and ultrahigh-dimensional settings, respectively. Regarding variable selection, the L _{1} lasso was not comparable due to its non-sparse model fits. The positive and negative predictive values for the simulations can be found in Table S1 in the supplement, where also the results of the sensitivity analyses are reported (see Tables S4, S5, S8 and S9); these led to results overall comparable to the original scenarios.
3.4 Interpretation of simulation results
Our simulations indicate that the LS/M-estimators tend to overfit the training data even in the relatively low-dimensional setting with p = 50 and n = 200, as no regularization or variable selection is incorporated. As expected, this can already be observed in settings without any corrupted observations.
This is not the case for the considered boosting approaches, where particularly the standard L _{2} loss resulted in a high mean true negative rate for boosting and the lasso. Even though this is a major advantage for non-corrupted data, it also led to lower mean true positive rates, particularly in systematically corrupted settings, in comparison to all robust boosting approaches. Additionally, the quadratic influence of the L _{2} loss and its sensitivity to (systematic) corruption led to higher MAEPs, whereas the Bisquare loss still worked quite well for larger amounts of corrupted observations and both types of corruption. A plausible reason for this is the bounded structure of this loss function, which drastically reduces the influence of large outliers.
For such cases with corrupted data, the performance of the adaptive robust loss functions (Huber and Bisquare) within the boosting approach is very promising. Competitive results were obtained with the default robustness quantile levels of the loss functions, which performed robustly under all considered scenarios. However, prior knowledge about the true amount of corrupted observations can be advantageous, as the robustness quantile level may be specified accordingly: one may choose a lower robustness quantile level τ for the composite robust losses when the amount of potentially corrupted observations is high. In particular, the adaptive Huber loss function performed best for non-systematic corruption, but could lead to a higher MAEP in settings where the chosen robustness quantile level τ deviates too much from the underlying amount of systematic corruption (e.g., Huber with k = 2, τ = 0.954 under 15% systematic corruption), so that even the Huber loss turned out to be not robust enough anymore.
The adaptive Bisquare loss performed well for different types of corruption, particularly for systematic corruption and higher amounts of corrupted observations. Using the default robustness quantile level (τ = 0.99) is a reasonable choice if one has no information on outliers or the amount of corruption. If the proportion of corrupted observations was high, lowering the robustness quantile level τ to 0.95 or even 0.90 still led to a very competitive prediction performance and increased the true positive rate, while only slightly reducing the true negative rate.
For the adaptive Huber loss, however, in practical situations without information on how many observations could be outliers, we recommend keeping the default robustness quantile level τ, even if that means underestimating the amount of corruption in some situations. When using the adaptive Bisquare loss in boosting, the advantages are particularly evident for lower quantile levels (e.g., τ = 0.95), although for uncorrupted data this could lead to a lower prediction accuracy compared to the L _{2} loss.
All considered sensitivity analyses with different correlation structures or non-symmetric corruption (see online supplement) led to results overall similar to the scenarios presented in the main manuscript.
In practical prediction modelling, robust loss functions could at least serve for sensitivity analyses, even when the data are not assumed to be contaminated. Particularly in our considered high-dimensional and ultrahigh-dimensional data situations, boosting in combination with the adaptive robust loss functions always yielded a reasonable performance regarding prediction accuracy and the selection of informative variables, regardless of whether or how strongly the outcome was corrupted.
4 Protein expression data of the NCI60 cancer cell panel
As a result of computational advances and the establishment of large cohort studies, many (ultra)high-dimensional data applications can nowadays be found in the areas of genomics and proteomics [37, 38]. In this context, sparser and robust models can enhance interpretability and point to a subset of variants, genes or proteins for further investigation.
To illustrate our approach, we considered an NCI60 data set with 59 human cancer cell lines (cf. [39]), which is publicly available at https://discover.nci.nih.gov/cellminer/. Data based on this cancer cell panel have already been analyzed using various alternative robust approaches [40, 41], machine learning algorithms [42, 43] and variable selection methods [44, 45]. As potential predictors we chose the gene expression measurements from the Affymetrix HGU133AB chip, normalized using the guanine-cytosine robust multi-array analysis.
In case of multiple measurements of the same gene for one sample, we considered the median gene expression measurement, leading to p = 14,951 explanatory variables. As outcome variable we considered the original KRT19 protein antibody array of the "Protein: Antibody Array DTB" set [39]. KRT19 has the largest kurtosis in comparison to the other available protein expression data [39], which makes it a particularly interesting data example for robust regression approaches.
As a measure of predictive accuracy, we computed the mean absolute error of prediction (MAEP) via leave-one-out cross-validation for the n = 59 observations, successively using 58 observations as training data and the omitted observation as test data. This results in fitting 59 models, each of which predicts one observation and thereby generates one absolute error. Additionally, we compared the selection frequencies of genes across the boosting families presented in Section 2.3 using different robustness quantile levels τ. As base-learners we incorporated separate linear models for each of the p = 14,951 candidate genes. Tuning of the stopping iteration was performed in the same way as in Section 3 via 25-fold bootstrapping. More details and reproducible code can be found on GitHub (https://github.com/JSpBonn/RobustlossesStatisticalBoosting). Besides the comparison to the same lasso approaches as in Section 3, we used sparseLTS() (sparse least trimmed squares regression, [40]) and rgrplars() (robust groupwise least angle regression, [46]) as additional robust methods for comparison. These functions were applied with the default settings available in the R package robustHD [47].
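The leave-one-out scheme described above can be sketched as follows; this is an illustrative Python outline with a hypothetical `fit`/`predict` interface, whereas the actual analysis was carried out in R (mboost):

```python
# Sketch of leave-one-out cross-validation for the MAEP:
# each observation is predicted once by a model trained on the remaining n - 1
# observations (hypothetical fit/predict interface for illustration).

def loocv_mae(X, y, fit, predict):
    n = len(y)
    abs_errors = []
    for i in range(n):
        train = [j for j in range(n) if j != i]           # all but observation i
        model = fit([X[j] for j in train], [y[j] for j in train])
        abs_errors.append(abs(predict(model, X[i]) - y[i]))
    return sum(abs_errors) / n                            # mean absolute error of prediction

# Toy usage with a "predict the training mean" model:
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 4.0]
fit = lambda X_tr, y_tr: sum(y_tr) / len(y_tr)            # "model" = training mean
predict = lambda model, x: model
print(loocv_mae(X, y, fit, predict))
```

For the NCI60 application, `fit` would be replaced by the boosting fit (including the bootstrap tuning of the stopping iteration on the 58 training observations).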
Taking the large number of predictor variables into account, only rather small subsets of genes were selected by the componentwise gradient boosting algorithms (Figure 5), resulting in sparse models. Comparing the average model sizes for the different loss functions in Table 3, classical boosting with the L _{2} and L _{1} losses already resulted in sparse models containing on average 6.83 and 8.93 genes, respectively. Even sparser models resulted from the quantile-based adaptive Huber and Bisquare loss functions, with a maximum average of 2.95 selected variables for the Bisquare loss with robustness quantile level τ = 0.90. Here, a larger robustness quantile level τ led to sparser and more interpretable models.
Method  MAE (SD)  Mean Size (SD)  Mean Time (SD) 

Lasso:  
L _{2} loss  33.82 (51.91)  2.36 (2.88)  1.03 (0.05) 
L _{1} loss  35.99 (47.81)  14,951.00 (0.00)  8509.11 (1475.78) 
Classic boosting:  
L _{2} loss  36.32 (35.13)  6.83 (5.12)  20.14 (0.90) 
L _{1} loss  27.93 (51.38)  8.93 (2.84)  20.17 (1.14) 
Boosting Huber:  
τ = 0.954 (k = 2)  28.34 (50.00)  1.69 (0.84)  23.83 (3.20) 
τ = 0.821 (k = 1.345, def.)  26.98 (48.61)  1.88 (1.20)  21.50 (0.72) 
τ = 0.683 (k = 1)  27.69 (49.92)  2.46 (3.54)  39.76 (1.73) 
Boosting Bisquare:  
τ = 0.99 (def.)  28.61 (50.72)  0.85 (0.69)  22.20 (2.46) 
τ = 0.95  28.17 (51.32)  1.47 (0.60)  21.77 (2.13) 
τ = 0.90  28.45 (51.47)  2.95 (1.88)  21.77 (1.85) 
Robust competitors:  
rgrplars()  28.08 (51.56)  3.78 (1.49)  237.6 (209.79) 
sparseLTS()  28.27 (48.88)  42.34 (0.96)  13410.24 (120.29) 
While there were differences between the sets of variables selected for different robustness quantile levels τ, there was also an overlap within each adaptive robust loss function when comparing different robustness quantile levels. For example, for all chosen robustness quantile levels τ, the Huber loss frequently selected the KRT19 gene expression variable as well as EPS8L2, which was also the case for the Bisquare loss with τ = 0.99. Additionally, ARHGEF5 was often selected in the on average slightly larger Huber models with the default robustness threshold and with k = 1. For the Bisquare losses with robustness quantile levels τ ∈ {0.95, 0.90}, the MORC4 gene had the highest overlap.
Considering the ultrahigh-dimensionality of this application, the variable selection results appeared to be rather insensitive to the choice of the robustness quantile level. Comparable findings could be observed regarding the prediction performance. The adaptive robust loss functions led to a similar range of MAE values, with the Huber loss with the default robustness quantile level τ yielding the lowest prediction error (MAE = 26.98). In this example, the L _{2} loss was outperformed by all other considered loss functions, as were the lasso approaches (the L _{1} lasso may overfit the training data, as it does not yield sparse model fits). The predictions of the competing robust penalized regression approaches (sparse least trimmed squares regression, groupwise least angle regression) were within the range of the robust boosting approaches. The default robustness quantile level for the adaptive Huber loss also performed well for this skewed outcome, but in such cases with unknown, possibly non-symmetric error distributions it might also be beneficial to choose a lower robustness quantile level τ. The running times within the boosting context were quite comparable, while the classical L _{2} lasso was faster. The L _{1} lasso required considerably more time due to the high number of covariates, as did the sparseLTS() function.
In summary, the different adaptive robust loss functions tended to select different variables for their prediction models; however, the final models were comparably sparse and internally consistent regarding the selection of the robustness quantile level τ.
5 Discussion
This article proposed adaptive robust loss functions for fitting statistical models with componentwise gradient boosting, based on a fixed quantile level to separate the two parts of composite losses. We reported promising results regarding the robustness against outliers in the response variable, providing evidence that our approach allows selecting and estimating robust prediction models even in the context of high-dimensional (p > n) data. In a simulation study with low-, high- and ultrahigh-dimensional settings, we were able to show that the adaptive Huber and Bisquare losses are advantageous in regression situations where the outcome contains outliers or is affected by specific types of corruption. Additionally, for non-corrupted observations, they also led to a similar model performance compared to boosting with the efficient L _{2} loss. For non-systematic corruption (see Figure 3), the adaptive Huber loss showed the best prediction accuracy, whereas the adaptive Bisquare loss was favorable in cases with systematic corruption. The ability of boosting to select sparse and interpretable models for high-dimensional data is also preserved.
In the analysis of cancer cell lines in Section 4, representing an ultrahigh-dimensional setting with a skewed response, the algorithm selected very sparse models with the proposed robust losses, while leading to more accurate predictions compared with the standard loss functions for boosting and also in comparison to the lasso.
An earlier gradient boosting approach for the Huber loss by Lutz et al. [28] focused on robustified (linear) base-learners, effectively correcting for potentially high leverage points in the explanatory variables, while iteratively applying a standardisation rule (scaling) for the residuals. Very recent work [29] also considered robust gradient boosting for linear regression. In contrast to our approach, a two-step procedure was followed: an initial non-parametric robust scale estimate based on regression trees is employed before an M-type estimate is updated in each iteration.
In contrast to these previous approaches, using the quantile level τ as presented here does not rely on an additional scale estimation for the residuals or auxiliary scale tuning in every iteration. Adapting k ^{[m]} over the iterations 1, …, m _{stop} while fixing the quantile level τ seems to be a simple but natural approach, since for a given application the amount of corruption in the response does not change during the modelling process. Hence, in our approach the proportion of observations falling into the inner or outer part of the composite robust losses also stays the same.
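A minimal sketch of this quantile-based adaptation (in Python for illustration only; the actual implementation is an R/mboost family, and the function name and quantile rule here are simplifications): in each boosting iteration, the threshold k ^{[m]} is set to the τ-quantile of the current absolute residuals, and the negative gradient of the Huber loss is evaluated with this threshold.

```python
import math

def adaptive_huber_negative_gradient(residuals, tau=0.821):
    """One boosting iteration's working response under the adaptive Huber loss:
    k is re-set to the tau-quantile of the current absolute residuals, so a fixed
    share of observations falls into the down-weighted outer part of the loss."""
    abs_res = sorted(abs(r) for r in residuals)
    # simple nearest-rank empirical tau-quantile of the absolute residuals
    idx = min(len(abs_res) - 1, math.ceil(tau * len(abs_res)) - 1)
    k = abs_res[idx]
    # negative gradient of the Huber loss: r inside [-k, k], k * sign(r) outside
    return k, [r if abs(r) <= k else math.copysign(k, r) for r in residuals]

# Example: with tau = 0.8 and n = 5, k becomes the 4th-smallest absolute
# residual (0.5), and the vertical outlier 8.0 is clipped to 0.5.
k, grad = adaptive_huber_negative_gradient([0.1, -0.5, 0.3, 8.0, -0.2], tau=0.8)
```

The base-learners are then fitted to this working response exactly as in standard componentwise gradient boosting; only the threshold changes from iteration to iteration.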
So far, we have only focused on the combination of adaptive robust loss functions with linear base-learners, leading to robust linear regression models. However, due to the modular structure of the algorithm [9], these new loss functions could also easily be combined with any other type of existing base-learner (cf. [32]). Furthermore, we only addressed robustness with respect to contaminated outcomes (vertical corruption): in future work, our approach could be extended to contaminated covariates (e.g., high leverage points/horizontal outliers) in a similar manner as in [28]. In general, exploiting the modular structure of the boosting algorithm, base-learners with additional penalization (not only ordinary least squares estimates) can be combined with the robust adaptive loss functions to take vertical and horizontal outliers into account within one method. This flexibility regarding extensions is one major advantage of the algorithm in comparison to, e.g., lasso-based methods, which instead tend to achieve better running times while yielding a similar predictive performance for classic loss functions.
An inherited limitation of the gradient boosting approach is the lack of standard errors for effect estimates, making the classical construction of confidence intervals or hypothesis tests problematic. A potential solution to this issue is offered by computational workarounds such as permutation tests [48, 49] or resampling-based confidence intervals [50], which were also investigated for parts of our simulation study (see online supplement, Table S11). Another limitation, this time inherited from the field of robust regression, is that the composite robust loss functions investigated here require an initial choice of the robustness parameter. In our case this is reflected by the robustness quantile level τ, which implicitly adapts to the current residuals throughout the boosting iterations. As the quantile level τ controls the robustness of the final model, it could be specified based on information or assumptions about the amount of outliers or corrupted observations in the response. As this is usually not known in practice, from a computational perspective one could also treat the quantile level τ as an additional tuning parameter and optimize it via resampling or cross-validation, refitting a final model based on the optimized τ value. Furthermore, it might also be of interest to investigate the influence of the adaptive k ^{[m]} on the optimal stopping iteration or the effective learning rate, respectively, and its effect on the resulting model size and regularization.
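This tuning idea could be sketched as follows (illustrative Python with a hypothetical `fit_boost`/`predict` interface and an arbitrary hold-out scheme; not part of the proposed method, which keeps τ fixed):

```python
import random

def tune_tau(X, y, fit_boost, predict, taus=(0.683, 0.821, 0.954),
             n_splits=25, seed=1):
    """Treat the robustness quantile level tau as a tuning parameter:
    pick the tau with the lowest accumulated mean absolute error over random
    train/test splits, then refit on the full data (illustrative sketch only)."""
    rng = random.Random(seed)
    n = len(y)
    errors = {tau: 0.0 for tau in taus}
    for _ in range(n_splits):
        test = set(rng.sample(range(n), k=max(1, n // 4)))   # ~25% hold-out
        train = [i for i in range(n) if i not in test]
        for tau in taus:
            model = fit_boost([X[i] for i in train], [y[i] for i in train], tau=tau)
            errors[tau] += sum(abs(predict(model, X[i]) - y[i]) for i in test) / len(test)
    best_tau = min(taus, key=lambda t: errors[t])
    return best_tau, fit_boost(X, y, tau=best_tau)           # refit on the full data
```

Note that such a search multiplies the computational cost by the grid size and the number of resampling splits, which is one reason we recommend the fixed default levels instead.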
However, the results of our simulations and the proteinexpression application suggest that all considered τ values for the adaptive Huber and Bisquare losses worked reasonably well regarding prediction performance and variable selection. We hence do not propose an additional tuning of τ but recommend using the default fixed quantile levels for the Huber (τ = 0.821) and Bisquare (τ = 0.99) loss in the absence of knowledge about outliers or corrupted observations.
Future research perspectives in this context include the combination with automated outlier detection methods to specify the robustness quantile level τ, e.g. for the Bisquare loss. Another potential extension of our approach is to combine componentwise gradient boosting with M-quantile regression, an extension of quantile and expectile regression using robust M-estimation to generalize median and mean regression, respectively. Here, the Huber loss function can be employed to estimate robust models for specific quantiles of the response distribution, leading to non-symmetric loss functions (cf. [51]). An extension of robust boosting methods towards other types of regression models, e.g. distributional regression [52, 53], might also be worth investigating (cf. [54]).
In terms of practical prediction modelling for biomedical applications, it generally makes sense to check various modelling options, including robust approaches such as the ones proposed in this work, and to compare their results with respect to the objectives of the analysis.
Particularly for high-dimensional data with potentially corrupted outcomes, statistical boosting with quantile-based adaptive losses is able to estimate and select interpretable prediction models that are both robust and sparse while yielding a competitive prediction accuracy.
Acknowledgement
The authors gratefully acknowledge fruitful discussions with Benjamin Hofner on the methodological concepts of this article.

Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

Research funding: None declared.

Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
References
1. Barrios, EB. Robustness, data analysis, and statistical modeling: the first 50 years and beyond. Commun Stat Appl Methods 2015;22:543–56. https://doi.org/10.5351/csam.2015.22.6.543.Search in Google Scholar
2. Huber, PJ. Robust statistics. New York: John Wiley & Sons; 1981.10.1002/0471725250Search in Google Scholar
3. Maronna, RA, Martin, RD, Yohai, VJ, SalibiánBarrera, M. Robust statistics: Theory and methods (with R), 2nd ed. New York: John Wiley & Sons; 2019.10.1002/9781119214656Search in Google Scholar
4. Susanti, Y, Pratiwi, H, Sulistijowati, S, Liana, T. M estimation, S estimation, and MM estimation in robust regression. Int J Pure Appl Math 2014;91:349–60. https://doi.org/10.12732/ijpam.v91i3.7.Search in Google Scholar
5. Fan, J, Lv, J. A selective overview of variable selection in high dimensional feature space. Stat Sin 2010;20:101–48.Search in Google Scholar
6. Hoerl, AE, Kennard, RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970;12:55–67. https://doi.org/10.1080/00401706.1970.10488634.Search in Google Scholar
7. Tibshirani, R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B 1996;58:267–88. https://doi.org/10.1111/j.25176161.1996.tb02080.x.Search in Google Scholar
8. Bühlmann, P, Hothorn, T. Boosting algorithms: regularization, prediction and model fitting. Stat Sci 2007;22:477–505. https://doi.org/10.1214/07sts242.Search in Google Scholar
9. Mayr, A, Hofner, B. Boosting for statistical modelling: a nontechnical introduction. Stat Model Int J 2018;18:365–84. https://doi.org/10.1177/1471082x17748086.Search in Google Scholar
10. Mayr, A, Hofner, B, Schmid, M. The importance of knowing when to stop. Methods Inf Med 2012;51:178–86. https://doi.org/10.3414/ME11020030.Search in Google Scholar PubMed
11. Schmid, M, Hothorn, T. Boosting additive models using componentwise Psplines. Comput Stat Data Anal 2008;53:298–311. https://doi.org/10.1016/j.csda.2008.09.009.Search in Google Scholar
12. Griesbach, C, Säfken, B, Waldmann, E. Gradient boosting for linear mixed models. Int J Biostat 2021;17:317–29. https://doi.org/10.1515/ijb20200136.Search in Google Scholar PubMed
13. Kneib, T, Hothorn, T, Tutz, G. Variable selection and model choice in geoadditive regression models. Biometrics 2009;65:626–34. https://doi.org/10.1111/j.15410420.2008.01112.x.Search in Google Scholar PubMed
14. Hepp, T, Zierk, J, Rauh, M, Metzler, M, Mayr, A. Latent class distributional regression for the estimation of nonlinear reference limits from contaminated data sources. BMC Bioinf 2020;21:524. https://doi.org/10.1186/s12859020038533.Search in Google Scholar PubMed PubMed Central
15. Thomas, J, Mayr, A, Bischl, B, Schmid, M, Smith, A, Hofner, B. Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates. Stat Comput 2018;28. https://doi.org/10.1007/s1122201797546.Search in Google Scholar
16. Wang, Z. Robust boosting with truncated loss functions. Electron J Stat 2018;12:599–650. https://doi.org/10.1214/18ejs1404.Search in Google Scholar
17. Fenske, N, Fahrmeir, L, Hothorn, T, Rzehak, P, Höhle, M. Boosting structured additive quantile regression for longitudinal childhood obesity data. Int J Biostat 2013;9:1–18. https://doi.org/10.1515/ijb20120035.Search in Google Scholar PubMed
18. Fenske, N, Kneib, T, Hothorn, T. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. J Am Stat Assoc 2011;106:494–510. https://doi.org/10.1198/jasa.2011.ap09272.Search in Google Scholar
19. Filzmoser, P, Nordhausen, K. Robust linear regression for highdimensional data: an overview. WIREs Comput Stat 2021;13:e1524. https://doi.org/10.1002/wics.1524.Search in Google Scholar
20. Zhao, J, Yan, G, Zhang, Y. Robust estimation and shrinkage in ultrahigh dimensional expectile regression with heavy tails and variance heterogeneity. Stat Pap 2022;63:1–28. https://doi.org/10.1007/s00362021012272.Search in Google Scholar
21. Lugosi, G, Mendelson, S. Mean estimation and regression under heavytailed distributions: a survey. Found Comput Math 2019;19:1145–90. https://doi.org/10.1007/s1020801909427x.Search in Google Scholar
22. Fan, J, Lv, J. Sure independence screening for ultrahigh dimensional feature space. J Roy Stat Soc B 2008;70:849–911. https://doi.org/10.1111/j.14679868.2008.00674.x.Search in Google Scholar PubMed PubMed Central
23. Wang, H. Forward regression for ultrahigh dimensional variable screening. J Am Stat Assoc 2009;104:1512–24. https://doi.org/10.1198/jasa.2008.tm08516.Search in Google Scholar
24. Maronna, RA. Robust ridge regression for highdimensional data. Technometrics 2011;53:44–53. https://doi.org/10.1198/tech.2010.09114.Search in Google Scholar
25. Chang, L, Roberts, S, Welsh, A. Robust lasso regression using Tukey’s biweight criterion. Technometrics 2018;60:36–47. https://doi.org/10.1080/00401706.2017.1305299.Search in Google Scholar
26. Deutelmoser, H, Scherer, D, Brenner, H, Waldenberger, M, Suhre, K, INTERVAL study, et al.. Robust Huberlasso for improved prediction of protein, metabolite and gene expression levels relying on individual genotype data. Briefings Bioinf 2020;22. https://doi.org/10.1093/bib/bbaa230.Search in Google Scholar PubMed PubMed Central
27. Amato, U, Antoniadis, A, De Feis, I, Gijbels, I. Penalised robust estimators for sparse and highdimensional linear models. Stat Methods Appl 2021;30:1–48. https://doi.org/10.1007/s1026002000511z.Search in Google Scholar
28. Lutz, RW, Kalisch, M, Bühlmann, P. Robustified L2 boosting. Comput Stat Data Anal 2008;52:3331–41. https://doi.org/10.1016/j.csda.2007.11.006.Search in Google Scholar
29. Ju, X, SalibiánBarrera, M. Robust boosting for regression problems. Comput Stat Data Anal 2021;153:107065. https://doi.org/10.1016/j.csda.2020.107065.Search in Google Scholar
30. Huber, PJ. Robust estimation of a location parameter. Ann Math Stat 1964;35:73–101. https://doi.org/10.1214/aoms/1177703732.
31. Yu, C, Yao, W. Robust linear regression: a review and comparison. Commun Stat Simulat Comput 2017;46:6261–82. https://doi.org/10.1080/03610918.2016.1202271.
32. Hofner, B, Mayr, A, Robinzonov, N, Schmid, M. Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 2014;29:3–35. https://doi.org/10.1007/s00180-012-0382-5.
33. Bühlmann, P. Boosting for high-dimensional linear models. Ann Stat 2006;34:559–83. https://doi.org/10.1214/009053606000000092.
34. Hepp, T, Schmid, M, Gefeller, O, Waldmann, E, Mayr, A. Approaches to regularized regression – a comparison between gradient boosting and the lasso. Methods Inf Med 2016;55:422–30. https://doi.org/10.3414/ME16-01-0033.
35. Friedman, JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;29:1189–232. https://doi.org/10.1214/aos/1013203451.
36. Hothorn, T, Bühlmann, P, Kneib, T, Schmid, M, Hofner, B. mboost: model-based boosting. R package version 2.9-5; 2021.
37. Loos, RJF. 15 years of genome-wide association studies and no signs of slowing down. Nat Commun 2020;11:5900. https://doi.org/10.1038/s41467-020-19653-5.
38. MacArthur, J, Bowler, E, Cerezo, M, Gil, L, Hall, P, Hastings, E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res 2016;45:896–901. https://doi.org/10.1093/nar/gkw1133.
39. Sun, Q, Zhou, W, Fan, J. Adaptive Huber regression. J Am Stat Assoc 2020;115:254–65. https://doi.org/10.1080/01621459.2018.1543124.
40. Alfons, A, Croux, C, Gelper, S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 2013;7:226–48. https://doi.org/10.1214/12-AOAS575.
41. Park, H. Outlier-resistant high-dimensional regression modelling based on distribution-free outlier detection and tuning parameter selection. J Stat Comput Simulat 2017;87:1799–812. https://doi.org/10.1080/00949655.2017.1287186.
42. Archetti, F, Giordani, I, Vanneschi, L. Genetic programming for anticancer therapeutic response prediction using the NCI60 dataset. Comput Oper Res 2010;37:1395–405. https://doi.org/10.1016/j.cor.2009.02.015.
43. Hemphill, E, Lindsay, J, Lee, C, Măndoiu, II, Nelson, CE. Feature selection and classifier performance on diverse biological datasets. BMC Bioinf 2014;15:S4. https://doi.org/10.1186/1471-2105-15-S13-S4.
44. Min, EJ, Safo, SE, Long, Q. Penalized co-inertia analysis with applications to omics data. Bioinformatics 2018;35:1018–25. https://doi.org/10.1093/bioinformatics/bty726.
45. Wu, C, Zhou, F, Ren, J, Li, X, Jiang, Y, Ma, S. A selective review of multi-level omics data integration using variable selection. High-Throughput 2019;8:4. https://doi.org/10.3390/ht8010004.
46. Alfons, A, Croux, C, Gelper, S. Robust groupwise least angle regression. Comput Stat Data Anal 2016;93:421–35. https://doi.org/10.1016/j.csda.2015.02.007.
47. Alfons, A. robustHD: an R package for robust regression with high-dimensional data. J Open Source Software 2021;6:3786. https://doi.org/10.21105/joss.03786.
48. Hepp, T, Schmid, M, Mayr, A. Significance tests for boosted location and scale models with linear base-learners. Int J Biostat 2019;15. https://doi.org/10.1515/ijb-2018-0110.
49. Mayr, A, Schmid, M, Pfahlberg, A, Uter, W, Gefeller, O. A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models. Stat Methods Med Res 2017;26:1443–60. https://doi.org/10.1177/0962280215581855.
50. Hofner, B, Kneib, T, Hothorn, T. A unified framework of constrained regression. Stat Comput 2016;26:1–14. https://doi.org/10.1007/s11222-014-9520-y.
51. Chambers, R, Dreassi, E, Salvati, N. Disease mapping via negative binomial regression M-quantiles. Stat Med 2014;33:4805–24. https://doi.org/10.1002/sim.6256.
52. Mayr, A, Fenske, N, Hofner, B, Kneib, T, Schmid, M. Generalized additive models for location, scale and shape for high-dimensional data – a flexible approach based on boosting. J Roy Stat Soc: Ser C (Appl Stat) 2012;61:403–27. https://doi.org/10.1111/j.1467-9876.2011.01033.x.
53. Rigby, RA, Stasinopoulos, DM. Generalized additive models for location, scale and shape. J Roy Stat Soc: Ser C (Appl Stat) 2005;54:507–54. https://doi.org/10.1111/j.1467-9876.2005.00510.x.
54. Aeberhard, WH, Cantoni, E, Marra, G, Radice, R. Robust fitting for generalized additive models for location, scale and shape. Stat Comput 2021;31:11. https://doi.org/10.1007/s11222-020-09979-x.
Supplementary Material
The online version of this article offers supplementary material (https://doi.org/10.1515/ijb-2021-0127).
© 2022 Walter de Gruyter GmbH, Berlin/Boston