Publicly available. Published by De Gruyter, May 26, 2016

Statistical Inference for Data Adaptive Target Parameters

  • Alan E. Hubbard, Sara Kherad-Pajouh and Mark J. van der Laan

Abstract

Consider a setting in which one observes n i.i.d. copies of a random variable with a probability distribution known to be an element of a particular statistical model. To define our statistical target, we partition the sample into V equal-size sub-samples and use this partitioning to define V splits, each consisting of an estimation sample (one of the V sub-samples) and a complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V sample-specific target parameters. We present an estimator (and corresponding central limit theorem) for this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. The new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data set. Given that more research is becoming “data-driven”, the theory developed within this paper provides a new impetus for a greater involvement of statistical inference in problems that are increasingly addressed by clever, yet ad hoc, pattern-finding methods. To suggest this potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules, are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.

1 Introduction

A proliferation of statistical/data science methods has accompanied the growing systematic collection of data across many scientific fields. Progress has been made in developing quantitative statistical methods well-suited to exploratory analysis; however, much remains to be done in deriving estimators and robust inference for relevant parameters in such a context. Growing fields such as precision medicine and high-dimensional (high-throughput) biology try to capitalize on the resulting “big data” with inspired pattern-finding procedures [1]; less emphasis has been given to formally defining the parameters such procedures “discover”. Thus, an obvious first step necessary for deriving theoretical results is to explicitly define such data adaptive parameters. The goal of previous work [2] and this paper is to address the issue of rigorous inference when the target parameter is not pre-specified.

We note that there is a literature on the dangers of deriving parameters data-adaptively. The common wisdom for deriving consistent inference for a data-adaptively defined parameter is to use sample splitting, where one of the splits is a training set used to define the parameter, and the held-out, independent “estimation” sample is then used to estimate this parameter. Quoting Dwork et al. [3]:

The “textbook” advice for avoiding problems of this type is to collect fresh samples from the same data distribution whenever one ends up with a procedure that depends on the existing data. Getting fresh data is usually costly and often impractical so this requires partitioning the available dataset randomly into two or more disjoint sets of data (such as a training and testing set) prior to the analysis. Following this approach conservatively with m adaptively chosen procedures would significantly (on average by a factor of m) reduce the amount of data available for each procedure.

Our main proposed approach keeps the data-adaptive part of the sample-splitting procedure described in the quote above, but defines an average of the data-adaptive parameter across arbitrary splits of this sort (we emphasize V-fold cross-validation below). In this way, one can still use the power of the entire data set while avoiding strong conditions on the algorithms used to data-adaptively define the parameters.

1.1 Motivating example

The general methodology for estimation and inference for data adaptive parameters is presented below; here, we first illustrate the method with a particularly challenging causal inference estimation problem. Consider data from the Western Collaborative Group Study [4], a prospective study of risk factors for coronary heart disease (CHD). The study consisted of 3,524 males (3,142 of whom had complete data) aged 39–59, working in certain California corporations, who were enrolled at the outset of the study and followed for 8.5 years. Our goal is to estimate the impact on CHD of applying a treatment rule regarding cholesterol that is learned from the data. Define the data of interest to be O=(W,A,Y), where W is a vector of confounders; A is the treatment of interest (say cholesterol level), which is dichotomized so that A=I(Chol>γ) for a target intervention level γ; and Y is the indicator of a CHD event within the study period. Let Q̄0(A,W) ≡ E0(Y|A,W). We start with the ambitious goal of estimating a treatment rule for intervening on cholesterol that targets only those people who would be helped by such an intervention. To do so, we define so-called counterfactuals [5], Ya, representing the outcome for a subject if, possibly contrary to fact, the subject had level A=a. This leads to a notion of potentially “full” data that includes the counterfactuals, in this case X=(W,A,Y1,Y0). Our goal is to estimate the impact of an intervention rule that lowers a subject's cholesterol (from A=1 to A=0) only if such a change improves the CHD outcome (that is, lowers cholesterol if Y1=1, Y0=0 and A=1), or:

(1) E{Y − Y_{d_X}}

with rule d_X(Y1,Y0) = I(Y0 < Y1). Beyond the assumptions of randomization, positivity and consistency [6], this parameter would be identifiable only under strong assumptions on the joint distribution of the counterfactuals. Thus, we consider a less ambitious parameter, based on average impacts of the intervention, conditional on W. In this case, the parameter measures the impact of only targeting those individuals who “significantly” benefit from intervening on those with high cholesterol (A=1), or:

(2) E{Y − Y_{d_{0,τ,Q̄0}}}

where Q̄0(A,W) ≡ E0(Y|A,W) and (A,W) ↦ d_{0,τ,Q̄0}(A,W) = A·I(Q̄0(A,W) − Q̄0(0,W) < τ). In comparing the same individuals in a population before and after this rule is imposed, their A stays as it was unless the original A=1 and Q̄0(1,W) − Q̄0(0,W) > τ. Though this parameter is identifiable without the stronger assumptions necessary to identify (1), it still requires the strong assumption that Q̄0 can be estimated at a particular rate. Thus, we finally consider examining the impact of an empirically derived rule:

(3) E{Y − Y_{d_{n,τ,Q̄n}}}

where the rule (A,W) ↦ d_{n,τ,Q̄n}(A,W) is as above, but with the true mean function replaced by an empirically derived estimate Q̄n(A,W); this is a data adaptive parameter, as the data are used to define the parameter.
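As a minimal sketch of such an empirically derived rule, the following keeps A as it is unless A = 1 and the estimated benefit of lowering cholesterol exceeds τ. The fitted regression Q̄n here is a hypothetical stand-in, and this plug-in illustration is not the paper's estimator:

```python
import numpy as np

def rule_d(A, W, Qbar_n, tau):
    """Rule d_{n,tau}: leave A as-is unless A == 1 and the estimated
    benefit of lowering cholesterol, Qbar_n(1, W) - Qbar_n(0, W),
    exceeds tau, in which case set A to 0."""
    benefit = Qbar_n(1, W) - Qbar_n(0, W)
    return np.where((A == 1) & (benefit > tau), 0, A)

# Hypothetical fitted regression, for illustration only.
Qbar_n = lambda a, w: 0.10 + 0.05 * a * (w > 0)

W = np.array([0.5, -1.0, 2.0, 0.3, -0.2])
A = np.array([1, 1, 0, 1, 0])
print(rule_d(A, W, Qbar_n, tau=0.02))
```

Only treated subjects whose estimated benefit exceeds τ are switched to A = 0; untreated subjects are never switched.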

We will discuss several specific examples below, but note that the sequence of analyses often used in large-scale omic studies (genomics, proteomics, metabolomics, etc.; Zhang and Chen [7], Berger et al. [8]) can be the result of a series of suggested patterns that lead to further analyses not previously considered (e.g., from multiple testing, to clustering, to exploration of pathways, to more targeted analyses), all with the data from the same experiment. In these cases, inference that ignores that the parameters were derived data adaptively will typically be biased. Others have noted the particular danger that high-dimensional data combined with flexible methodologies will generate excessive false-positive findings (Ioannidis [9], Broadhurst and Kell [10]). In many cases, even when the best intentions are to stick to a pre-specified data analysis plan, there can be feedback from the data into the models chosen (e.g., covariates dropped, different basis functions tried, unplanned sub-group analyses conducted, etc.; Barraclough and Govindan [11], Marler [12]). Thus, it is important to have methods that allow such exploration while retaining transparent interpretation of the resulting estimates. Though there are advantages to pre-specifying the algorithm used to generate the parameter(s), the general methodology does not even require this: one can derive inference for methods of deriving patterns even when the precise methods used to generate the parameters are not known. Thus, it can be applied in circumstances where there is little constraint on how the data is explored to generate potential parameters of interest for estimation and inference.

2 Methodology

The following is also presented in Hubbard and van der Laan [2]. Consider observed data O1, …, On, i.i.d. with probability distribution P0, within statistical model ℳ. Let Bn ∈ {0,1}^n be a random vector of binary indicators, independent of (O1, …, On), that defines a random split into an estimation sample {Oi : Bn(i) = 1} and a parameter-generating sample {Oi : Bn(i) = 0}. For simplicity, assume that Bn corresponds with a V-fold cross-validation scheme, i.e., 1) {1, …, n} is divided into V equal-size subgroups, 2) an estimation sample is defined by one of the subgroups, and 3) the parameter-generating sample is its complement, resulting in V such splits of the sample. Thus, in this case Bn has only V possible values.

Given a random split Bn, P^0_{n,Bn} is the empirical distribution of the parameter-generating sample, and P^1_{n,Bn} the empirical distribution of the estimation sample. For a given Bn, Ψ_{Bn,P^0_{n,Bn}} : ℳ → ℝ is the target parameter mapping indexed by the parameter-generating sample P^0_{n,Bn}, and Ψ̂_{Bn,P^0_{n,Bn}} : ℳ_NP → ℝ is the corresponding estimator of this target parameter. Here ℳ_NP is the nonparametric model, and an estimator is defined as a mapping/algorithm from the nonparametric model, including the empirical distributions, to the parameter space. For simplicity, assume that the parameter is real-valued. Thus, the target parameter mapping and estimator can depend not only on the parameter-generating sample P^0_{n,Bn}, but also on the particular split Bn.

The choice of target parameter mapping and corresponding estimator can be informed by the data P^0_{n,Bn} and split Bn, but not by the estimation sample P^1_{n,Bn}. One does not need to assume that the mapping from the parameter-generating sample to the space of target parameter mappings and estimators is known; one need only know its realization (Ψ_{Bn,P^0_{n,Bn}}, Ψ̂_{Bn,P^0_{n,Bn}}). Define the sample-split data adaptive statistical target parameter as Ψn : ℳ → ℝ with

Ψn(P) = E_{Bn} Ψ_{Bn,P^0_{n,Bn}}(P)

and the statistical estimand of interest is thus

ψn,0 = Ψn(P0) = E_{Bn} Ψ_{Bn,P^0_{n,Bn}}(P0).

This parameter mapping depends on the data and thus it is called a data adaptive target parameter. A corresponding estimator of the estimand ψn,0 is:

ψn = Ψ̂(Pn) = E_{Bn} Ψ̂_{Bn,P^0_{n,Bn}}(P^1_{n,Bn}).

The goal is to prove that √n(ψn − ψn,0) converges in distribution to a mean-zero normal distribution with a variance σ² that can be consistently estimated, allowing the construction of confidence intervals for ψn,0 and also allowing tests of a null hypothesis such as H0 : ψn,0 = 0. This holds if ψn = Ψ̂(Pn) is an asymptotically linear estimator of ψn,0 with influence curve IC(P0):

ψn − ψn,0 = (Pn − P0)IC(P0) + oP(1/√n);

the notation Pf ≡ ∫ f(o) dP(o) is used for the expectation of f(O) w.r.t. P. Since (Pn − P0)IC(P0) = (1/n) Σi IC(P0)(Oi) is an average of mean-zero independent random variables, the asymptotic linearity implies that √n(ψn − ψn,0) converges to a mean-zero normal distribution with variance σ² = P0 IC(P0)².

Theorem 1

Suppose that, given (Bn, P^0_{n,Bn}), Ψ̂_{Bn,P^0_{n,Bn}} is an asymptotically linear estimator of Ψ_{Bn,P^0_{n,Bn}}(P0) at P0, with influence curve IC_{Bn,P^0_{n,Bn}}(P0) indexed by (Bn, P^0_{n,Bn}):

Ψ̂_{Bn,P^0_{n,Bn}}(P^1_{n,Bn}) − Ψ_{Bn,P^0_{n,Bn}}(P0) = (P^1_{n,Bn} − P0) IC_{Bn,P^0_{n,Bn}}(P0) + R_{n,Bn},

where (unconditionally) R_{n,Bn} = oP(1/√n). Assuming V-fold cross-validation, for a given split Bn = υ, assume that P0 IC²_{υ,P^0_{n,υ}}(P0) − P0 {ICυ(P0)}² → 0 in probability, where ICυ(P0) is a limit influence curve that can still be indexed by the split υ.

Then, √n(ψn − ψn,0) = (1/√V) Σ_{υ=1}^V √(n/V) (P^1_{n,υ} − P0) IC_{υ,P^0_{n,υ}}(P0) + oP(1) converges to a mean-zero normal distribution with variance

σ² = (1/V) Σ_{υ=1}^V σ²_υ,

where σ²_υ = P0 IC²_υ(P0). A consistent estimator of σ² is given by

σ²_n = (1/V) Σ_{υ=1}^V Pn IC²_{υ,n},

where IC_{υ,n} is an L²(P0)-consistent estimator of ICυ(P0). Alternatively, one can use

(4) σ²_n = (1/V) Σ_{υ=1}^V P^1_{n,υ} IC_{υ,P^0_{n,υ}}(P^0_{n,υ})²,

where IC_{υ,P^0_{n,υ}}(P^0_{n,υ}) is an L²(P0)-consistent estimator of IC_{υ,P^0_{n,υ}}(P0) based on the sample P^0_{n,υ}.

The latter variance estimator avoids finite sample bias by using sample splitting and might therefore be preferable in finite samples. The proofs of theorems are provided in the Supplemental Material.
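As a concrete sketch of how ψn and the variance estimator (4) can be computed in practice, the following uses the conditional-risk target of Section 3.1 below, with an ordinary least-squares line standing in for an arbitrary data-adaptive algorithm; all concrete choices here are illustrative assumptions:

```python
import numpy as np

def cv_data_adaptive_risk(X, Y, V=5, seed=1):
    """Return psi_n (the average over V splits of the estimation-sample
    risk of a fit from the parameter-generating sample) and a Wald 95% CI
    based on the split-specific IC variance estimator of eq. (4)."""
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % V
    psi_v, var_v = [], []
    for v in range(V):
        est = folds == v                      # estimation sample
        gen = ~est                            # parameter-generating sample
        beta = np.polyfit(X[gen], Y[gen], 1)  # stand-in adaptive algorithm
        loss = (Y[est] - np.polyval(beta, X[est])) ** 2
        psi_v.append(loss.mean())             # split-specific estimate
        var_v.append(loss.var())              # empirical IC variance, eq. (4)
    psi_n = float(np.mean(psi_v))
    se = np.sqrt(np.mean(var_v) / n)          # sqrt(sigma_n^2 / n)
    return psi_n, (psi_n - 1.96 * se, psi_n + 1.96 * se)

rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = 2 * X + rng.normal(scale=0.5, size=200)
psi_n, ci = cv_data_adaptive_risk(X, Y)
print(psi_n, ci)  # risk should be near the residual variance, 0.25
```

Note that the standard error uses the average of the within-split IC variances divided by the full sample size n, since each split contributes an estimate based on n/V observations and the V estimates are averaged.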

Asymptotic equivalence of the standardized estimator and the standardized oracle estimator. Suppose that the algorithm (Bn, P^0_{n,Bn}) ↦ (Ψ_{Bn,P^0_{n,Bn}}, Ψ̂_{Bn,P^0_{n,Bn}}) that maps the data and choice of sample split into a target parameter mapping and estimator does not depend on the particular split Bn. This would be true if, for instance, a fixed algorithm were used to generate target parameters. In that case, the influence curve IC_{Bn,P^0_{n,Bn}}(P0), conditional on the parameter-generating sample P^0_{n,Bn} and split Bn, will converge to a fixed IC(P0) that does not depend on the split. In this important case, the estimator ψn of ψn,0 is asymptotically linear with influence curve IC(P0), which is the influence curve of the estimator Ψ̂_{P0} : ℳ_NP → ℝ of the target parameter Ψ_{P0} : ℳ → ℝ, treating P0 as known, leading to the limit variance:

σ² = P0 IC(P0)².

In addition, the standardized estimator √n(ψn − ψn,0) has the same asymptotic variance as the standardized “oracle” estimator √n(Ψ̂_{P0}(Pn) − Ψ_{P0}(P0)) (that is, an estimator of an a priori specified parameter, as opposed to a data adaptive one) that one would have used for the parameter Ψ_{P0}(P0) if the parameter mapping Ψ_{P0} were treated as known. Even though there is no loss in efficiency relative to this oracle procedure Ψ̂_{P0}(Pn), we should note that this asymptotic variance is measured relative to a different target, E_{Bn} Ψ_{P^0_{n,Bn}}(P0), instead of Ψ_{P0}(P0). Finally, we provide heuristics for choosing the number of splits in the Supplemental Material.

2.1 Splitting the sample, but using the whole sample to fit the data adaptively generated target parameter

In Theorem 1 above, one need not assume Donsker class conditions, so the target parameter choices Ψ_{Bn,P^0_{n,Bn}} can depend arbitrarily on the data P^0_{n,Bn}. However, now consider an estimator ψ^1_n ≡ E_{Bn} Ψ̂_{Bn,P^0_{n,Bn}}(Pn) of the same estimand ψn,0, which uses the entire sample as the estimation sample for each of the V parameter-generating samples. The asymptotics now rely on stronger assumptions, but if the algorithm generating the target parameter and estimator differs across splits, and the stronger assumptions are satisfied, then this estimator is generally more efficient than the estimator based on Theorem 1.

Theorem 2

As above, assume that, conditional on (Bn, P^0_{n,Bn}), Ψ̂_{Bn,P^0_{n,Bn}} is asymptotically linear with influence curve IC_{Bn,P^0_{n,Bn}}(P0), so that

Ψ̂_{Bn,P^0_{n,Bn}}(Pn) − Ψ_{Bn,P^0_{n,Bn}}(P0) = (Pn − P0) IC_{Bn,P^0_{n,Bn}}(P0) + R_{n,Bn},

where (unconditionally) R_{n,Bn} = oP(1/√n). Also, as in Theorem 1, for a given split Bn = υ, assume that P0 IC²_{υ,P^0_{n,υ}}(P0) − P0 {ICυ(P0)}² → 0 in probability, where ICυ(P0) is a limit that can still be indexed by the split υ.

We also assume that IC_{υ,P^0_{n,υ}}(P0) falls in a P0-Donsker class with probability tending to 1.

Then,

ψ^1_n − ψn,0 = (Pn − P0) IC(P0) + oP(1/√n),

where

IC(P0) ≡ (1/V) Σ_{υ=1}^V ICυ(P0)

is the average of the Bn-specific influence curves. Thus, √n(ψ^1_n − ψn,0) converges to a mean-zero normal distribution with variance

σ²_1 = P0 {(1/V) Σ_υ ICυ(P0)}².

The relative efficiency of the two estimators ψn and ψ^1_n is of course determined by the two corresponding asymptotic variances

σ² = (1/V) Σ_{υ=1}^V σ²_υ and σ²_1 = (1/V²) Σ_{υ1,υ2} P0 {IC_{υ1}(P0) IC_{υ2}(P0)}.

In the special case that ICυ = IC does not depend on the split υ (i.e., the algorithm generating a target parameter and estimator is the same for each split), σ² = σ²_1. In the other extreme case that P0 IC_{υ1} IC_{υ2} = 0 for υ1 ≠ υ2, σ² = (1/V) Σ_υ σ²_υ and σ²_1 = (1/V²) Σ_υ σ²_υ. Thus, in the latter case σ² = V σ²_1, and one can conclude that if the selected target parameters across the V parameter-generating samples are highly correlated, then the estimator ψn is almost as efficient as ψ^1_n, but if the selected target parameters across different sample splits are nearly independent/orthogonal, then a significant loss in efficiency, up to a factor of V, can occur. This efficiency comparison does not take into account that ψn is asymptotically normally distributed under significantly weaker conditions than those needed for asymptotic linearity of ψ^1_n, so there will be cases in which the model required for asymptotic normality of ψn holds but the analogous model for ψ^1_n fails. The comparison also does not take into account that ψ^1_n should have better second-order behavior than ψn for nonlinear estimators, since ψ^1_n uses the full sample for each of the data adaptively generated target parameters.
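The two extreme cases can be checked with a small Monte Carlo experiment, under the simplifying assumption that the split-specific influence curves ICυ are known, mean-zero functions: here the υth coordinate of a V-dimensional standard normal, so the ICυ are exactly orthogonal. The ψn analog averages each ICυ over its own estimation sample; the ψ^1_n analog averages each ICυ over the whole sample, and the variance ratio comes out to V:

```python
import numpy as np

rng = np.random.default_rng(0)
n, V, reps = 500, 5, 4000
z_split, z_full = [], []
for _ in range(reps):
    Z = rng.normal(size=(n, V))   # column v plays the role of IC_v(O_i)
    folds = np.arange(n) % V
    # psi_n analog: IC_v averaged over its own estimation sample (size n/V)
    z_split.append(np.sqrt(n) * np.mean([Z[folds == v, v].mean() for v in range(V)]))
    # psi_n^1 analog: every IC_v averaged over the whole sample
    z_full.append(np.sqrt(n) * Z.mean())
print(np.var(z_split), np.var(z_full))  # approx. 1 and 1/V = 0.2
```

With identical ICυ across splits the two standardized statistics would instead have the same variance, matching the first special case above.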

2.2 Using the whole sample to generate the target parameter and to subsequently estimate it: no sample splitting

Consider a mapping Pn ↦ (Ψ_{Pn}, Ψ̂_{Pn}) from a sample to a target parameter mapping Ψ_{Pn} : ℳ → ℝ and corresponding estimator Ψ̂_{Pn} : ℳ_NP → ℝ. The estimand of interest is now Ψ_{Pn}(P0), and it is estimated with ψ^2_n = Ψ̂_{Pn}(Pn). The possible advantage of this approach is that the estimand is a single parameter instead of an average over splits of sample-split-specific estimands, and the latter might be harder to interpret. However, as in the previous subsection, stronger conditions are needed to establish the desired asymptotic consistency and normality. In contrast to the method of the previous subsection, in which we only changed the estimator, we have now changed the estimand as well.

Theorem 3

Assume Ψ̂_P(Pn) is an asymptotically linear estimator of Ψ_P(P0) at P0 with influence curve IC_P(P0), uniformly in the choice of parameter P in the following sense:

Ψ̂_{Pn}(Pn) − Ψ_{Pn}(P0) = (Pn − P0) IC_{Pn}(P0) + Rn,

where Rn = oP(1/√n). In addition, assume P0 {IC_{Pn}(P0) − IC_{P0}(P0)}² → 0 in probability and that IC_{Pn}(P0) is an element of a P0-Donsker class with probability tending to 1. Then,

Ψ̂_{Pn}(Pn) − Ψ_{Pn}(P0) = (Pn − P0) IC_{P0}(P0) + oP(1/√n),

and thus √n(ψ^2_n − Ψ_{Pn}(P0)) is asymptotically normally distributed with mean zero and variance σ² = P0 IC_{P0}(P0)².

Again, this estimator ψ^2_n is as efficient as the oracle estimator Ψ̂_{P0}(Pn) of Ψ_{P0}(P0) discussed above, but one should note again that its efficiency is measured relative to a different target, Ψ_{Pn}(P0), instead of Ψ_{P0}(P0). Since the parameter mapping Ψ_{P0} is unknown while Ψ_{Pn} is a known target parameter mapping, one might often find the parameter Ψ_{Pn}(P0) more tangible than Ψ_{P0}(P0), and thus easier to interpret. In essence, Theorem 3 provides the conditions necessary for consistent inference for a “data-dredging” algorithm: using the same data both to generate the parameter of interest and to derive its inference.

Note that others have examined the asymptotics of such a procedure (no sample splitting) using different theoretical approaches. For instance, Dwork et al. [3] present asymptotic consistency results for estimating a number m of data-adaptively derived functions of the data-generating distribution, as a function of m and the sample size n.

3 Examples

In this section we showcase a few examples to demonstrate the proposed procedures for generating statistical target parameters and corresponding estimators and confidence intervals. For a longer list of examples, see the Supplemental Material.

3.1 Inference for the sample-split conditional risk of a data adaptive regression estimator

The set-up is identical to Dell et al. [13], Dudoit and van der Laan [14]. Let O = (W, Y) ∼ P0, where W is a vector of input variables and Y is an outcome one wants to predict; P0 is composed of the conditional distribution of Y given W, Q0(W), and the distribution of W, Q0,W. Let Q̄̂ be an estimator of the true regression function Q̄0 = E0(Y|W), and let Q̄_{P^0_{n,Bn}} be the corresponding estimate of Q̄0 based on the parameter-generating sample P^0_{n,Bn}. The target parameter generated by P^0_{n,Bn} is defined as the mean squared error Ψ_{P^0_{n,Bn}}(P0) = E0 (Y − Q̄_{P^0_{n,Bn}}(W))², or, in general, as the loss-function-specific risk E0 L(Q̄_{P^0_{n,Bn}})(W, Y) for some loss function L(Q̄) satisfying Q̄0 = argmin_{Q̄} E0 L(Q̄).

The estimator of Ψ_{P^0_{n,Bn}}(P0) based on the estimation sample P^1_{n,Bn} is defined as its empirical counterpart Ψ̂_{P^0_{n,Bn}}(P^1_{n,Bn}) = P^1_{n,Bn} L(Q̄_{P^0_{n,Bn}}). Conditional on the sample P^0_{n,Bn}, this estimator is asymptotically linear with influence curve L(Q̄_{P^0_{n,Bn}}) − P0 L(Q̄_{P^0_{n,Bn}}), with no remainder. The average across sample-split data adaptive target parameters is thus defined as ψn,0 = E_{Bn} P0 L(Q̄_{P^0_{n,Bn}}), and its corresponding estimators are ψn = E_{Bn} P^1_{n,Bn} L(Q̄_{P^0_{n,Bn}}), ψ^1_n = E_{Bn} Pn L(Q̄_{P^0_{n,Bn}}), and ψ^2_n = Pn L(Q̄_{Pn}). Theorem 1 implies that if the chosen loss function is uniformly bounded and the estimator Q̄̂(Pn) is consistent for a limit Q̄ (not necessarily Q̄0), then ψn is an asymptotically linear estimator of ψn,0 with influence curve L(Q̄) − P0 L(Q̄), the same influence curve as that of the estimator Pn L(Q̄) of P0 L(Q̄) treating Q̄ as known. This allows us to construct a confidence interval for the true conditional risk ψn,0 under these very weak conditions. In particular, the estimator Q̄̂ can be a highly data adaptive super learner (van der Laan et al. [15]).

Similarly, Theorem 2 implies a formal result for ψ^1_n, but now L(Q̄_{Pn}) has to be an element of a P0-Donsker class with probability tending to 1, which puts some constraints on how adaptive Q̄_{Pn} can be. Under the same conditions, ψ^2_n = Pn L(Q̄_{Pn}) is an asymptotically linear estimator of P0 L(Q̄) with the same influence curve as ψn. Even though these conditions might be satisfied for Q̄n, the estimator ψ^2_n is known to be inappropriate for selecting among a collection of candidate estimators of Q̄0, since this estimator of risk will favor over-fitted estimators. Nonetheless, if the goal is to obtain confidence intervals for the asymptotic risk P0 L(Q̄_{Pn}) of an estimator Q̄_{Pn}, then this method could be considered.

3.2 Inference for sample-split subgroup-specific causal effect, where the subgroups are data adaptively determined

Consider “discovering” sub-groups within the target population that have unique relationships with an explanatory variable of interest (e.g., drug treatment, environmental exposure, etc.). When these sub-groups are not defined apart from the data, post hoc sub-group analysis is typically treated as purely exploratory, and the accompanying statistical inference is inherently flawed, typically anti-conservatively. However, the approach we have outlined provides an explicit framework for aggressively searching for interesting sub-groups while still allowing consistent statistical inference for the resulting estimators of association parameters.

Suppose that we observe on each subject O = (W, A, Y), where W is a vector of baseline covariates, A is a binary treatment, and Y a final outcome. Thus we observe n i.i.d. copies O1, …, On, and consider an algorithm that maps a data set O1, …, On into a subgroup indicator W ↦ C(W) ∈ {0, 1}, where C(W) = 1 indicates membership in the subgroup. Denote this subgroup estimator by Ĉ : ℳ_NP → 𝒞, where 𝒞 is the space of functions that map W into a binary indicator. Given a realized subgroup C, let Ψ_C : ℳ → ℝ be a desired parameter of interest, such as the W-controlled effect of treatment A on Y for subgroup C, defined as

Ψ_C(P0) = E0{E0(Y | A = 1, W, C(W) = 1) − E0(Y | A = 0, W, C(W) = 1) | C(W) = 1}.

Let Ψ̂_C : ℳ_NP → ℝ be an estimator of Ψ_C(P0); again, one could choose among several different estimators (including the IPTW; Robins et al. [16]), but we focus on the TMLE [17, 18]. Note that this is just the targeted maximum likelihood estimator for the W-controlled effect of treatment, applied to the subsample {i : C(Wi) = 1}. Assume that the regularity conditions hold so that this TMLE Ψ̂_C(Pn) is asymptotically linear with influence curve IC_C(P0):

Ψ̂_C(Pn) − Ψ_C(P0) = (Pn − P0) IC_C(P0) + R_{C,n},

where R_{C,n} = oP(1/√n).

Define Ψ_{P^0_{n,Bn}} : ℳ → ℝ as Ψ_{P^0_{n,Bn}} = Ψ_{Ĉ(P^0_{n,Bn})}, i.e., the W-controlled effect of treatment on the outcome for the data adaptively determined subgroup Ĉ(P^0_{n,Bn}). Similarly, define Ψ̂_{P^0_{n,Bn}} : ℳ_NP → ℝ as Ψ̂_{P^0_{n,Bn}} = Ψ̂_{Ĉ(P^0_{n,Bn})}, i.e., the TMLE of the W-controlled effect of treatment on the outcome for this data adaptively determined subgroup, treating the latter as given. The estimand of interest is thus defined as ψn,0 = E_{Bn} Ψ_{P^0_{n,Bn}}(P0), and its estimator is ψn = E_{Bn} Ψ̂_{P^0_{n,Bn}}(P^1_{n,Bn}). That is, for a given split Bn, we use the parameter-generating sample P^0_{n,Bn} to generate a subgroup Ĉ(P^0_{n,Bn}) and a corresponding TMLE of Ψ_{Ĉ(P^0_{n,Bn})}(P0) applied to the estimation sample P^1_{n,Bn}, and these sample-split-specific estimators are averaged across the V sample splits. By assumption we have, for each split Bn,

Ψ̂_{Ĉ(P^0_{n,Bn})}(P^1_{n,Bn}) − Ψ_{Ĉ(P^0_{n,Bn})}(P0) = (P^1_{n,Bn} − P0) IC_{Ĉ(P^0_{n,Bn})}(P0) + R_{Ĉ(P^0_{n,Bn}),n},

where we now assume that (unconditionally) R_{Ĉ(P^0_{n,Bn}),n} = oP(1/√n). In addition, we assume that P0{IC_{Ĉ(P^0_{n,Bn})}(P0)}² converges to P0{IC_{Ĉ(P0)}(P0)}² for a limit subgroup Ĉ(P0). Application of Theorem 1 now proves that ψn is an asymptotically linear estimator of ψn,0 with influence curve IC_{Ĉ(P0)}(P0), so that it is asymptotically normally distributed with mean zero and variance σ² = P0 IC_{Ĉ(P0)}(P0)².

Under the Donsker class condition on IC_{Ĉ(P^0_{n,Bn})}(P0), we can also establish the formal results for the estimator ψ^1_n = E_{Bn} Ψ̂_{Ĉ(P^0_{n,Bn})}(Pn) of ψn,0 and the estimator ψ^2_n = Ψ̂_{Ĉ(Pn)}(Pn) of Ψ_{Ĉ(Pn)}(P0), respectively.
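A schematic version of this subgroup procedure can be sketched as follows, with a simple median-split subgroup finder and a plug-in regression estimate standing in for the TMLE; every concrete choice here (the subgroup finder, the working linear model, the data-generating process) is an illustrative assumption, not the paper's:

```python
import numpy as np

def find_subgroup(W, A, Y):
    """Data-adaptive subgroup finder (illustrative): pick the side of
    the median of W with the larger unadjusted treatment difference."""
    cut = np.median(W)
    diff = lambda m: Y[m & (A == 1)].mean() - Y[m & (A == 0)].mean()
    return (lambda w: w > cut) if diff(W > cut) > diff(W <= cut) else (lambda w: w <= cut)

def subgroup_effect(C, W, A, Y):
    """Plug-in effect estimate within subgroup C: fit a working linear
    outcome model on the subgroup members and read off the treatment
    coefficient (a simple stand-in for the TMLE used in the paper)."""
    s = C(W)
    Z = np.column_stack([np.ones(s.sum()), A[s], W[s]])
    beta, *_ = np.linalg.lstsq(Z, Y[s], rcond=None)
    return float(beta[1])

rng = np.random.default_rng(0)
n = 1000
W = rng.normal(size=n)
A = rng.integers(0, 2, size=n)
Y = 0.5 * A * (W > 0) + W + rng.normal(size=n)  # effect 0.5 only when W > 0
half = np.arange(n) % 2 == 0
C = find_subgroup(W[half], A[half], Y[half])    # parameter-generating half
eff = subgroup_effect(C, W[~half], A[~half], Y[~half])  # estimation half
print(eff)  # close to the true subgroup effect of 0.5
```

The subgroup is discovered on one half of the data and its effect is estimated on the other half, mirroring the single-split version of the procedure above.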

4 Simulations

We examined the performance of the three different algorithms for producing data adaptive target parameters, based on Theorems 1, 2 and 3 (referred to as algorithms 1, 2 and 3). The following steps provide the structure of the algorithm and some basis for understanding the data adaptive parameter.

  1. Generate a random sample of size n from the data-generating distribution and break it into V equal-size estimation samples of size nV = n/V, with corresponding parameter-generating samples of size n − n/V.

  2. For each parameter-generating sample, apply the data-adaptive algorithm to define the parameter to be estimated on the corresponding estimation sample, which defines Ψ_{P^0_{n,Bn}}. For instance, fit a data-adaptive regression procedure estimating the mean of the outcome Y given predictors W, say Q̄_{P^0_{n,Bn}}(W), and define the target parameter as the risk based on squared-error loss, Ψ_{P^0_{n,Bn}}(P0) = E_{P0}(Y − Q̄_{P^0_{n,Bn}}(W))², treating Q̄_{P^0_{n,Bn}} as fixed and known.

  3. For each of the V estimation samples, estimate the data adaptive parameter. For example, in the case of the risk example described in step 2, Ψ̂_{P^0_{n,Bn}}(P^1_{n,Bn}) = E_{P^1_{n,Bn}}(Y − Q̄_{P^0_{n,Bn}}(W))². In addition, derive the influence curve IC_{Bn,P^0_{n,Bn}}(·) of this estimator for each of the sample splits.

  4. To derive the value of the true parameter corresponding to each parameter-generating sample, draw a very large sample (of size 100,000) from the same distribution, representing the target population (P0). This is used to evaluate Ψ_{P^0_{n,Bn}}(P0) = E0(Y − Q̄_{P^0_{n,Bn}}(W))², where P0 is approximated by the empirical probability distribution of this very large sample.

  5. Estimate the asymptotic variance (4) of ψn based on the sample variance of IC_{Bn,P^0_{n,Bn}}(·) within the estimation samples (see Theorem 1 above), and construct a corresponding Wald-type confidence interval.

  6. Repeat steps 1–5 for 1,000 simulations, examine the distribution of the standardized differences √n(ψn − ψn,0), and determine the coverage probabilities of the confidence intervals.

The modifications for algorithms 2 and 3 follow from the respective theorems.
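The simulation loop in steps 1–6 can be sketched compactly as follows, with a cubic least-squares fit replacing the SuperLearner, a simple stand-in data-generating distribution, and only 200 repetitions; every concrete choice here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def dgp(m):
    """Stand-in data-generating distribution (an assumption, not the paper's)."""
    w = rng.normal(scale=2.0, size=m)
    return w, np.sign(w) + rng.normal(scale=0.5, size=m)

W0, Y0 = dgp(100_000)  # large sample approximating P_0 (step 4)

def one_sim(n=500, V=5):
    W, Y = dgp(n)                           # step 1: draw sample
    folds = rng.permutation(n) % V          #         and form V splits
    psi_v, true_v, var_v = [], [], []
    for v in range(V):
        gen, est = folds != v, folds == v
        beta = np.polyfit(W[gen], Y[gen], 3)                      # step 2
        ic = (Y[est] - np.polyval(beta, W[est])) ** 2
        psi_v.append(ic.mean())                                   # step 3
        true_v.append(np.mean((Y0 - np.polyval(beta, W0)) ** 2))  # step 4
        var_v.append(ic.var())                                    # step 5
    psi_n, psi_n0 = np.mean(psi_v), np.mean(true_v)
    se = np.sqrt(np.mean(var_v) / n)        # eq. (4), scaled by 1/n
    return abs(psi_n - psi_n0) <= 1.96 * se # does the Wald CI cover the truth?

coverage = np.mean([one_sim() for _ in range(200)])  # step 6: repeat
print(coverage)
```

The coverage proportion should be close to the nominal 0.95 when the Wald intervals based on Theorem 1 are valid.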

4.1 Risk estimation of a data adaptive prediction algorithm

More background on the conditional risk parameter is given in Section 3.1; here, the goal is estimation and inference regarding the “fit” of a machine learning algorithm. The data is O = (Y, W), for an outcome Y and predictor W, where W ∼ N(0, σ²_W = 4), Q̄0(W) ≡ E0(Y|W) is as shown in Figure 1, based on a piecewise constant model, and Y|W ∼ N(Q̄0(W), σ²_Y = 0.25). For the υth parameter-generating sample, we fit the regression with an ensemble stacking algorithm, the SuperLearner (SL; van der Laan et al. [15]), resulting in a convex combination of a variety of algorithms ranging from very smooth to highly data adaptive: a linear model; stepwise regression based on AIC (stepAIC; Venables and Ripley [19]); a Bayesian glm (linear) model (bayesglm; Gelman et al. [20]); a generalized additive model with a smooth term for the covariate (Hastie and Tibshirani [21]); neural nets (nnet; Venables and Ripley [19]); and a simple null model (the sample average of the outcome). For the υth parameter-generating sample, the data adaptive parameter of interest was defined as the conditional risk (mean squared error; MSE), conditional on the fitted prediction function: Ψ_{P^0_{n,Bn}}(P0) ≡ E0[(Y − Q̄_{P^0_{n,Bn}}(W))²] is the true expected squared-error loss of the SL fit (based on the parameter-generating sample P^0_{n,Bn}) and is estimated using the corresponding validation sample: Ψ̂_{P^0_{n,Bn}}(P^1_{n,Bn}) = E_{P^1_{n,Bn}}[(Y − Q̄_{P^0_{n,Bn}}(W))²]. Thus, the estimand is the risk averaged over the V estimation samples, Ψ_{Pn}(P0) = E_{Bn} Ψ_{P^0_{n,Bn}}(P0), and the corresponding estimator is Ψ̂(Pn) = E_{Bn} Ψ̂_{P^0_{n,Bn}}(P^1_{n,Bn}). Finally, inference is derived based on (4) above, where the estimated influence curve for the υth estimation sample is given by

IC_{P^0_{n,Bn},n} = (Y − Q̄_{P^0_{n,Bn}}(W))² − Ψ̂_{P^0_{n,Bn}}(P^1_{n,Bn}).

This is repeated for sample sizes of n = 100, 500, 1000, using algorithms 1, 2 and 3.

Figure 1: True model Q̄0(W) for simulations of conditional risk estimation.

4.1.1 Results

We examined the empirical distribution of the standardized differences (ψn − ψn,0)/se(ψn) for the risk. We observe minimal departure from normality (Figure 2) and nearly perfect coverage probability of the confidence intervals at all sample sizes for algorithms 1 and 2 (see Table 1). However, one can see that algorithm 3, though resulting in a lower “average” risk, yields biased inference (and a non-normal sampling distribution) at the relatively modest sample sizes, which implies the Donsker conditions are not met. By a sample size of n = 1000, though there is still some standardized bias, the coverage probability is nearly perfect. Thus, even for a highly adaptive algorithm, where overfitting (underestimation of risk) seems particularly troublesome, one begins achieving the conditions of Theorem 3 at modest sample sizes.

Figure 2: Distribution of (ψn − ψn,0)/se(ψn) for n = 100, 500, 1000, with the N(0, 1) distribution, for the three algorithms (1, 2 and 3, corresponding to the three rows). The dark line represents the mean of the standardized values, so its difference from 0 is the standardized bias.

Table 1:

Simulation results for risk estimation for data adaptive prediction based on Theorems 1–3 (ψn, ψ^1_n, ψ^2_n, respectively). Coverage probability is for a nominal 95 % CI.

N        Method   Average true parameter   Average estimate   Cov. Prob.
n=100    ψn       0.55                     0.55               0.93
         ψ^1_n    0.54                     0.54               0.92
         ψ^2_n    0.45                     0.40               0.78
n=500    ψn       0.55                     0.55               0.94
         ψ^1_n    0.55                     0.55               0.94
         ψ^2_n    0.46                     0.46               0.92
n=1000   ψn       0.40                     0.40               0.95
         ψ^1_n    0.40                     0.41               0.95
         ψ^2_n    0.36                     0.36               0.93

We also examined the same procedure for estimating the risk difference using algorithm 2. In this case, we observe slower convergence, but still relatively good coverage for an estimate that is particularly sensitive to over-fitting.

4.2 Average treatment effect for given prediction model

The average treatment effect, or ATE, is commonly the parameter of interest in applications of causal inference methods (Rubin [5]). We use the same set-up as in the example in the introduction, but for the ATE: E(Y1 − Y0), where (Y1, Y0) are the counterfactual outcomes for an individual unit under A = 1 and A = 0, respectively. Consider n i.i.d. observations of O = (W, A, Y) ∼ P0, where Y is an outcome, A is a binary treatment of interest, and W is a set of potential confounders. Under the identifiability assumptions discussed in the introduction, the ATE equals the following statistical estimand:

ATEE0,W{E0(Y|A=1,W)E0(Y|A=0,W)}.

Let Qˉ0(a,W)E0(Y|A=a,W), and assume that Qˉ0 is known. Then, the estimate of the ATE would be:

ATEˆ=1ni=1n{Qˉ0(1,Wi)Qˉ0(0,Wi)}.

Given an estimator of Qˉ0 on each of the training samples, we calculate the resulting data-adaptive ATE on the corresponding validation samples, and take the average, to derive our parameter of interest:

(5)ΨPn(P0)=EBnEP0{QˉPn,Bn0(1,W)−QˉPn,Bn0(0,W)}.

The data generating distribution for this simulation is defined by WN{0,var0(W)=4},A|W is binomial with logit{P(A=1|W)}=4+2W and Y|(W,A)N{Qˉ0(A,W),var0(Y|A,W)=0.25}, where Qˉ0(a,W) is shown in Figure 3.
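The sample-splitting logic of eq. (5) under this data generating distribution can be sketched as follows. This sketch substitutes a hypothetical Qˉ0 (the paper's true regression is shown in Figure 3) and arm-specific least squares for the SuperLearner fits, so it illustrates only the split/average structure, not the paper's actual estimator; under the assumed Qˉ0 below the true ATE is 1.0:

```python
import numpy as np

rng = np.random.default_rng(1)

def q0(a, w):
    # Hypothetical stand-in for the true regression (true ATE = E[1 + W] = 1).
    return a * (1.0 + w) + 0.5 * w

def simulate(n):
    w = rng.normal(0.0, 2.0, size=n)                 # var0(W) = 4
    p_a = 1.0 / (1.0 + np.exp(-(-4.0 + 2.0 * w)))    # logit P(A=1|W) = -4 + 2W
    a = rng.binomial(1, p_a)
    y = rng.normal(q0(a, w), 0.5)                    # var0(Y|A,W) = 0.25
    return w, a, y

def cv_ate(w, a, y, V=10):
    """Data-adaptive ATE of eq. (5): fit the outcome regression on each
    parameter-generating sample (arm-specific least squares as a stand-in
    for SuperLearner), evaluate the plug-in ATE on the held-out estimation
    sample, and average over the V splits."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), V)
    estimates = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        fits = {}
        for arm in (0, 1):
            idx = train[a[train] == arm]
            X = np.column_stack([np.ones(len(idx)), w[idx]])
            fits[arm], *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        Xv = np.column_stack([np.ones(len(val)), w[val]])
        estimates.append(np.mean(Xv @ fits[1] - Xv @ fits[0]))
    return float(np.mean(estimates))
```

Because each split's regression is fit only on the parameter-generating sample and evaluated only on the estimation sample, the fold-specific estimates inherit the independence that drives the central limit theorem used for inference.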

Figure 3: E0(Y|A=a,W)${E_0}(Y|A = a, \,W)$ for the ATE simulations.
Figure 3:

E0(Y|A=a,W) for the ATE simulations.

To derive QˉPn,Bn0, we use SuperLearner (SL) based upon the following learners: linear model; stepwise regression based on AIC (stepAIC; Venables and Ripley [19]); Bayesian glm (linear) model (bayesglm; Gelman et al. [20]); generalized additive model with a smooth term for the covariate (Hastie and Tibshirani [21]); neural nets (Venables and Ripley [19]); and a null model (intercept only).

Table 2:

Simulation results for ATE for the algorithms based on theorems 1–3 (ψn, ψn1, ψn2, respectively). Coverage probability is for a nominal 95 % CI.

Method   Average true parameter   Average estimated   MSE    Cov. Prob.
ψn       1.71                     0.89                0.82   1.19
ψn1      1.71                     0.89                0.82   0.51
ψn2      1.68                     0.89                0.79   0.63

We applied algorithms 1–3 for a sample size of n=500.

4.2.1 Results

The empirical distribution of the standardized differences, (ψn−ψn,0)/se(ψn), for the ATE parameter looked close to a standard normal sampling distribution for all algorithms 1–3 (not shown). Table 2 shows the results of the simulations for these algorithms, and as one can see, the estimation is unbiased, and the coverage of confidence intervals based on IC-based estimates of the standard errors is close to perfect. Though algorithms 1 and 3 produced different data adaptive target parameters and corresponding estimators, due to the linearity of the estimator ΨPn,Bn0 (i. e., it is just a difference in sample means), ψn and ψn2 have the same MSE. This implies that even in this case, where very adaptive estimators are used, the Donsker class assumptions hold, as the confidence intervals have the nominal coverage.

4.3 Variable reduction

We consider a situation that has an analogue in high-dimensional 'omic data, where multiple testing is often used to target a relatively small subset of (for instance) genes among tens of thousands of candidates. The method evaluated in this simulation uses the parameter-generating sample to select a small subset of the original genes, and subsequently uses the estimation sample to estimate the effect of these genes on some phenotype. In this manner, it avoids the need to apply multiple testing procedures that control a type-I error rate among a very large number of tests.

Let O=(A,Y=(Y1,Y2,…,Yp)), where A is a binary variable (indicating, for instance, phenotype), and Y is a multivariate outcome. Consider an algorithm that maps a data set O1,…,On into a subset C⊂{1,…,p} of genes, where we denote this subset-estimator as Cˆ:MNP→C, where C is the set of p-dimensional vectors with components in {0, 1}, so that C={0,1}p. We define our data adaptive parameter as:

(6)Ψn(P0)=EBn{E0(Y(Cˆ(Pn,Bn0))|A=1)E0(Y(Cˆ(Pn,Bn0))|A=0)},

where

Y(Cˆ(Pn,Bn0))=1|Cˆ|jI(Cˆ(j)=1)Yj

is an average of the gene-expression across a subset (cluster) of genes, where this subset is determined by a procedure on the parameter-generating sample.

The estimator of (6) based on the estimation sample is simply

ΨˆPn,Bn0(Pn,Bn1)=EBn{EPn,Bn1(YPn,Bn0|A=1)−EPn,Bn1(YPn,Bn0|A=0)}

and its influence curve is estimated as follows

(7)ICPn,Bn0(Y,A)=I(A=1)Pn,Bn1(A=1)I(A=0)Pn,Bn1(A=0)(YEPn,Bn1(Y|A)).

To investigate this estimator, we simulate based on a design where there are equal numbers of A=0 and A=1; for each (gene) j, the distribution of Yj, given A, is defined by the following regression equation

(8)Yj=B0j+B1jA+ejj=1,,p.

The coefficients B0j, B1j were generated from a multivariate normal distribution with E(B0)=E(B1)=0 and a variance-covariance matrix with Var(B0j)=Var(B1j)=1 and Cov(B0j,B1j)=0.2. Note that these coefficients are fixed in the simulation, not random, so this is just a convenient mechanism to generate a distribution of effect sizes B1j for which there is a true ranking based on the resulting P0. The errors ej were independent draws from a N(0,σe2) distribution, and we repeated the simulation both for different magnitudes of the residual error (different σe2) and for increasing sample sizes. The function that defines the subset of genes is simply based on ranking the genes by Bˆ1j=EPn,Bn0(Yj|A=1)−EPn,Bn0(Yj|A=0), and then Cˆ(Pn,Bn0) is the indicator that a gene is in the top 15. Thus, this is a large variable-reduction exercise, where we examine the association of phenotype with the average gene expression of a data-adaptively selected subset of genes. The same procedure for deriving the data adaptive target parameter and estimator is repeated for all 3 algorithms, with the corresponding methods for deriving the inference via the influence curve carried out as described above.
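A single-split (algorithm-1-style) version of this variable-reduction exercise can be sketched directly from eqs. (6)–(8). In the sketch below the dimensions, σe2=2, and the reading of the coefficient covariance (Var = 1, Cov = 0.2 between B0j and B1j) are illustrative assumptions; the IC-based standard error follows eq. (7):

```python
import numpy as np

rng = np.random.default_rng(2)

p, n = 1000, 500
# Correlated coefficient pairs (B0j, B1j) per eq. (8), drawn once, then fixed.
B = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]], size=p)
A = np.tile([0, 1], n // 2)                       # balanced phenotype groups
Y = B[:, 0] + np.outer(A, B[:, 1]) + rng.normal(0.0, np.sqrt(2.0), (n, p))

# Single split: first half generates the parameter, second half estimates it.
gen, est = np.arange(n // 2), np.arange(n // 2, n)
# Rank genes by the estimated group difference on the parameter-generating half.
diff = Y[gen][A[gen] == 1].mean(axis=0) - Y[gen][A[gen] == 0].mean(axis=0)
top = np.argsort(diff)[-15:]                      # C-hat: top 15 "genes"

# Average expression over the selected genes on the estimation sample,
# then the group difference and the IC-based SE of eq. (7).
ybar, a_est = Y[est][:, top].mean(axis=1), A[est]
m1, m0 = ybar[a_est == 1].mean(), ybar[a_est == 0].mean()
psi = m1 - m0
ic = (a_est / a_est.mean() - (1 - a_est) / (1 - a_est).mean()) * \
     (ybar - np.where(a_est == 1, m1, m0))
se = ic.std(ddof=1) / np.sqrt(len(est))
```

Because the estimation half is independent of the selection step, psi is an honest estimate of the selected subset's effect, without any multiplicity correction over the p candidates.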

4.3.1 Results

The results of the simulation are shown in Table 3 for the set with σe2=2. In this case, we observe very good performance with regard to coverage probability for algorithm 1, even at relatively modest sample sizes. On the other hand, for algorithms 2 and 3, which have an overlap in their parameter-generating and estimation samples, the confidence intervals attain nominal coverage only at larger sample sizes (n=10,000); thus the apparent violation of the conditions at smaller sample sizes (n≤2,000) for this very adaptive procedure is no longer a violation at still relatively modest sample sizes. However, algorithm 1 shows very good performance, with regard to statistical inference, at all but the smallest sample size, while not having greater sampling variability; thus, algorithm 1, all things being equal, is the safer choice.

5 Data analysis: Data adaptive estimation of the impact of interventions on cholesterol in WCGS study

We re-visit the original example discussed in the introduction: estimating the impact of the targeted treatment of cholesterol on rates of coronary heart disease (CHD) (3). Again, the data are O=(W,A,Y), with outcome Y (CHD), variable of interest A (an indicator of total cholesterol > 180 mg/dL) and covariates W (age, weight, height, smoking, behavior type, etc.). We estimate a parameter akin to (3), using algorithm 1. Like the ATE example in Section 4.2, our parameter and corresponding estimator (on the validation sample) are defined via the estimated regression of Y on (A, W), using the training sample to define the rule, dτ,QˉPn,Bn0:

(9)Ψn,Pn,Bn0(P0)=E(YYdτ,QˉPn,Bn0)=EYEYdτ,QˉPn,Bn0
Table 3:

Simulation results for variable reduction for the algorithms based on theorems 1–3 (ψn, ψn1, ψn2, respectively). Coverage probability is for a nominal 95 % CI.

N          Method   Average true parameter   Average estimated   Cov. Prob.
n=100      ψn       2.24                     2.25                0.83
           ψn1      2.24                     2.58                0.0060
           ψn2      2.63                     2.57                0.014
n=500      ψn       2.48                     2.48                0.91
           ψn1      2.48                     2.55                0.66
           ψn2      2.49                     2.56                0.69
n=1,000    ψn       2.51                     2.51                0.92
           ψn1      2.51                     2.58                0.77
           ψn2      2.51                     2.57                0.80
n=2,000    ψn       2.56                     2.56                0.94
           ψn1      2.56                     2.58                0.86
           ψn2      2.56                     2.58                0.88
n=10,000   ψn       2.51                     2.51                0.96
           ψn1      2.51                     2.52                0.95
           ψn2      2.51                     2.52                0.94

where, as above, QˉPn,Bn0 is the regression estimator on the training sample. The estimator for (9) is the sample average minus the targeted maximum likelihood estimator (TMLE; van der Laan and Rose [17]) of the rule-specific mean. For the estimate of the regression QˉPn,Bn0 used to define the treatment rule, we used the ensemble machine learning method SuperLearner [22, 23]. The learners included both very simple and potentially more complex, adaptive models: 1) fixed mean model, 2) main-terms logistic regression, 3) stepwise logistic regression with 2-way interactions, 4) generalized additive model [21] with smooths for all non-factor covariates, 5) neural nets [24], 6) penalized regression using glmnet [25], and 7) nearest neighbor [24]. SL is itself based on 10-fold cross-validation within the parameter-generating sample. On the corresponding estimation sample, the TMLE estimator also requires a regression of Y on (A, W), for which we again used SuperLearner with a similar set of learners; an estimate of the so-called treatment mechanism (g0(W)≡P(A=1|W)) is also required, and main-terms logistic regression was used. To derive inference, we need the plug-in influence curve (IC) for the estimation-sample-specific estimator. In this case:

ICPn,Bn0(Pn,Bn1)=Y−[I{A=dτ,QˉPn,Bn0(A,W)}/gPn,Bn1(W){Y−QˉPn,Bn1(A,W)}+QˉPn,Bn1{dτ,QˉPn,Bn0(A,W),W}]−ΨPn,Bn0(Pn,Bn1)

where (QˉPn,Bn1, gPn,Bn1) represent the estimators of (Qˉ0, g0) on the validation sample. For our specific implementation, we used an arbitrary cut-off for "significant" improvement from lowering cholesterol from the current level: a reduction in risk of CHD of greater than 2.5 % (τ=0.025). Besides the estimate of each of the training-sample-specific parameters, we also estimate the average of these across the V=10 folds as Ψn(P0)=EBnΨPn,Bn0(P0), where the standard error of the estimate was calculated using (4). In addition, an equivalent estimate of the change in CHD rate if cholesterol were lowered in all subjects was made for comparison. The goal of the estimation is to determine whether one can target fewer people while still achieving most of the reduction in the overall CHD rate.
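The algorithm-1 loop just described can be sketched as follows. Plain logistic regressions stand in for the SuperLearner fits and a g-computation plug-in stands in for the TMLE step; the simulated data and all coefficients are invented, so only the structure (rule from the training sample, estimate on the validation sample, average over folds) mirrors the analysis:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic stand-in for the WCGS variables: one covariate W,
# binary "high cholesterol" A, binary CHD outcome Y (coefficients invented).
n = 2000
w = rng.normal(size=n)
a = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.5 * w))))
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * a + 0.4 * w))))

def fit_logit(X, t, iters=25):
    """Plain logistic regression via Newton's method (stand-in for SL)."""
    X = np.column_stack([np.ones(len(t)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (t - p))
    return beta

def predict(beta, X):
    X = np.column_stack([np.ones(len(X)), X])
    return 1 / (1 + np.exp(-X @ beta))

tau, V = 0.025, 10
folds = np.array_split(rng.permutation(n), V)
fold_estimates = []
for val in folds:
    train = np.setdiff1d(np.arange(n), val)
    # The rule d_tau: fit the outcome regression on the parameter-generating
    # sample, and lower cholesterol (set A = 0) where the predicted risk
    # reduction exceeds tau; otherwise leave A as observed.
    beta = fit_logit(np.column_stack([a[train], w[train]]), y[train])
    risk1 = predict(beta, np.column_stack([np.ones(len(val)), w[val]]))
    risk0 = predict(beta, np.column_stack([np.zeros(len(val)), w[val]]))
    a_rule = np.where(risk1 - risk0 > tau, 0, a[val])
    # Plug-in estimate of E(Y) - E(Y_d) on the estimation sample; the paper
    # uses TMLE here, so this g-computation step is only a rough stand-in.
    beta_v = fit_logit(np.column_stack([a[val], w[val]]), y[val])
    fold_estimates.append(
        y[val].mean() - predict(beta_v, np.column_stack([a_rule, w[val]])).mean())

psi_n = float(np.mean(fold_estimates))
```

The fold-specific estimates play the role of the per-split points in the forest plot of Figure 4, and their average plays the role of the "Summary" estimate.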

5.1 Results

The results (Figure 4) suggest one would reduce the risk of CHD by 3.1 % (95 % CI = 2.3–3.9 %) by using the derived rule (which targets about 44 % of the population to reduce cholesterol from the observed value). However, estimating the impact of targeting all those with cholesterol > 180 mg/dL (based on the equivalent data-adaptive estimator) results in intervention in nearly double the population (A=1 in 86 % of the sample), and in a reduction of the CHD rate of 4.6 % (95 % CI = 3.7–5.5 %). We then followed up by estimating (9) using the algorithms based on theorems 2 and 3, resulting in estimates of 3.9 % (95 % CI = 2.2–5.5 %) and 3.0 % (95 % CI = 1.1–4.5 %), respectively, both of which give similar estimates and inference. This shows the potential of using the data-adaptive parameter approach when one has parameters that are complicated functions of unknown parts of the data-generating distribution. The fact that one also gets trustworthy inference for such an adaptive parameter makes this general approach a very compelling option for circumstances where a non-adaptive parameter is impractical to estimate or requires large parametric assumptions to identify.

Figure 4: Forest plot of estimates (with 95 % CIs using influence curve-based standard errors) of the validation-sample-specific and average data adaptive parameter (9), estimated using the WCGS data, with the intervention on cholesterol and the targeted group being those with an estimated average benefit due to lower cholesterol of a > 2.5 % reduction (τ=0.025) in estimated CHD. "Summary" is the estimate of the average across the validation-specific estimates, or EBnΨPn,Bn0(Pn,Bn0).

6 Conclusion

Significant scientific progress can be made by generating target parameters based on past studies, and evaluating them on future, independent data. We discussed above, however, how such costly splitting of data is potentially unnecessary; the proposed data adaptive target parameter and corresponding statistical procedure studied in this article allows for general sample splits, and averaging the results across such splits. The theoretical and simulation results demonstrate that statistical inference is preserved under minimal conditions, even though the estimators are now based on all the data. To obtain valid finite sample inference it is important to utilize our corresponding variance estimator (4), and that the sample size for the estimation sample is chosen large enough so that the second order terms of a possible non-linear estimator are controlled.

We also showed that if the algorithm that generates the target parameter is not too adaptive to small changes in the data, then no sample splitting is necessary. Specifically, if the set of influence curves generated by this parameter-generating algorithm when applied to an empirical distribution is a P0-Donsker class, then statistical inference based on the method ψn2, which uses all the data both to generate the parameter and to estimate it, is asymptotically valid. Thus, it provides a theorem for estimation and inference for so-called data-dredging. There is a large variety of data-mining applications where consistent estimation and inference are possible, including using the data to fit a finite-dimensional vector of coefficients that deterministically identifies a target parameter of interest. If the sample size is large and/or the parameter-generating algorithm is well enough understood that our Theorem 3 can be formally applied, then algorithm 3 should be considered as an important method.

We have demonstrated that the data adaptive target parameter framework provides a formalized approach for estimating target parameters that are either very hard or impossible to pre-specify. There are many examples of interest that have not been highlighted in this article, where the motivation can come from dimension reduction or complex causal parameters. There are few constraints on how one uses the data to define interesting parameters, and we expect there are many applications in Big Data situations for which this approach is particularly well-suited.

Funding statement: Patient-Centered Outcomes Research Institute (PCORI) Pilot Project Program Award (Grant/Award Number: ME-1306-02735. DISCLAIMER: All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee); National Institutes of Health (Grant/Award Number: R01AI074345-06A1); National Institute of Environmental Health Sciences (Grant/Award Number: P42ES004705).

References

1. Witten IH, Frank E. Data mining: practical machine learning tools and techniques, 3rd ed. Burlington, MA: Morgan Kaufmann, 2011.

2. Hubbard A, van der Laan M. Mining with inference: data-adaptive target parameters. In: Buhlmann P, Drineas P, Kane M, van der Laan M, editors. Handbook of big data, Handbooks of Modern Statistical Methods. Boca Raton, FL: CRC Press, 2016:439–52.

3. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. Preserving statistical validity in adaptive data analysis. arXiv preprint arXiv:1411.2664, 2014. doi: 10.1145/2746539.2746580.

4. Ragland DR, Brand RJ. Coronary heart disease mortality in the Western Collaborative Group Study. Follow-up experience of 22 years. Am J Epidemiol 1988;127:462–75. doi: 10.1093/oxfordjournals.aje.a114823.

5. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann Stat 1978;6:34–58. doi: 10.1214/aos/1176344064.

6. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003. doi: 10.1007/978-0-387-21700-0.

7. Zhang F, Chen JY. Data mining methods in omics-based biomarker discovery. Methods Mol Biol 2011;719:511–26. doi: 10.1007/978-1-61779-027-0_24.

8. Berger B, Peng J, Singh M. Computational solutions for omics data. Nat Rev Genet 2013;14:333–46. doi: 10.1038/nrg3433.

9. Ioannidis JP. Why most discovered true associations are inflated. Epidemiology 2008;19:640–8. doi: 10.1097/EDE.0b013e31818131e7.

10. Broadhurst DI, Kell DB. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2006;2:171–96. doi: 10.1007/s11306-006-0037-z.

11. Barraclough H, Govindan R. Biostatistics primer: what a clinician ought to know: subgroup analyses. J Thor Oncol 2010;5:741. doi: 10.1097/JTO.0b013e3181d9009e.

12. Marler JR. Secondary analysis of clinical trials – a cautionary note. Prog Cardiovasc Dis 2012;54:335–7. doi: 10.1016/j.pcad.2011.09.006.

13. LeDell E, Petersen M, van der Laan MJ. Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Technical report, U.C. Berkeley Division of Biostatistics Working Paper Series, http://www.bepress.com/ucbbiostat/paper304, 2012.

14. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat Methodol 2005;2:131–54. doi: 10.1016/j.stamet.2005.02.003.

15. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol 2007a;6. doi: 10.2202/1544-6115.1309.

16. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60. doi: 10.1097/00001648-200009000-00011.

17. van der Laan M, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011. doi: 10.1007/978-1-4419-9782-1.

18. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat 2006;2:Article 11. URL http://www.bepress.com/ijb/vol2/iss1/11. doi: 10.2202/1557-4679.1043.

19. Venables WN, Ripley BD. Modern applied statistics with S, 4th ed. New York: Springer, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0. doi: 10.1007/978-0-387-21706-2.

20. Gelman A, Su Y-S, Yajima M, Hill J, Grazia Pittau M, Kerman J, Zheng T. arm: data analysis using regression and multilevel/hierarchical models, 2012. R package version 1.5-08.

21. Hastie TJ, Tibshirani RJ. Generalized additive models. New York: Chapman and Hall, 1990.

22. Polley E, van der Laan M. SuperLearner: Super Learner Prediction, 2012. URL http://CRAN.R-project.org/package=SuperLearner. R package version 2.0-6.

23. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol 2007b;6:Article 25. doi: 10.2202/1544-6115.1309.

24. Ripley BD. Pattern recognition and neural networks. Cambridge: Cambridge University Press, 1996. doi: 10.1017/CBO9780511812651.

25. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1–22. doi: 10.18637/jss.v033.i01.


Supplemental Material

The online version of this article (DOI: 10.1515/ijb-2015-0013) offers supplementary material, available to authorized users.


Published Online: 2016-5-26
Published in Print: 2016-5-1

©2016 by De Gruyter
