1 Introduction
A proliferation of statistical and data science methods has accompanied the growing systematic collection of data across many scientific fields. Progress has been made in developing quantitative statistical methods well suited to exploratory analysis; however, much remains to be done for deriving estimators and robust inference of relevant parameters in such a context. Growing fields such as precision medicine and high dimensional (high throughput) biology try to capitalize on the resulting “big data” with inspired pattern-finding procedures [1]; less emphasis has been given to formally defining the parameters such procedures “discover”. Thus, an obvious first step necessary for deriving theoretical results is to explicitly define such data adaptive parameters. The goal of previous work [2] and this paper is to address the issue of rigorous inference when the target parameter is not pre-specified.
The “textbook” advice for avoiding problems of this type is to collect fresh samples from the same data distribution whenever one ends up with a procedure that depends on the existing data. Since getting fresh data is usually costly and often impractical, this amounts to randomly partitioning the available dataset into two or more disjoint sets (such as a training and a testing set) prior to the analysis. Following this approach conservatively with m adaptively chosen procedures would significantly (on average by a factor of m) reduce the amount of data available for each procedure. Our main proposed approach keeps the data-adaptive part of this sample-splitting scheme, but defines an average of the data-adaptive parameter across arbitrary splits of this sort (we emphasize V-fold cross-validation below). In this way, one can still use the power of the entire dataset while avoiding strong conditions on the algorithms used to data-adaptively define the parameters.
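To fix ideas, the following is a minimal sketch (in Python, with an assumed interface; not the authors' implementation) of the scheme just described: each parameter-generating sample defines a target-parameter mapping, the held-out estimation sample estimates it, and the split-specific estimates are averaged.

```python
# Minimal sketch of the V-fold data-adaptive parameter scheme.
# `define_parameter` and `estimate_parameter` are hypothetical callables
# supplied by the analyst; they are not from the paper.
import numpy as np
from sklearn.model_selection import KFold

def cv_data_adaptive_estimate(data, define_parameter, estimate_parameter,
                              V=10, seed=1):
    kf = KFold(n_splits=V, shuffle=True, random_state=seed)
    estimates = []
    for pg_idx, est_idx in kf.split(data):
        # target parameter defined on the parameter-generating sample only
        mapping = define_parameter(data[pg_idx])
        # ...and estimated on the held-out estimation sample
        estimates.append(estimate_parameter(mapping, data[est_idx]))
    return np.mean(estimates)  # average across the V splits
```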
1.1 Motivating example
The general methodology for estimation and inference for data adaptive parameters is presented below; here we first illustrate the method with a particularly challenging causal inference estimation problem. Consider data from the Western Collaborative Group Study (WCGS) [4], a prospective study of risk factors for coronary heart disease (CHD). The study consisted of 3,524 males (3,142 of whom had complete data), aged 39–59 and working in certain California corporations, who were enrolled at the outset of the study and followed for 8.5 years. Our goal is to estimate the impact on CHD of applying a treatment rule regarding cholesterol that is learned from the data. Define the data of interest to be
We will discuss several specific examples below, but note that the sequence of analyses often used in large-scale omic studies (genomics, proteomics, metabolomics, etc.; Zhang and Chen [7], Berger et al. [8]) can be the result of a series of suggested patterns that lead to further analyses not previously considered: from multiple testing, to clustering, to exploration of pathways, to more targeted analyses, all with the data from the same experiment. In these cases, inference that ignores that the parameters were derived data adaptively will typically be biased. Others have noted the particular danger that high-dimensional data combined with flexible methodologies will generate excessive false-positive findings (Ioannidis [9], Broadhurst and Kell [10]). In many cases, even when the best intentions are to stick to a pre-specified data analysis plan, there can be feedback from the data into the models chosen (e.g., covariates dropped, different basis functions tried, unplanned sub-group analyses conducted, etc.; Barraclough and Govindan [11], Marler [12]). Thus, it is important to have methods that allow such exploration while also providing transparent interpretation of the resulting estimates. Though there are advantages to pre-specifying the algorithm used to generate the parameter(s), the general methodology does not even require one to do so; that is, one can derive inference for methods of deriving patterns even when the precise methods used to generate the parameters are not known. Thus, it can be applied in circumstances where there is little constraint on how the data are explored to generate potential parameters of interest for estimation and inference.
2 Methodology
The following is also presented in Hubbard and van der Laan [2]. Consider observed data O1, …, On, consisting of n i.i.d. copies of a random variable O with probability distribution P0.
Given a random split Bn of the sample into a parameter-generating sample and a complementary estimation sample, the parameter-generating sample is used to define the target-parameter mapping and the estimation sample to estimate it.
The choice of target parameter mapping and corresponding estimator can be informed by the data in the parameter-generating sample.
Suppose that, given
Then,
The latter variance estimator avoids finite-sample bias by using sample splitting and may therefore be preferable in small samples. The proofs of the theorems are provided in the Supplemental Material.
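Since the displayed formulas were lost here, we note the general form such a sample-split variance estimator takes (a reconstruction consistent with the surrounding description, not necessarily eq. (4) verbatim): average, across the V splits, the empirical variance of the estimated influence curve over each estimation sample, and use it in a Wald-type interval:

$$
\hat{\sigma}_n^2 \;=\; \frac{1}{V}\sum_{v=1}^{V} \frac{1}{n_v}\sum_{i \in \mathrm{Est}_v} \widehat{\mathrm{IC}}_v(O_i)^2,
\qquad
\psi_n \;\pm\; z_{1-\alpha/2}\,\hat{\sigma}_n/\sqrt{n},
$$

where $\mathrm{Est}_v$ is the $v$-th estimation sample of size $n_v$, $\widehat{\mathrm{IC}}_v$ is the influence curve estimated on that split, and $\psi_n$ is the estimate averaged across splits.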
Theorem 1 (Asymptotic equivalence of the standardized estimator and the standardized oracle estimator). Suppose that the algorithm
2.1 Splitting the sample, but using the whole sample to fit the data adaptively generated target parameter
In the above Theorem 1, one need not assume Donsker class conditions, so that the target-parameter choices
As above, assume that, conditional on
We also assume that
Then,
The relative efficiency of the two estimators
2.2 Using the whole sample to generate the target parameter and to subsequently estimate it: no sample splitting
Consider a mapping
Assume
Again, this estimator
Note that others have examined the asymptotics of such a procedure (no sample splitting) using different theoretical approaches. For instance, Dwork et al. [3] present asymptotic consistency results for estimating a number m of data-adaptively derived functions of the data-generating distribution, as a function of m and the sample size n.
3 Examples
In this section we showcase a few examples to demonstrate the proposed procedures for generating statistical target parameters and corresponding estimators and confidence intervals. For a longer list of examples, see the Supplemental Material.
3.1 Inference for the sample-split conditional risk of a data adaptive regression estimator
The set-up is identical to that of LeDell et al. [13] and Dudoit and van der Laan [14]. Let
The estimator of
Similarly, Theorem 2 implies a formal result for
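To make the construction concrete, here is a hedged sketch of the cross-validated conditional-risk estimator with an influence-curve-based Wald interval; the influence curve of a mean of losses is simply the centered loss, and the random forest below is an arbitrary stand-in for the analyst's adaptive learner (not the paper's choice).

```python
# Sketch: CV risk of a data-adaptive regression, with an IC-based 95% CI.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def cv_risk_ci(W, Y, V=10, seed=1):
    kf = KFold(n_splits=V, shuffle=True, random_state=seed)
    risks, ic_vars = [], []
    for pg, est in kf.split(W):
        fit = RandomForestRegressor(random_state=seed).fit(W[pg], Y[pg])
        loss = (Y[est] - fit.predict(W[est])) ** 2  # squared-error loss on estimation sample
        risks.append(loss.mean())
        ic_vars.append(loss.var())                  # variance of the centered loss = IC variance
    psi = np.mean(risks)                            # average of split-specific risks
    se = np.sqrt(np.mean(ic_vars) / len(Y))         # sample-split variance estimator
    return psi, (psi - 1.96 * se, psi + 1.96 * se)
```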
3.2 Inference for sample-split subgroup-specific causal effect, where the subgroups are data adaptively determined
Consider “discovering” sub-groups within the target population that have unique relationships with an explanatory variable of interest (e.g., drug treatment, environmental exposure, etc.). When these sub-groups are not defined apart from the data, post-hoc sub-group analysis is typically treated as purely exploratory, and thus the statistical inference is inherently flawed, typically anti-conservatively. However, the approach we have outlined provides an explicit framework for aggressively searching for interesting sub-groups while still allowing consistent statistical inference for the resulting estimators of association parameters.
Suppose that we observe on each subject
Let
Define
Under the Donsker class condition on
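A hedged illustration of the idea (assuming, for simplicity, a randomized binary treatment A; the subgroup rule and learners below are our own stand-ins, not the paper's): learn the subgroup on the parameter-generating sample, then estimate the subgroup-specific effect on the estimation sample, whose inference is unaffected by the search.

```python
# Sketch: data-adaptive subgroup discovery with honest inference.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def discover_rule(W, A, Y, depth=2, seed=1):
    # Crude CATE proxy: per-arm outcome trees; subgroup = predicted benefit > 0.
    f1 = DecisionTreeRegressor(max_depth=depth, random_state=seed).fit(W[A == 1], Y[A == 1])
    f0 = DecisionTreeRegressor(max_depth=depth, random_state=seed).fit(W[A == 0], Y[A == 0])
    return lambda Wnew: (f1.predict(Wnew) - f0.predict(Wnew)) > 0

def subgroup_effect(rule, W, A, Y):
    S = rule(W)                                   # apply the learned rule to the estimation sample
    y1, y0 = Y[S & (A == 1)], Y[S & (A == 0)]
    est = y1.mean() - y0.mean()                   # difference in means (randomized A assumed)
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return est, se
```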
4 Simulations
We examined the performance of the three procedures based on Theorems 1, 2 and 3 (referred to as algorithms 1, 2 and 3) across different algorithms producing the data-adaptive target parameters. The following steps give the structure of the simulations and also provide some basis for understanding the data-adaptive parameter; a compact code sketch of these steps follows the list.
1. Generate a random sample of size n from the data-generating distribution and break it into V equal-size estimation samples (of size n/V), with corresponding parameter-generating samples (of size n − n/V).
2. For each parameter-generating sample, apply the data-adaptive algorithm to define the parameter to be estimated on the corresponding estimation sample. For instance, fit a data-adaptive regression procedure estimating the mean of the outcome Y given predictors W, and define the target parameter as the risk based on squared-error loss, treating the fitted regression as fixed and known.
3. For each of the V estimation samples, estimate the data-adaptive parameter; in the risk example of step 2, this is the average of the squared residuals over the estimation sample. In addition, derive the influence curve of this estimator for each of the sample splits.
4. To derive the value of the true parameter corresponding to each parameter-generating sample, draw a very large sample (100,000 observations) from the same distribution, representing the target population; the true value is evaluated with P0 approximated by the empirical distribution of this very large sample.
5. Estimate the asymptotic variance (4) of the averaged estimator based on the sample variance of the estimated influence curve within estimation samples (see Theorem 1 above), and construct a corresponding Wald-type confidence interval.
6. Repeat steps 1–5 for 1,000 simulations, examine the distribution of the standardized differences, and determine the coverage probabilities of the confidence intervals.
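The following compact sketch implements steps 1–6 under illustrative assumptions (a made-up data-generating law, squared-error risk as the data-adaptive parameter, and a random forest as the adaptive learner; none of these are the paper's exact design).

```python
# Sketch of simulation steps 1-6 for the risk target parameter.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

def draw(n):                          # illustrative data-generating distribution
    W = rng.normal(size=(n, 3))
    Y = W[:, 0] + np.sin(W[:, 1]) + rng.normal(size=n)
    return W, Y

W_pop, Y_pop = draw(100_000)          # step 4: large sample standing in for P0

def one_simulation(n=500, V=10):
    W, Y = draw(n)                    # step 1: sample and V-fold split
    est, truth, ic_var = [], [], []
    for pg, ev in KFold(V, shuffle=True, random_state=0).split(W):
        fit = RandomForestRegressor(n_estimators=50, random_state=0).fit(W[pg], Y[pg])  # step 2
        loss = (Y[ev] - fit.predict(W[ev])) ** 2                    # step 3: estimate on split
        est.append(loss.mean())
        ic_var.append(loss.var())                                   # IC variance within split
        truth.append(np.mean((Y_pop - fit.predict(W_pop)) ** 2))    # step 4: truth under P0
    psi, psi0 = np.mean(est), np.mean(truth)
    se = np.sqrt(np.mean(ic_var) / n)                               # step 5: variance estimator
    return (psi - psi0) / se, abs(psi - psi0) <= 1.96 * se

stats = [one_simulation() for _ in range(200)]   # step 6 (1,000 repetitions in the paper)
print("coverage:", np.mean([covered for _, covered in stats]))
```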
4.1 Risk estimation of a data adaptive prediction algorithm
More background on the conditional risk parameter is given in Section 3.1; in this case the goal is estimation of, and inference for, the “fit” of a machine learning algorithm. The data are

True model
Citation: The International Journal of Biostatistics 12, 1; 10.1515/ijb-2015-0013
4.1.1 Results
We examined the empirical distribution of the standardized differences,

Distribution of
Citation: The International Journal of Biostatistics 12, 1; 10.1515/ijb-2015-0013
Table 1: Simulation results for risk estimation for data-adaptive prediction, based on Theorems 1–3.
| N | Method | Average true parameter | Average estimate | Cov. prob. |
| | Algorithm 1 | 0.55 | 0.55 | 0.93 |
| | Algorithm 2 | 0.54 | 0.54 | 0.92 |
| | Algorithm 3 | 0.45 | 0.40 | 0.78 |
| | Algorithm 1 | 0.55 | 0.55 | 0.94 |
| | Algorithm 2 | 0.55 | 0.55 | 0.94 |
| | Algorithm 3 | 0.46 | 0.46 | 0.92 |
| | Algorithm 1 | 0.40 | 0.40 | 0.95 |
| | Algorithm 2 | 0.40 | 0.41 | 0.95 |
| | Algorithm 3 | 0.36 | 0.36 | 0.93 |
We also examined the same procedure for estimating the risk difference using algorithm 2. In this case, we observe slower convergence, but still relatively good coverage for an estimate that is particularly sensitive to over-fitting.
4.2 Average treatment effect for a given prediction model
The Average Treatment Effect, or ATE, is commonly the parameter of interest in applications of causal inference methods (Rubin [5]). We use the same set-up as we did for the example in the introduction, but for the ATE: ψ0 = E0[E0(Y | A = 1, W) − E0(Y | A = 0, W)].
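As a point of reference, a simple substitution (G-computation) version of this parameter can be sketched as below: the outcome regression is fit data-adaptively on the parameter-generating sample and the ATE is then evaluated on the estimation sample. The paper's own analyses use targeted maximum likelihood estimation [17, 18], which adds a bias-reducing targeting step omitted here; the learner is an arbitrary stand-in.

```python
# Sketch: plug-in (G-computation) ATE with a data-adaptive outcome regression.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ate_gcomp(W_pg, A_pg, Y_pg, W_est):
    # fit E(Y | A, W) data-adaptively on the parameter-generating sample
    Q = GradientBoostingRegressor().fit(np.column_stack([A_pg, W_pg]), Y_pg)
    n = len(W_est)
    pred = lambda a: Q.predict(np.column_stack([np.full(n, a), W_est]))
    return np.mean(pred(1) - pred(0))   # substitution estimate on the estimation sample
```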

Citation: The International Journal of Biostatistics 12, 1; 10.1515/ijb-2015-0013
To derive
Table 2: Simulation results for the ATE for the algorithms based on Theorems 1–3.
| Method | Average true parameter | Average estimate | MSE | Cov. prob. |
| Algorithm 1 | 1.71 | 0.89 | 0.82 | 1.19 |
| Algorithm 2 | 1.71 | 0.89 | 0.82 | 0.51 |
| Algorithm 3 | 1.68 | 0.89 | 0.79 | 0.63 |
We applied algorithms 1–3 for a sample size of
4.2.1 Results
Examining the empirical distribution of the standardized differences,
4.3 Variable reduction
We consider a situation that has an analogue in high-dimensional ‘omic data, where multiple testing is often done to target a relatively small subset of (for instance) genes among tens of thousands of candidates. The method evaluated in this simulation uses the parameter-generating sample to select a small subset of the original genes, and subsequently uses the estimation sample to estimate the effect of these genes on some phenotype. In this manner, it avoids the need to apply multiple-testing procedures that control a type-I error rate among a very large number of tests.
Let
The estimator of (6) based on the estimation sample is simply
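A hedged sketch of this two-stage procedure (the screening rule and cutoff k below are illustrative, not the paper's): select a small set of columns on the parameter-generating sample by marginal association, then fit and report standard inference for those columns on the estimation sample, which never saw the selection.

```python
# Sketch: screen on the parameter-generating sample, estimate on the
# estimation sample; CIs are honest because selection never used y_est.
import numpy as np
import statsmodels.api as sm

def screen_then_estimate(X_pg, y_pg, X_est, y_est, k=5):
    cors = np.abs([np.corrcoef(X_pg[:, j], y_pg)[0, 1]
                   for j in range(X_pg.shape[1])])
    keep = np.argsort(cors)[-k:]                        # top-k marginal correlations
    fit = sm.OLS(y_est, sm.add_constant(X_est[:, keep])).fit()
    return keep, fit.params, fit.conf_int()
```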
4.3.1 Results
The results of the simulation are shown in Table 3 for the set with
5 Data analysis: Data adaptive estimation of the impact of interventions on cholesterol in the WCGS study
We re-visit the original example discussed in the introduction: estimating the impact of the targeted treatment of cholesterol on rates of coronary heart disease (CHD) (3). Again, the data are
Table 3: Simulation results for variable reduction for the algorithms based on Theorems 1–3.
| N | Method | Average true parameter | Average estimate | Cov. prob. |
| | Algorithm 1 | 2.24 | 2.25 | 0.83 |
| | Algorithm 2 | 2.24 | 2.58 | 0.0060 |
| | Algorithm 3 | 2.63 | 2.57 | 0.014 |
| | Algorithm 1 | 2.48 | 2.48 | 0.91 |
| | Algorithm 2 | 2.48 | 2.55 | 0.66 |
| | Algorithm 3 | 2.49 | 2.56 | 0.69 |
| | Algorithm 1 | 2.51 | 2.51 | 0.92 |
| | Algorithm 2 | 2.51 | 2.58 | 0.77 |
| | Algorithm 3 | 2.51 | 2.57 | 0.80 |
| | Algorithm 1 | 2.56 | 2.56 | 0.94 |
| | Algorithm 2 | 2.56 | 2.58 | 0.86 |
| | Algorithm 3 | 2.56 | 2.58 | 0.88 |
| | Algorithm 1 | 2.51 | 2.51 | 0.96 |
| | Algorithm 2 | 2.51 | 2.52 | 0.95 |
| | Algorithm 3 | 2.51 | 2.52 | 0.94 |
where, as above,
5.1 Results
The results (Figure 4) suggest that one would reduce the risk of CHD by 3.1 % (95 %

Forest plot of estimates (with 95 % CI’s using influence curve-based standard errors) of validation-sample-specific and average data adaptive parameter (9), estimated using the WCGS data, with the intervention on cholesterol, and targeted group being those with an estimated average benefit due to lower cholesterol of a
Citation: The International Journal of Biostatistics 12, 1; 10.1515/ijb-2015-0013
6 Conclusion
Significant scientific progress can be made by generating target parameters based on past studies and evaluating them on future, independent data. We discussed above, however, how such costly splitting of data is potentially unnecessary; the proposed data adaptive target parameter and corresponding statistical procedure studied in this article allow for general sample splits and averaging of the results across such splits. The theoretical and simulation results demonstrate that statistical inference is preserved under minimal conditions, even though the estimators are now based on all the data. To obtain valid finite-sample inference it is important to use our corresponding variance estimator (4), and to choose the estimation-sample size large enough that the second-order terms of a possibly non-linear estimator are controlled.
We also showed that if the algorithm that generates the target parameter is not too adaptive to small changes in the data, then no sample splitting is necessary. Specifically, if the set of influence curves generated by this parameter-generating algorithm when applied to an empirical distribution is a P0-Donsker class, then statistical inference based on the method
We have demonstrated that the data adaptive target parameter framework provides a formalized approach for estimating target parameters that are either very hard or impossible to pre-specify. There are many examples of interest that have not been highlighted in this article, where the motivation can come from dimension reduction or complex causal parameters. There are few constraints on how one uses the data to define interesting parameters, and we expect there are many applications in Big Data settings for which this approach is particularly well suited.
References
1. Witten IH, Frank E. Data mining: practical machine learning tools and techniques, 3rd ed. Burlington, Massachusetts: Morgan Kaufmann, 2011.
2. Hubbard A, van der Laan M. Mining with inference: data-adaptive target parameters. In: P Buhlmann, P Drineas, M Kane, M van der Laan, editors. Handbook of big data, Handbook of Modern Statistical Methods. Boca Raton, FL: CRC Press, 2016:439–52.
3. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. Preserving statistical validity in adaptive data analysis. arXiv preprint arXiv:1411.2664, 2014.
4. Ragland DR, Brand RJ. Coronary heart disease mortality in the Western Collaborative Group Study. Follow-up experience of 22 years. Am J Epidemiol 1988;127:462–75.
5. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann Stat 1978;6:34–58.
6. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003.
7. Zhang F, Chen JY. Data mining methods in omics-based biomarker discovery. Methods Mol Biol 2011;719:511–26.
8. Berger B, Peng J, Singh M. Computational solutions for omics data. Nat Rev Genet 2013;14:333–46.
9. Ioannidis JP. Why most discovered true associations are inflated. Epidemiology 2008;19:640–8.
10. Broadhurst DI, Kell DB. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2006;2:171–96.
11. Barraclough H, Govindan R. Biostatistics primer: what a clinician ought to know: subgroup analyses. J Thor Oncol 2010;5:741.
12. Marler JR. Secondary analysis of clinical trials – a cautionary note. Prog Cardiovas Dis 2012;54:335–7.
13. LeDell E, Petersen M, van der Laan MJ. Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Technical report, U.C. Berkeley Division of Biostatistics Working Paper Series, http://www.bepress.com/ucbbiostat/paper304, 2012.
14. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat Methodol 2005;2:131–54.
15. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol 2007a;6.
16. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60.
17. van der Laan M, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.
18. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat 2006;2, Article 11. URL http://www.bepress.com/ijb/vol2/iss1/11.
19. Venables WN, Ripley BD. Modern applied statistics with S, 4th ed. New York: Springer, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0.
20. Gelman A, Su Y-S, Yajima M, Hill J, Grazia Pittau M, Kerman J, Zheng T. arm: data analysis using regression and multilevel/hierarchical models, 2012. URL http://CRAN.R-project.org/package=arm. R package version 1.5-08.
21. Hastie TJ, Tibshirani RJ. Generalized additive models. New York: Chapman and Hall, 1990.
22. Polley E, van der Laan M. SuperLearner: Super Learner Prediction, 2012. URL http://CRAN.R-project.org/package=SuperLearner. R package version 2.0-6.
23. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol 2007b;6:Article 25.
24. Ripley BD. Pattern recognition and neural networks. Cambridge, New York: Cambridge University Press, 1996.
25. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1–22. ISSN 1548-7660.
Supplemental Material
The online version of this article (DOI: 10.1515/ijb-2015-0013) offers supplementary material, available to authorized users.




