All models are wrong, but which are useful? Comparing parametric and nonparametric estimation of causal effects in finite samples

There is a long-standing debate in the statistical, epidemiological, and econometric fields as to whether nonparametric estimation that uses data-adaptive methods, like machine learning algorithms in model fitting, confers any meaningful advantage over simpler, parametric approaches in real-world, finite sample estimation of causal effects. We address the question: when trying to estimate the effect of a treatment on an outcome, across a universe of reasonable data distributions, how much does the choice of nonparametric vs. parametric estimation matter? Instead of answering this question with simulations that reflect a few chosen data scenarios, we propose a novel approach that evaluates performance across thousands of data-generating mechanisms drawn from nonparametric models with semi-informative priors. We call this approach a Universal Monte-Carlo Simulation. We compare performance in estimating the average treatment effect across two parametric estimators (a g-computation estimator that uses a parametric outcome model and an inverse probability of treatment weighted estimator) and two nonparametric estimators (Bayesian additive regression trees and a targeted minimum loss-based estimator that uses an ensemble of machine learning algorithms in model fitting). We summarize estimator performance in terms of bias, confidence interval coverage, and mean squared error. We find that the nonparametric estimators nearly always outperform the parametric estimators, with the exception of having similar performance in terms of bias and similar-to-slightly-worse performance in terms of coverage under the smallest sample size of N = 100.

1 Introduction

The past two decades have seen rapid growth of nonparametric statistical estimation methods (e.g., Van Der Laan and Rubin, 2006; Chernozhukov et al., 2018; Hill, 2011; LeCun et al., 2015; Hahn, 1998). A large and growing subset of the statistical, epidemiological, econometric, and other applied literatures takes as a given that nonparametric estimation methods are superior to parametric approaches, particularly in terms of reducing or eliminating model misspecification bias, which can be substantial in an over-simplified parametric model. (Minimizing all sources of bias is essential, especially because when bias remains on the same order as the standard error, the probability of a corresponding hypothesis test rejecting the null will tend to one even when no effect is present.) However, few have contributed evidence of the extent to which this assumption of nonparametric estimator superiority is borne out in real-world, finite-sample analyses, leading to debate as to whether nonparametric methods confer any meaningful practical advantage in typical applied estimation of causal effects (Imbens, 2004).
Both nonparametric and parametric estimation methods generally involve using theory and subject-matter knowledge to inform the underlying causal model/graph (Pearl, 2009) and the variables input into the model (Zhao and Hastie, 2019). However, whereas parametric estimation methods proceed by specifying parametric models (e.g., linear regression) for each component of the causal model used in estimation (e.g., the outcome model), including its functional form, interactions, higher-order variable forms, etc., nonparametric methods typically involve using data-adaptive methods, such as machine learning algorithms, to flexibly fit model components. Because correctly specifying parametric models a priori seems unlikely in the absence of a clear mechanistic understanding, nonparametric methods can guard against bias due to model misspecification.
Consequently, nonparametric methods may be expected to have an advantage when an accurate model is important for obtaining an unbiased estimate, e.g., in the presence of significant, complex confounding or censoring, especially when response and/or treatment assignment surfaces are nonlinear (Dorie et al., 2019). In real data analyses, this may happen in observational studies with strong, complex confounding of the exposure-outcome relationship, or in both observational studies and randomized controlled trials with significant, differential dropout, mediation questions, or estimation of heterogeneous treatment effects (Wendling et al., 2018).
Previous simulation studies comparing performance between parametric and nonparametric estimation have generally shown that nonparametric estimators incorporating data-adaptive, machine learning algorithms outperform parametric approaches, especially under the data structures enumerated above (Dorie et al., 2019; Porter et al., 2011; Ozery-Flato et al., 2018). However, common critiques of simulation studies are that they (a) are typically designed to illustrate a difference in performance (or lack thereof), and thereby reflect hand-picked settings favorable to the method of interest; (b) are oversimplified "stylized models of reality"; and (c) may hold little value if the goal is to make a general statement about the degree to which the choice of nonparametric versus parametric estimation matters (Kreif and DiazOrdaz, 2019; Keele and Small, 2018).
Responding to the critique that simulation-based evaluations of estimator performance have been conducted on oversimplified data too far removed from the complexity of real-world datasets, two groups recently proposed more comprehensive, yet tailored, evaluations of estimator performance by simulating datasets that closely mimic an observed dataset in all its complexity (similar to the idea of "plasmode simulation"; Franklin et al., 2014) (Schuler et al., 2017; Parikh et al., 2022). This work is premised on the belief that the optimal estimator will differ for different data scenarios, hence the focus on generating simulated data to match a particular observed dataset. However, by using the same data to both select an estimator (on the basis of its performance on a new data-generating mechanism based on the observed data) and evaluate this estimator, these approaches are vulnerable to problems with post-selection inference. Given sufficiently large sample sizes, this problem can be circumvented via sample splitting; however, practitioners can seldom afford to discard large amounts of data to perform estimator selection.
The current debate about how much the choice of nonparametric versus parametric estimation may matter in the real world asks a more general question than the current literature can answer. Namely, across a universe of reasonable data distributions, how much does the choice matter? We propose an approach to answer such a question: a Universal Monte-Carlo Simulation that, instead of considering just a few data-generating mechanisms, summarizes estimator performance across thousands of data-generating mechanisms. Unlike the approaches of Schuler et al. (2017) and Parikh et al. (2022), ours does not depend on the observed data. We apply this approach to provide a general quantification of the degree to which choosing a nonparametric versus parametric estimation approach impacts bias, confidence interval coverage, and mean squared error in the resulting causal effect estimates.
This paper is organized as follows. In Section 2, we introduce notation and the causal estimand we consider, and summarize the necessary theory underlying parametric and nonparametric approaches for estimating the causal estimand. In Section 3, we describe our proposed Universal Monte-Carlo Simulation for comparing parametric and nonparametric estimation; in the same section, we provide the specific parameters that define the universe of data-generating mechanisms we consider, the parametric and nonparametric estimators we compare, and how we quantify and summarize estimator performance in finite samples. In Section 4, we provide and discuss results. Section 5 concludes.
2 Notation, estimands, and background on parametric and nonparametric estimation

We focus on estimation of the average treatment effect (ATE) in our simulations. The ATE is the average difference in the expected outcomes under treatment versus under control. For simplicity, in this section, we focus on one component of the contrast: the expected counterfactual outcome had treatment been set to some value t, possibly contrary to fact, denoted E(Y(t)), where T denotes a binary treatment variable and Y denotes the outcome variable. The ATE would then be denoted E(Y(1)) − E(Y(0)). We assume that the exchangeability assumption holds, Y(t) ⊥⊥ T | X for t ∈ {0, 1}, where X denotes confounding variables. This is necessary for the causal parameter, E(Y(t)), to be identified from the observed data, Z = (X, T, Y), by the parameter θ(P) = E{E(Y | T = t, X)}, a function of the data distribution P.
We note that θ(P) constitutes the building block for estimation of common marginal causal effects: the ATE, identified by E{m(1, X) − m(0, X)}; the relative risk, identified by E{m(1, X)} / E{m(0, X)}; and the odds ratio, identified by [E{m(1, X)} / E{1 − m(1, X)}] / [E{m(0, X)} / E{1 − m(0, X)}], where m(t, x) represents E(Y | T = t, X = x). Each of these parameters can be represented as a functional mapping a probability distribution P to the real numbers, and can be denoted θ(P).
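To make concrete how all three functionals are computed from the same building block m(t, x), consider the following sketch. The regression function and covariate distribution here are hypothetical illustrative choices, not from the paper; expectations over P(x) are approximated by a large Monte Carlo sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical known outcome regression m(t, x) = expit(-0.5 + t + 0.8 x)
def m(t, x):
    return expit(-0.5 + 1.0 * t + 0.8 * x)

# Large covariate sample to approximate expectations over P(x)
x = rng.normal(size=100_000)

m1, m0 = m(1, x).mean(), m(0, x).mean()  # E{m(1,X)}, E{m(0,X)}

ate = m1 - m0                                      # average treatment effect
rr = m1 / m0                                       # relative risk
odds_ratio = (m1 / (1 - m1)) / (m0 / (1 - m0))     # odds ratio
```

Note that each effect measure averages m(t, X) over the marginal covariate distribution first, then forms the contrast; this is what distinguishes these marginal parameters from conditional ones.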
When the exposure takes values in a discrete set, like the binary exposure we consider here, the above parameters can be nonparametrically estimated, yet with the same asymptotic properties as if they were parametrically estimated, as long as the dimension of X is fixed. Taking the ATE as an example, this means that it is possible to find an estimator ÂTE such that, as the sample size n grows, √n(ÂTE − ATE) converges to a random variable that is normally distributed with mean zero and variance equal to the nonparametric efficiency bound, i.e., the smallest possible variance attainable by a regular estimator; see Hahn (1998). This result is useful because it means that the normal distribution can be used to approximate the sampling distribution of ÂTE in finite samples, which allows us to construct approximately correct confidence intervals and hypothesis tests.

Parametric estimation with substitution estimators
We can alternatively write θ(P), defined above, as

θ(P) = ∫ m(t, x) dP(x).   (1)

A natural estimator for this parameter is given by the following procedure:

1. Fit a model for m(t, x), e.g., the logistic regression model logit m_β(t, x) = β₀ + β₁t + β₂x. Let β̂ denote the maximum likelihood estimate (MLE).
2. For each subject i = 1, . . ., n, compute the predicted outcome m_β̂(t, Xᵢ) in a hypothetical world where the treatment level is set to Tᵢ = t for everyone. For the logistic regression model above, this is m_β̂(t, Xᵢ) = expit(β̂₀ + β̂₁t + β̂₂Xᵢ).

3. Compute the substitution estimator of θ(P) as

θ(P̂) = (1/n) Σᵢ m_β̂(t, Xᵢ).

This estimator is often referred to as the g-computation estimator in the biostatistics and epidemiology literatures. Note that we have denoted this estimator θ(P̂), since it is the result of plugging a regression estimate of m(t, x) and the empirical distribution estimate of P(x) into the definition of the parameter (1).
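The steps above can be sketched end-to-end in a minimal numpy implementation. The simulated data-generating mechanism and the Newton-Raphson logistic fit are our own illustrative choices (in practice one would use a standard GLM routine):

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda z: 1 / (1 + np.exp(-z))

# Illustrative confounded dataset: X affects both T and Y
n = 20_000
x = rng.normal(size=n)
t = rng.binomial(1, expit(0.5 * x))
y = rng.binomial(1, expit(-0.5 + 1.0 * t + 0.8 * x))

# Step 1: fit logit m_beta(t, x) = b0 + b1 t + b2 x by Newton-Raphson MLE
D = np.column_stack([np.ones(n), t, x])
beta = np.zeros(3)
for _ in range(25):
    p = expit(D @ beta)
    grad = D.T @ (y - p)                             # score
    hess = -(D * (p * (1 - p))[:, None]).T @ D       # Hessian of log-likelihood
    beta -= np.linalg.solve(hess, grad)              # Newton update

# Steps 2-3: predict with treatment set to each level, then average
def substitution_estimate(t_val):
    D_t = np.column_stack([np.ones(n), np.full(n, t_val), x])
    return expit(D_t @ beta).mean()

ate_gcomp = substitution_estimate(1) - substitution_estimate(0)
```

Because the fitted model matches the true outcome mechanism here, the g-computation estimate is consistent for the true ATE; under misspecification the same code runs but inherits the bias B_sub(P) discussed below.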
Even when the parametric logistic regression model is wrong, the procedure above can still be applied to estimate parameters such as the ATE, RR, and OR. This is not only true for logistic regression; any regression fit m̂(t, x) (including machine learning fits) can be mapped to an estimator of θ(P) using the above procedure.
When the regression m(t, x) is estimated with a parametric model (e.g., logistic, linear, or Poisson regression), standard asymptotic tools such as the delta method yield the following result. Let β̄ denote the probability limit of the MLE β̂ of the model parameters, and let m_β(t, x) denote the model fit for any value β. Then, we have

θ̂_sub ≈ N{θ(P) + B_sub(P), σ²_sub / n},

where B_sub(P) = ∫ {m_β̄(t, x) − m(t, x)} dP(x) denotes the asymptotic bias of the estimator and σ²_sub ≥ 0 denotes the asymptotic variance. We note that the bias depends on P through three quantities: the limit β̄ of the model parameters, the true function m(t, x), and the probability distribution of the covariates P(x). The approach we propose in Section 3 allows us to summarize the distribution of B_sub(P) (which depends on P, but is fixed for a given P) across a wide range of data-generating mechanisms.

Nonparametric estimation using machine learning for model fitting
When the regression m(t, x) is estimated with a machine learning (data-adaptive) algorithm, the delta method cannot be applied and other tools must be used for inference. Specifically, the substitution estimator that uses machine learning in model fitting, θ(P̂), has the following first-order bias: −E[φ(Z; P)], where

φ(Z; P) = [1{T = t} / g(t | X)] {Y − m(t, X)} + m(t, X) − θ(P),

and g(t | x) = P(T = t | X = x) is the propensity score (Rosenbaum and Rubin, 1983).
This bias expression is at the core of frequentist nonparametric efficient estimators, such as targeted minimum loss-based estimation (TMLE; van der Laan and Rose, 2011, 2018) and double/debiased machine learning (Chernozhukov et al., 2018). TML estimators are constructed such that the above bias is approximately equal to zero, whereas double/debiased ML estimators focus on obtaining a reasonable estimate of the bias and subtracting it from the substitution estimator. For example, the one-step estimator (in this case, also known as the augmented inverse probability weighted estimator) may be constructed as

θ̂_os = θ(P̂) + (1/n) Σᵢ φ(Zᵢ; P̂),

and the TMLE is equal to θ̂_tmle = θ(P̃), where P̃ is an updated estimate of P constructed such that (1/n) Σᵢ φ(Zᵢ; P̃) ≈ 0. Under regularity conditions, it can be shown that if the estimators m̂ and ĝ converge to some functions m̄ and ḡ, then the one-step and TML estimators are doubly robust. That is, the asymptotic bias of θ̂_tmle, B_tmle(P), is zero if either ḡ = g or m̄ = m, i.e., if either the outcome regression or propensity score estimator is consistent. When both equalities hold (i.e., both are consistent), these estimators are asymptotically normal and semiparametric efficient, meaning √n{θ̂_tmle − θ(P)} converges to N(0, σ²_eff), where σ²_eff is the semiparametric efficiency bound. The asymptotic variance σ²_eff can be consistently estimated by the sample variance of φ(Zᵢ; P̂). However, we note that nonparametric Bayesian estimators are not constructed around this bias term, and instead use flexible Bayesian formulations for modeling coupled with a plug-in estimation strategy.
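A minimal sketch of the one-step (AIPW) correction, assuming estimates of m and g are already in hand. To illustrate double robustness, the data-generating mechanism below is our own hypothetical choice; the propensity score is known exactly while the outcome model is deliberately misspecified (a constant fit), yet the one-step estimator remains approximately unbiased:

```python
import numpy as np

rng = np.random.default_rng(2)
expit = lambda z: 1 / (1 + np.exp(-z))

n = 50_000
x = rng.normal(size=n)
g1 = expit(0.5 * x)                 # true propensity score P(T=1 | X)
t = rng.binomial(1, g1)
y = rng.binomial(1, expit(-0.5 + 1.0 * t + 0.8 * x))

def one_step(t_val, m_hat, g_hat):
    """One-step estimate of E{m(t_val, X)}: plug-in plus the sample
    mean of the correction term phi (up to the centering constant)."""
    g_t = g_hat if t_val == 1 else 1 - g_hat
    return np.mean((t == t_val) / g_t * (y - m_hat) + m_hat)

# Deliberately misspecified outcome "model": constant predictions
m1_bad = np.full(n, y[t == 1].mean())
m0_bad = np.full(n, y[t == 0].mean())

ate_aipw = one_step(1, m1_bad, g1) - one_step(0, m0_bad, g1)
```

With a consistent ĝ, the inverse-weighted residual term cancels the bias of the misspecified plug-in, which is exactly the double robustness property described above.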

Theoretical comparison of parametric and nonparametric estimator bias
B_tmle(P) is an integral of a product of two regression errors. If the two errors are small, this product will generally be smaller than the single error involved in the definition of B_sub(P). Thus, it will often be the case that |B_tmle(P)| ≤ |B_sub(P)|. Use of machine learning to fit m and g will typically make the regression errors involved in B_tmle(P) even smaller. This may also be true for the bias of θ̂_sub, but substitution estimators based on naïve machine learning are not generally asymptotically normal, which challenges the construction of sampling distributions and is therefore problematic for frequentist statistical inference (though we note it is not a problem for Bayesian statistical inference). Because B_tmle(P) involves the product of two regression errors, it will be zero if either m or g is known, a property known as double robustness.
In randomized trials, g(t | x) is known by design, and therefore B_tmle(P) (unlike B_sub(P)) can be made exactly zero due to double robustness. Even in the randomized trial setting, covariate adjustment estimators based on modeling θ(P) are typically preferred to unadjusted estimators based on E(Y | T = t), because covariate adjustment estimators can use baseline characteristics to control for noise in the outcome, thereby yielding more precise effect estimates.
Based on these asymptotic results, nonparametric estimators of θ(P) are expected to have better performance than parametric estimators in very large samples. However, asymptotic approximations may be poor in small or even moderately-sized real-world samples that are afflicted by the curse of dimensionality, resulting in bias and under-coverage of confidence intervals (Robins and Ritov, 1997). The sample size at which asymptotic approximations reflect reality is generally problem-dependent and unknown. In the next sections, we develop a simulation-based approach to evaluate if and when nonparametric estimators of θ(P) can be expected to outperform parametric estimators across a range of finite samples, under minimal but reasonable assumptions on the nature of the true data generating mechanism.
3 A Universal Monte-Carlo Simulation approach to systematically evaluate the finite sample performance of estimators

We propose the following general method to systematically evaluate and compare the performance of a set of candidate estimators in finite samples across a largely unrestricted space of data distributions. To define the space of distributions, we assume some information about the data generating mechanism (e.g., in terms of the number and type of variables, confounding bias, treatment effect heterogeneity, and degree of nonlinearity), drawing on the nonparametric Bayesian literature.
1. For each j ∈ {1, . . ., J}, we draw a probability distribution P_j using a minimally-informative prior across the space of distributions (i.e., nonparametric model), M.
The minimally-informative distribution on M reflects the fact that investigators often have little knowledge of the data generating mechanism before seeing the data. We describe our procedure in §3.1.
2. Then, for each of the J data generating mechanisms, we generate S simulated datasets. So, for each (j, s) we have a dataset D_{j,s} of some finite sample size n drawn from P_j.

3. For each dataset D_{j,s}, we apply each candidate estimator l to obtain an estimate θ̂_{l,j,s} and an associated confidence interval.
4. We next compute performance metrics based on sample averages across the S simulated draws. For distribution P_j and sample size n, the Monte Carlo bias, confidence interval coverage, and MSE of a given estimator θ̂_l are

bias_{l,j} = (1/S) Σ_s {θ̂_{l,j,s} − θ(P_j)},
coverage_{l,j} = (1/S) Σ_s 1{θ̂_{l,j,s} − c·se_{l,j,s} ≤ θ(P_j) ≤ θ̂_{l,j,s} + c·se_{l,j,s}},
MSE_{l,j} = (1/S) Σ_s {θ̂_{l,j,s} − θ(P_j)}²,

where se_{l,j,s} is the estimated standard error and c denotes the 97.5th percentile of the standard normal distribution.
5. Lastly, we summarize performance across the J data distributions with a probability distribution of each performance metric (e.g., the bias). Specifically, the bias for estimator l at sample size n is summarized by

R_l(x) = (1/J) Σ_j 1{|bias_{l,j}| > x}.

This function can be seen as an approximation of the so-called survival or reliability function, where the former name is common in the statistics literature and the latter in the engineering literature. We use the name reliability function because it is closer to the intended interpretation for each estimation procedure. Because we consider the sample space of probability distributions as all possible data generating mechanisms (i.e., phenomena under study), the probability distribution can be heuristically interpreted as probabilities across all possible studies.
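Steps 4 and 5 can be sketched as follows, given a J × S array of estimates and the vector of true parameter values. The array layout, function names, and the illustrative "well-behaved estimator" at the end are our own choices:

```python
import numpy as np

C = 1.959963985  # 97.5th percentile of the standard normal

def performance(estimates, ses, truth):
    """estimates, ses: (J, S) arrays of estimates and standard errors
    across J distributions and S simulated datasets; truth: length-J
    vector of theta(P_j). Returns per-distribution bias, coverage, MSE."""
    err = estimates - truth[:, None]
    bias = err.mean(axis=1)
    coverage = (np.abs(err) <= C * ses).mean(axis=1)
    mse = (err ** 2).mean(axis=1)
    return bias, coverage, mse

def reliability(bias, x_grid):
    """Proportion of the J distributions with |bias| exceeding each x."""
    return np.array([(np.abs(bias) > x).mean() for x in x_grid])

# Illustrative use: an unbiased estimator with correct standard errors
rng = np.random.default_rng(3)
J, S = 200, 250
truth = rng.uniform(-0.3, 0.3, size=J)
estimates = truth[:, None] + rng.normal(0, 0.05, size=(J, S))
ses = np.full((J, S), 0.05)

bias, coverage, mse = performance(estimates, ses, truth)
curve = reliability(bias, x_grid=[0.0, 0.01, 0.1])
```

For such a well-behaved estimator, coverage averages near 95% and the reliability curve drops to zero quickly; a biased or under-covering estimator would show a curve that stays elevated at larger x.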
In contrast to standard simulation studies, which often focus on a few data generating mechanisms, this Universal Monte-Carlo Simulation will provide evidence of the behavior of the estimators across many data generating mechanisms and, consequently, will be informative for assessing how the estimators perform across the range of problems that may be encountered in practice.
3.1 Specifying and sampling from the space of probability distributions

Our simulations will be restricted to binary treatments and binary outcomes. Although we do allow for non-binary confounding variables, we restrict them to discrete, ordinal variables. These restrictions ensure computational tractability, but generalizations can in principle be devised for any data structure. Our models for P will be characterized by the following user-given, minimally-informative priors, representing an analyst's prior knowledge of the data generating distribution.
• An integer u representing the number of binary confounding variables.
• An integer h representing the number of non-binary confounding variables, together with an integer c denoting the cardinality of the support of each.Without loss of generality, we assume that each is supported in the set N = {0, 1/(c − 1), 2/(c − 1), . . ., 1}.
• A parameter η that controls the non-linearity of the effect of the non-binary confounding variables in the data generating process (additional detail below).

• A parameter ρ that controls the smoothness of the effect of the non-binary confounding variables in the data generating process (additional detail below).

• An interaction order k (additional detail below).
• A boolean value hte ∈ {TRUE, FALSE} indicating whether there is treatment effect heterogeneity, meaning that the effect of the treatment on the outcome varies by the level of one or more confounding variables.

• A positive number q bounding the treatment probabilities as

max_{t∈{0,1}, x∈X} P(T = t) / P(T = t | X = x) ≤ q,

where X is the support of X.

• A real number b representing the desired confounding bias.

Our sampling scheme proceeds sequentially by first sampling P(X = x), then P(T = 1 | X = x), and finally P(Y = 1 | T = t, X = x). Note that in the above setup, the vector X takes values on the set X = {0, 1}^u × N^h, which has cardinality 2^u × c^h. In the first step, we sample the vector {P(X = x) : x ∈ X} from a uniform Dirichlet distribution over the (2^u × c^h)-dimensional probability simplex. Other distributions could be used that would induce more or less correlation among the covariates; see Dunson and Xing (2009) for distributions over the space of multivariate, categorical data.
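The first step, drawing the covariate distribution, can be sketched as follows. The specific values of u, h, and c are illustrative, and only this Dirichlet step is shown (the later polytope sampling is not):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)

u, h, c = 3, 1, 5                    # binary covariates, ordinal covariates, cardinality
N_vals = np.arange(c) / (c - 1)      # support {0, 1/(c-1), ..., 1}

# Enumerate the support of X = {0,1}^u x N^h (cardinality 2^u * c^h)
support = list(product(*([[0, 1]] * u + [N_vals.tolist()] * h)))

# Uniform Dirichlet over the simplex: one probability per support point
p_x = rng.dirichlet(np.ones(len(support)))
```

A Dirichlet with all concentration parameters equal to one is the uniform distribution on the simplex, so every joint covariate distribution on this support is a priori equally likely.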

Sampling treatment probabilities
Let X = (X_bin, X_num) represent the partitioning into binary and non-binary confounding variables. Treatment probabilities are generated using a linear probability model in which we consider: 1. interactions up to k-th order among X_bin, and 2. interactions of each of the above terms with a non-linear transformation of X_num.
Let L = {l = (l₁, . . ., l_u) : l_j ∈ {0, 1}; Σ_j l_j ≤ k}. For instance, when u = 3 and k = 2, L = {(0,0,0), (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1)}. The treatment mechanism is then of the form

g(x; α) = Σ_{l∈L} {α_{0,l} + α_{1,l} f_l(x_num)} x̄_l,

where x̄_l = Π_{j=1}^u x_{bin,j}^{l_j}. The parameters (α_{0,l}, α_{1,l}, f_l) are sampled as follows. First, each f_l(x_num) is sampled from a Gaussian process with linear mean γᵀx_num and a covariance function whose scale and bandwidth are governed by the non-linearity parameter η and the smoothness parameter ρ, where each coefficient in γ is independently drawn from a standard normal distribution. Then, the vector of coefficients {(α_{0,l}, α_{1,l}) : l ∈ L} is sampled uniformly from a convex polytope defined by the following linear constraints for all x ∈ X:

0 ≤ g(x; α) ≤ 1,   P(T = 1) ≤ q × g(x; α),   P(T = 0) ≤ q × {1 − g(x; α)},

where the first two constraints ensure that g(x; α) is a well-defined probability. The third and fourth constraints prevent near-positivity violations governed by the parameter q. That is, they ensure the conditional probability of each treatment level given each covariate value is not too small relative to the marginal probability of that treatment level. Note that the third and fourth constraints are indeed linear as, for example, the third constraint can be rewritten as Σ_x g(x; α)P(X = x) − q × g(x; α) ≤ 0. The above sampling is performed using the volesti R package (Fisikopoulos et al., 2020). Once a vector {(α_{0,l}, α_{1,l}) : l ∈ L} is sampled, it is checked for whether it can possibly yield the desired confounding bias b, the details of which are given in the SI Appendix. If not, the current draw is rejected.
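To make the positivity constraints concrete, here is a simplified sketch. The paper samples uniformly from the constraint polytope via volesti; the rejection sampler below is an inefficient but transparent stand-in of our own devising, drawing conditional treatment probabilities directly rather than polytope coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)

# Suppose the covariate support and its distribution are given
n_support = 8
p_x = rng.dirichlet(np.ones(n_support))
q = 10.0  # positivity parameter: P(T=t) / P(T=t|X=x) <= q

def draw_propensities():
    """Draw candidate conditional treatment probabilities, then check
    the (linear) positivity constraints for both treatment levels."""
    while True:
        g = rng.uniform(0, 1, size=n_support)  # g(x) = P(T=1 | X=x)
        p_t1 = float(g @ p_x)                  # marginal P(T=1)
        ok1 = np.all(p_t1 <= q * g)            # P(T=1) <= q * g(x)
        ok0 = np.all(1 - p_t1 <= q * (1 - g))  # P(T=0) <= q * (1-g(x))
        if ok1 and ok0:
            return g, p_t1

g, p_t1 = draw_propensities()
```

Rejection sampling scales poorly as the number of constraints grows, which is why uniform sampling from the polytope itself (as with volesti) is the practical choice in the paper's setting.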

Sampling the outcome mechanism
Outcomes are also generated using a linear probability model of the same form as the treatment mechanism, with coefficients {λ_{t,l}, β_{t,l} : t ∈ {0, 1}, l ∈ L} and Gaussian-process functions h_l(x_num) and w_l(x_num) in place of (α_{0,l}, α_{1,l}, f_l), where x̄_l = Π_{j=1}^u x_{bin,j}^{l_j} to simplify notation. If there is treatment effect heterogeneity, the coefficients are allowed to differ by treatment level t; if there is no treatment effect heterogeneity, the treatment enters only through an additive shift common to all covariate levels. The functions h_l(x_num) and w_l(x_num) are drawn from Gaussian processes as before. The confounding bias for a given distribution, denoted C(λ, β) (equation (5)), is linear in the coefficients {λ_{t,l}, β_{t,l} : t ∈ {0, 1}, l ∈ L}. So, for a tolerance tol, we can draw the coefficients from a uniform distribution over the polytope defined by linear constraints requiring that m(t, x; λ, β) be a well-defined probability for all (t, x) and that |C(λ, β) − b| ≤ tol, where C(λ, β) denotes equation (5) with P(Y = 1 | T = t, X = x) replaced by m(t, x; λ, β).

Our application of the Universal Monte-Carlo Simulation approach
We now use the approach described above to summarize the performance of parametric and nonparametric estimators in estimating the average treatment effect, ATE = E{m(1, X) − m(0, X)}.
We define our universe of data generating distributions, M, as follows. In addition to one binary treatment and one binary outcome, we considered five binary covariates (u = 5) and one numerical covariate (h = 1) with cardinality c = 100. We considered interactions between covariates of order k ∈ {1, 2, 3}. We considered distributions both with and without treatment effect heterogeneity (hte ∈ {TRUE, FALSE}), and limited the treatment probabilities to be ≥ 0.001 (q = 1000). The parameter η, controlling the nonlinearity of the numerical confounder, was sampled from a uniform distribution U(0.1, 10); the parameter ρ, controlling the smoothness of this numerical confounder, was sampled from a uniform distribution U(0.1, 10) (see Figure S1 in the appendix for a visualization of the nonlinearity/smoothness); and the parameter b, controlling the confounding bias, was sampled from a uniform distribution U(−0.3, 0.3).
Within this universe of data generating distributions, and for each combination of k and the indicator of treatment effect heterogeneity, we sampled J = 500 distinct DGPs, resulting in 3 × 2 × 500 = 3,000 distributions. For each of these 3,000 distributions, we then sampled S = 250 data sets at each of the sample sizes N ∈ {100, 500, 1000}. In total, we simulated 2,250,000 data sets.
For each data set we then estimated the ATE using L = 5 parametric and nonparametric estimators. 1) First, we considered a parametric substitution estimator (also called a g-computation estimator), as described in Section 2, based on modeling the outcome using logistic regression with only main effect terms included. This estimator is commonly used in practice and may be less sensitive than others to the curse of dimensionality. 2) The other parametric estimator we considered is an inverse probability of treatment weighting (IPTW) estimator with weights determined using the covariate balancing propensity score (CBPS), which optimizes the balance of covariates across the treatment and control groups (Imai and Ratkovic, 2014). Propensity score estimators are popular in applied research, despite their inefficiencies (Robins et al., 2007). The CBPS estimator is one of the best-performing propensity-score-based estimators in finite samples and is robust to mild misspecification of the parametric model (Imai and Ratkovic, 2014). 3) Third, we considered a Bayesian nonparametric estimator: Bayesian additive regression trees (BART; Hill, 2011; Chipman et al., 2010). 4) Fourth, we considered the nonparametric TMLE (Van Der Laan and Rubin, 2006) as described in Section 2, because these nonparametric estimators have previously performed well in many finite sample scenarios and data challenges (Dorie et al., 2019). For data-adaptive estimation in TMLE, we use the Super Learner (van der Laan et al., 2007) with a library of estimators consisting of main-effects generalized linear models, BART (Chipman et al., 2010), the light gradient-boosting machine (Ke et al., 2017), and multivariate adaptive regression splines (Friedman, 1991). 5) Finally, we also considered a cross-fitted TMLE (CV-TMLE), which typically results in better finite sample performance due to avoiding the Donsker class condition required for asymptotic normality (Klaassen, 1987; Zheng and van der Laan, 2011; Chernozhukov et al., 2018).

Results
We evaluated estimator performance in terms of absolute bias, MSE, and 95% CI coverage. Figures 1 and 2 show the reliability function, i.e., the proportion of the J data distributions for which the estimator's absolute bias and MSE, respectively, exceed x, for each value x on the x-axis. Table 1 gives the 95% CI coverage. We stratify results by estimator, by the three sample sizes, and by complexity of the data-generating mechanism. To vary complexity we examine: a) no interactions between variables and no treatment effect heterogeneity, b) treatment effect heterogeneity but no interactions between variables, c) up to 3rd order interactions between variables and no treatment effect heterogeneity, and d) up to 3rd order interactions and treatment effect heterogeneity. We also stratify results by degree of practical violations of the positivity assumption. We see in Figures 1 and 2 that at least one, and often all three, of the nonparametric estimators (BART, TMLE, and CV-TMLE) perform better than both parametric estimators in terms of both bias and MSE. The one exception is for the smallest sample size of N = 100 under the simplest two data-generating mechanisms without interactions between variables, where the parametric g-computation estimator performs similarly to the two TMLEs in terms of bias (Figures 1a and 1b), and BART performs relatively worse. In terms of MSE, however, at least one of the nonparametric estimators performs best across all settings (Figure 2), with BART dominating in all cases when N = 100 and performing at least close to the best in all other scenarios.
The largest separation between the parametric and nonparametric reliability curves in Figures 1 and 2 occurs for sample sizes N ∈ {500, 1000} and in the more complex settings involving variable interactions and possibly treatment effect heterogeneity. For example, in the most complex setting with N = 1000 (Figure 1d), the nonparametric estimators result in absolute bias > 0.1 in 0% of data distributions, whereas the parametric g-computation and IPTW estimators result in absolute bias > 0.1 in 11.1% and 13.3% of data distributions, respectively. This result was anticipated, because in larger sample sizes the asymptotic advantages of nonparametric estimators in handling model complexity and nonlinearities, discussed in Section 2, may take hold.
We also stratify by degree of practical positivity violations (minimal, moderate, severe) in Figures S2 and S3 in the SI Appendix. Although estimator performance degrades slightly in the presence of severe positivity violations, particularly for the IPTW estimator, the relative performance of the estimators remains the same.
In terms of 95% CI coverage (Table 1), we see that all estimators perform well at sample size N = 1000 in the simplest scenarios without interactions, with slight over-coverage by IPTW and BART and slight under-coverage by the non-cross-fitted TMLE. However, as complexity increases through interactions, coverage suffers for the parametric estimators, particularly with increasing sample size; in contrast, BART and CV-TMLE continue to attain at least the nominal coverage rate. The recovery of nominal coverage by CV-TMLE relative to TMLE illustrates the importance of cross-fitting for asymptotic normality when using data-adaptive regression estimators Klaassen (1987); Zheng and van der Laan (2011); Chernozhukov et al. (2018). Interestingly, BART over-covers in every scenario despite generally having the smallest variance across scenarios.

However, simulation studies are limited in that they evaluate estimator performance across only a small number of data-generating mechanisms, which may not be representative of performance in general. Consequently, our proposed approach greatly expands the number of data-generating mechanisms considered, from a few to thousands, which is likely to yield more generalizable, and thus more informative, evidence of estimator performance. We applied our proposed approach to compare the performance of the nonparametric estimators BART, TMLE, and CV-TMLE with that of two parametric estimators, outcome-regression-based g-computation and IPTW, in finite samples of sizes N = 100, 500, and 1000 and across different degrees of model complexity. In doing so, we provide what is, to our knowledge, the first general evidence on nonparametric versus parametric performance across a universe of possible research settings and in finite samples, where asymptotic properties learned from theoretical results may not provide good approximations. We found that even in small samples, the nonparametric estimators nearly always outperformed the parametric estimators in terms of bias and MSE. However, the advantage of nonparametric estimation attenuated with decreasing sample size and decreasing complexity; in the simplest data-generating mechanisms and samples of N = 100, the parametric g-computation estimator performed similarly to the nonparametric TML estimator in terms of bias, and its MSE fell between those of BART and TMLE.
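The performance metrics reported here (bias, MSE, and 95% CI coverage) can each be computed by aggregating point estimates and standard errors across the simulated replicates of a data-generating mechanism. The following is a minimal sketch of that aggregation step; the function name, the Wald-style intervals, and the toy data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def summarize_performance(estimates, std_errors, true_ate, z=1.96):
    """Summarize one estimator's performance across simulation replicates.

    estimates, std_errors: arrays with one entry per simulated data set;
    true_ate: the true average treatment effect under the data-generating
    mechanism. Coverage is computed from Wald-style intervals est +/- z*SE.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - true_ate
    mse = np.mean((estimates - true_ate) ** 2)
    # Fraction of replicates whose 95% CI contains the truth
    covered = (estimates - z * std_errors <= true_ate) & \
              (true_ate <= estimates + z * std_errors)
    return {"bias": bias, "mse": mse, "coverage": np.mean(covered)}

# Toy usage: an unbiased estimator with well-calibrated standard errors
# should show near-zero bias and roughly 95% coverage.
rng = np.random.default_rng(0)
true_ate = 0.10
ests = rng.normal(true_ate, 0.02, size=5000)
ses = np.full(5000, 0.02)
perf = summarize_performance(ests, ses, true_ate)
```

A misspecified estimator would instead show bias of the same order as the standard error, and its coverage would fall below the nominal 95% level, which is the pattern the parametric estimators exhibit in the interaction scenarios.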
Even though our results were derived from 2,250,000 data sets across 3,000 data-generating mechanisms, the space of data-generating mechanisms we considered was nonetheless limited. In particular, we considered only settings with a binary treatment, a binary outcome, and six confounding variables, only one of which was multi-valued. It is certainly possible that our conclusions would differ in more complex settings. In future work, we will develop a software tool for running simulations with user-specified outcome and covariate types and covariate dimensions. This way, users could compare estimator performance over a space of data-generating mechanisms that is likely to contain the probability distribution corresponding to their real-world data sampling setting, or a close approximation thereof.
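The core mechanic of such a tool, drawing many data-generating mechanisms from a prior and simulating from each, can be sketched as follows. This is a simplified illustration under assumed priors (independent normal coefficients, logistic treatment and outcome models, covariates matching the setting above); the function names and priors are hypothetical and not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_dgp(n_bin=5, coef_scale=1.0):
    """Draw one data-generating mechanism: random coefficients for
    logistic treatment and outcome models (priors are illustrative)."""
    p = n_bin + 1  # five binary covariates plus one numeric covariate
    return {
        "beta_treat": rng.normal(0, coef_scale, size=p + 1),  # propensity model
        "beta_out": rng.normal(0, coef_scale, size=p + 2),    # outcome model (incl. treatment)
    }

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(dgp, n, n_bin=5):
    """Generate one data set (covariates W, treatment A, outcome Y) of size n."""
    W = np.column_stack([rng.binomial(1, 0.5, size=(n, n_bin)),
                         rng.normal(size=(n, 1))])
    A = rng.binomial(1, expit(dgp["beta_treat"][0] + W @ dgp["beta_treat"][1:]))
    lin = dgp["beta_out"][0] + dgp["beta_out"][1] * A + W @ dgp["beta_out"][2:]
    Y = rng.binomial(1, expit(lin))
    return W, A, Y

def true_ate(dgp, n_mc=200_000, n_bin=5):
    """Approximate the mechanism's true ATE by Monte Carlo over the covariate law."""
    W = np.column_stack([rng.binomial(1, 0.5, size=(n_mc, n_bin)),
                         rng.normal(size=(n_mc, 1))])
    p1 = expit(dgp["beta_out"][0] + dgp["beta_out"][1] + W @ dgp["beta_out"][2:])
    p0 = expit(dgp["beta_out"][0] + W @ dgp["beta_out"][2:])
    return float(np.mean(p1 - p0))
```

Repeating `draw_dgp` thousands of times, simulating many data sets per mechanism, and applying each candidate estimator yields the distributions of bias, MSE, and coverage summarized in the figures and Table 1.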
Lastly, although we have shown that the choice of nonparametric vs. parametric estimator can matter, in some cases more than others, all estimators are limited by the data they are given. Unmeasured variables and variables measured with error are significant and near-ubiquitous limitations that can thwart accurate causal effect estimation and inference. A relatively recent high-profile, high-stakes example involved data errors that led to inaccurate algorithmic predictions in the criminal justice system, resulting in unintended parole denials Rudin and Carlson (2019); Wexler (2017). We can work to improve the estimation step of answering research questions, but the accuracy of our answers will be limited by the weakest link, highlighting the importance of theory, subject-matter knowledge, identification, and data quality, in addition to estimation.
In this article, we have focused on strengthening the estimation link. Our results show that, in the large space of settings we considered, this can be accomplished by using nonparametric estimators grounded in asymptotic theory, which substantially reduce bias in large-sample settings with interactions and nonlinearities while compromising very little in performance even in simple, small-sample settings.

Figure 1: Reliability function for the absolute bias of each of the five estimators (IPTW, g-computation, BART, TMLE, and CV-TMLE) when considering data-generating mechanisms characterized by 1 binary treatment, 1 binary outcome, 5 binary covariates, 1 numeric covariate, and various levels of model complexity.
(a) No treatment effect heterogeneity, no interactions (b) Treatment effect heterogeneity, no interactions (c) No treatment effect heterogeneity, up to 3-way interactions (d) Treatment effect heterogeneity, up to 3-way interactions
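The reliability functions plotted in Figures 1 and 2 can be read as empirical survival curves over the universe of data-generating mechanisms: for each threshold t, the fraction of mechanisms in which an estimator's absolute bias (or MSE) exceeds t. A minimal sketch of this computation follows; the function name, the survival-function reading of "reliability function", and the toy bias distributions are assumptions for illustration.

```python
import numpy as np

def reliability_curve(abs_values, grid):
    """Empirical reliability (survival) function: for each threshold t in
    grid, the fraction of data-generating mechanisms whose absolute bias
    (or MSE) exceeds t."""
    abs_values = np.asarray(abs_values, dtype=float)
    return np.array([np.mean(abs_values > t) for t in grid])

# Toy usage: absolute biases for two hypothetical estimators across 3,000
# mechanisms. The "parametric" estimator carries misspecification bias,
# so its curve sits above the "nonparametric" one at every threshold.
rng = np.random.default_rng(2)
bias_np = np.abs(rng.normal(0.0, 0.01, size=3000))    # nonparametric-like
bias_par = np.abs(rng.normal(0.02, 0.01, size=3000))  # misspecified parametric
grid = np.linspace(0.0, 0.05, 51)
curve_np = reliability_curve(bias_np, grid)
curve_par = reliability_curve(bias_par, grid)
```

Curves closer to the lower-left corner indicate better performance: a smaller share of mechanisms with large absolute bias or MSE.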

Figure 2: Reliability function for the MSE of each of the five estimators (IPTW, g-computation, BART, TMLE, and CV-TMLE) when considering data-generating mechanisms characterized by 1 binary treatment, 1 binary outcome, 5 binary covariates, 1 numeric covariate, and various levels of model complexity.
(a) No treatment effect heterogeneity, no interactions (b) Treatment effect heterogeneity, no interactions (c) No treatment effect heterogeneity, up to 3-way interactions (d) Treatment effect heterogeneity, up to 3-way interactions

Figure S1: Relationship between the probability of treatment conditional on the numeric covariate and the values of the numeric covariate, for different values of η and ρ.

Figure S2: Reliability function for the absolute bias of each of the four estimators (CBPS, g-computation, BART, and TMLE) by degree of positivity violations when considering data-generating mechanisms characterized by 1 binary treatment, 1 binary outcome, 5 binary covariates, 1 numeric covariate, and various levels of model complexity.
(a) No treatment effect heterogeneity, no interactions (b) Treatment effect heterogeneity, no interactions (c) No treatment effect heterogeneity, up to 3-way interactions (d) Treatment effect heterogeneity, up to 3-way interactions

Figure S3: Reliability function for the MSE of each of the four estimators (IPTW, g-computation, BART, and TMLE) by degree of positivity violations when considering data-generating mechanisms characterized by 1 binary treatment, 1 binary outcome, 5 binary covariates, 1 numeric covariate, and various levels of model complexity.
(a) No treatment effect heterogeneity, no interactions (b) Treatment effect heterogeneity, no interactions (c) No treatment effect heterogeneity, up to 3-way interactions (d) Treatment effect heterogeneity, up to 3-way interactions

A similar criticism regarding lack of generalizability can apply to the few previous real-data analyses comparing nonparametric versus parametric estimators, which sometimes find meaningful differences and other times do not (Busso et al., 2014; Advani et al., 2019; Huber et al., 2013).

Table 1: Median (IQR) values of the 95% confidence interval coverage distributions across all estimators and sample sizes. Int = interaction; HTE = treatment effect heterogeneity.