Nonparametric Estimation of Conditional Incremental Effects

Conditional effect estimation has great scientific and policy importance because interventions may impact subjects differently depending on their characteristics. Most research has focused on estimating the conditional average treatment effect (CATE). However, identification of the CATE requires that all subjects have a non-zero probability of receiving treatment (the positivity assumption), which may be unrealistic in practice. Instead, we propose conditional effects based on incremental propensity score interventions: stochastic interventions under which the odds of treatment are multiplied by some factor. These effects do not require positivity for identification and can be better suited to modeling scenarios in which people cannot be forced into treatment. We develop a projection estimator and a flexible nonparametric estimator that can each estimate all the conditional effects we propose, and we derive model-agnostic error guarantees showing that both estimators satisfy a form of double robustness. Further, we propose a summary of treatment effect heterogeneity, and a test for any effect heterogeneity, based on the variance of a conditional derivative effect, and derive a nonparametric estimator that also satisfies a form of double robustness. Finally, we demonstrate our estimators by analyzing the effect of intensive care unit admission on mortality using a dataset from the (SPOT)light study.


Introduction
Estimating causal effects has great scientific and policy importance, and often there is interest in understanding whether the effectiveness of a treatment depends on the subjects' characteristics. Conditional, or "heterogeneous," effects describe how a treatment effect varies with the subjects' characteristics, and can illustrate qualitatively important phenomena that would be disguised by average effects. Previous work has focused on estimating the conditional average treatment effect (CATE), which considers the difference between counterfactual mean outcomes when all subjects at some covariate level receive treatment and when all subjects receive control (e.g., [1][2][3][4][5][6][7], among others). However, in many contexts, researchers cannot force subjects to receive treatment or prevent them from receiving it, making the counterfactual interventions behind the CATE unrealistic in practice. As a concrete example, we will consider the effect of intensive care unit (ICU) admission on mortality for emergency room entrants [8]. Typically, the counterfactual interventions where everyone is admitted to the ICU and where no one is admitted are both practically infeasible, because there are a finite number of ICU beds and because hospitals have a duty of care towards sick patients. Instead, we may be interested in assessing the causal effect of an intervention that could more realistically be implemented in practice, such as one that moderately increases or decreases the probability of admission to the ICU. For example, increasing or decreasing the number of ICU beds would likely increase or decrease the probability of admission for all patients. Generally, these interventions can best be described with stochastic interventions, which characterize counterfactual outcomes under a shift in the treatment distribution [9][10][11][12][13][14][15]. With a binary treatment, this shift can be characterized by an incremental propensity score intervention ("incremental intervention"), which multiplies the odds of treatment by a user-specified factor δ [11,16].
Recent research on stochastic interventions generally, and incremental interventions specifically, has focused on average effects [9,11,17]. In this study, we consider estimating conditional incremental effects (CIEs), where we assess to what extent an incremental effect depends on the subjects' characteristics, which can uncover treatment effect heterogeneity that is obscured by average effects. Furthermore, besides corresponding to more realistic interventions, there are two additional advantages to considering CIEs instead of the CATE. First, incremental effects are robust to positivity violations, in the following sense. When positivity is deterministically violated, such that subjects have zero or one probability of receiving treatment, incremental effects can still be identified. Moreover, even when positivity is not strictly violated but estimated propensity scores are nonetheless close to zero or one, confidence intervals for incremental effects will not be influenced by these extreme propensity scores. By contrast, if positivity is deterministically violated, the CATE may not be identifiable. And, if positivity is nearly violated, without strong parametric modeling assumptions it can be difficult to estimate the average treatment effect (ATE) or the CATE, in the sense that variance estimates are large and confidence intervals are wide [18].
A second advantage of using CIEs instead of the CATE is the ability to describe a continuum of policies between treating all subjects and treating none, where the interventions behind the CATE are special cases at each end of the continuum. A researcher might presume that stochastic effects follow a roughly linear relationship from one end of the continuum to the other, with the slope of the line matching the sign of the CATE. As discussed in Remark 2 in Section 2, this assumption is reasonable when conditioning on all the covariates, as the CIE curve must be monotonic in the incremental parameter δ and so its slope will match the sign of the CATE. However, most analyses, including our ICU data analysis in Section 5, condition on only a few covariates of interest, allowing for the possibility of other incremental effect curves. For example, consider Figure 1, a preview of the real data analysis in Section 5, which shows CIE curves for several Intensive Care National Audit & Research Centre (ICNARC) scores (a measure of mortality risk). The x-axis represents the incremental intervention parameter, where δ = 1 corresponds to no intervention while δ > 1 and δ < 1 correspond to increasing and decreasing the likelihood, respectively, that patients are admitted to the ICU; the y-axis shows the estimated mortality rate. The curves illustrate that the estimated counterfactual mortality rate is higher at δ = 5 than at δ = 0.2, in agreement with prior research indicating that admitting everyone to the ICU is harmful compared to admitting no one [8]. However, the full curves also illustrate that mild interventions correspond to small changes in mortality rate. Thus, stochastic effects can be evaluated over a continuum of interventions, which may reveal additional information that would otherwise be hidden by only studying the always-treated vs never-treated counterfactual effects. This echoes previous findings that it can be useful to target additional effects beyond these two extreme counterfactuals [19].

[Figure 1: Estimated CIE curves for several ICNARC scores, which measure mortality risk. δ = 1 corresponds to no intervention; δ > 1 and δ < 1 correspond to increasing and decreasing the likelihood of admission to the ICU, respectively. The y-axis shows the estimated mortality rate. For each ICNARC score, the mortality rate decreases when fewer (δ < 1) people are sent to the ICU and increases when more (δ > 1) people are sent to the ICU, but mild changes in ICU admission rates lead to minimal changes in mortality rate.]

Contribution and structure
Motivated by these observations, in this work we describe how to estimate conditional causal effects for incremental interventions and illustrate how these effects facilitate a more nuanced understanding of treatment effect heterogeneity than the usual CATE. We focus on incremental interventions for two reasons. First, the incremental intervention has an intuitive parameterization for a binary treatment, since it corresponds to multiplying the odds of treatment by some factor. Second, the intervention demonstrates favorable properties because it is anchored at the observed treatment distribution and considers a smooth shift from that distribution. For example, identifying effects with this intervention does not require the positivity assumption that the probability of treatment is bounded away from zero and one for all subjects, which is required for identifying the CATE. This allows estimation of CIEs to remain precise even in the face of positivity violations, unlike estimation of the CATE.
We consider three conditional effects in this work. First, we describe the CIE, which is the conditional analog to the average incremental effect. As shown in Figure 1, the CIE is described by a curve for each covariate value; this makes quantifying treatment effect heterogeneity challenging, because we have to consider how much these curves vary across covariate values. As a preliminary extension of the CIE, we describe the conditional incremental contrast effect (CICE), which considers a contrast between two incremental interventions and is the incremental analog to the CATE. The CICE can enable a better understanding of treatment effect heterogeneity than the CIE, but it requires specifying two incremental δ parameters, and it may not immediately be clear which parameter values would be of most interest in a particular application. Therefore, we propose the conditional incremental derivative effect (CIDE), which corresponds to the change in the CIE under an infinitesimal shift of the treatment distribution. We find that the CIDE is particularly useful for quantifying treatment effect heterogeneity for incremental interventions because it allows the researcher to examine the spectrum of interventions as in Figure 1 and also to construct estimators and tests that quantify treatment effect heterogeneity, as discussed in Section 4.
For the three conditional effects, we propose two estimators. Our first estimator, the Projection-Learner, estimates the projection of the true conditional effect onto a finite dimensional model. This added structure allows us to re-frame the estimator as the solution to a moment condition and derive an efficient influence function. Utilizing the properties of efficient influence functions, we provide double robust style error guarantees for the Projection-Learner and show that its bias scales as a product of errors of the nuisance function estimators (in this work, the nuisance functions are the propensity score and the outcome regression, defined in Section 2). As a result, the Projection-Learner can achieve parametric efficiency even when the nuisance functions are estimated nonparametrically. Our second conditional effect estimator, the I-DR-Learner, is a two-stage meta-learner that extends the DR-Learner studied by Kennedy [3] to incremental effects. For the I-DR-Learner, the first stage estimates the efficient influence function values for the relevant average effect, and the second stage regresses those values against the conditioning covariates. We establish when the I-DR-Learner exhibits double robust style guarantees; in particular, the conditional effect must lie in a certain infinite dimensional function class, and the second stage regression must satisfy a form of stability. In this case, we demonstrate that the I-DR-Learner can attain oracle efficiency when the nuisance functions are estimated nonparametrically. Therefore, the I-DR-Learner cannot obtain parametric efficiency like the Projection-Learner, but it can estimate a larger class of true conditional effect curves with oracle efficiency.
Both the Projection-Learner and the I-DR-Learner can be used to estimate conditional effect curves across variables of interest. A natural question is whether there is any treatment effect heterogeneity across the curve. Thus, researchers may also be interested in a one-dimensional summary of effect heterogeneity and a corresponding test for any effect heterogeneity. Therefore, in addition to the CIE, CICE, and CIDE curves, we also propose a fourth effect, the variance of the conditional incremental derivative effect (V-CIDE), which can be used to estimate the degree of effect heterogeneity and test for any effect heterogeneity. For the V-CIDE, we derive a novel double robust style estimator based on its efficient influence function, illustrate that our estimator attains parametric efficiency under weak conditions on the nuisance function estimators, and derive a corresponding test for any effect heterogeneity.
The structure of the work is as follows. In Section 1.2, we define relevant notation. In Section 2, we define the data setup and the different estimands of interest, state the causal assumptions required for identification, and establish identification results for our conditional effects. In Section 3, we outline the Projection-Learner and I-DR-Learner and demonstrate their convergence properties in Sections 3.1 and 3.2, respectively. In Section 4, we outline a nonparametric estimator for the V-CIDE, demonstrate its convergence properties, and describe methods for inference. In Section 5, we analyze data on ICU admission from the (SPOT)light prospective cohort study. We estimate that, while greatly increasing subjects' odds of attending the ICU would increase mortality rates, mild to moderate changes in ICU admission rates lead to minimal changes in mortality rates. While this agrees with what would be concluded from CATE estimation, CATE estimation would not be reliable for this application, because there are positivity violations in this dataset. Using our test, we do not find evidence of treatment effect heterogeneity. Finally, in Section 6 we conclude and discuss future extensions of this research.
All of our methods can be implemented using the npcausal package in R [20,21], as demonstrated in the replication materials at https://github.com/alecmcclean/NPCIE.

Notation
We use 𝔼 for expectation and 𝕍 for variance. We use ℙₙ{f(Z)} = (1/n) Σᵢ f(Zᵢ) as a shorthand for sample averages, let ‖·‖² denote the squared Euclidean norm, and for generic possibly random functions f, we let ‖f‖² = ∫ f(z)² dℙ(z) denote the squared L₂(ℙ) norm. We use the notation a ≲ b to mean a ≤ Cb for some constant C, and a ≍ b to mean cb ≤ a ≤ Cb for some constants c and C, so that a ≲ b and b ≲ a. We use ⇝ to denote convergence in distribution and →ₚ for convergence in probability. We use Êₙ to denote the predicted regression function estimate from n samples (e.g., if we considered a regression of Y against X, then Êₙ(Y | X = x) denotes its estimate at X = x).


Estimands and identification results for CIEs

In this section, we describe estimands for incremental effects and establish assumptions for identifying these effects. Assume we observe an i.i.d. sample of Z = (X, A, Y), where X ∈ ℝᵈ are the covariates, A ∈ {0, 1} is the treatment status, and Y ∈ ℝ is an outcome. We define the potential outcome Yᵃ as the outcome that would be observed under treatment A = a.
Much of the causal inference literature has focused on estimating the ATE and the CATE, defined as 𝔼(Y¹ − Y⁰) and 𝔼(Y¹ − Y⁰ | V = v), respectively. To identify the ATE and the CATE, the following three causal assumptions are commonly used and are sufficient:

Assumption 1 (Consistency): Y = Yᴬ.

Assumption 2 (Exchangeability): A ⊥⊥ Yᵃ | X for a ∈ {0, 1}.

Assumption 3 (Positivity): 0 < ℙ(A = 1 | X = x) < 1 for all x.

Consistency says that if an individual takes treatment a, we observe the potential outcome under that treatment regime. By contrast, consistency would be violated if, for example, there was interference between subjects, such that one subject's treatment status affected another's outcome. Exchangeability says that treatment is effectively randomized within covariate strata, in the sense that treatment is independent of subjects' potential outcomes after conditioning on covariates. Positivity says that all subjects have a non-zero chance of receiving treatment or control; this may be unrealistic in practice. Although positivity is required to identify the ATE and the CATE, as we show next, only Assumptions 1 and 2 are required to identify CIEs.

Remark 1. Ideally, estimability should be balanced against scientific relevance when choosing a causal estimand. While in this work we focus on incremental effects and highlight their benefits, including robustness to positivity violations and the ability to describe a spectrum of interventions, in many settings the scientifically relevant effect may be the ATE or the CATE. In that scenario, even though the incremental effect is estimable, it may not describe the causal effect of interest.

Incremental propensity score interventions
The incremental intervention corresponds to multiplying each individual's odds of treatment by a user-specified parameter δ. We define the propensity score, the probability that an individual receives treatment, as π(X) = ℙ(A = 1 | X), and then the shifted propensity score under an incremental intervention is defined as

q(X; δ, π) = δπ(X) / {δπ(X) + 1 − π(X)}.

Then, the average incremental effect is 𝔼(Y^{Q(δ)}), where Q(δ) is drawn from a Bernoulli distribution with parameter q(X; δ, π). Unlike deterministic interventions, the incremental intervention is stochastic because it does not deterministically assign subjects to treatment or control; rather, it shifts their propensity score. The incremental intervention also corresponds to multiplying the odds of treatment by δ, since q(X; δ, π) / {1 − q(X; δ, π)} = δπ(X) / {1 − π(X)}. In the ICU application previously discussed, the average incremental effect is the counterfactual 28-day mortality rate across emergency room patients if their odds of admission to the ICU were multiplied by δ. Incremental interventions were first proposed by Kennedy [11], with double robust style estimators for average (possibly time-varying) incremental effects. The analysis of average effects has been extended to censored data [22] and resource-constrained settings [23], and used for estimating the effect of aspirin on the incidence of pregnancy [24]; a review is provided by Bonvini et al. [16].
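As a quick illustration (ours, not from the paper's replication code; function names are hypothetical), the shifted propensity score can be computed directly, and the odds-multiplication property checked numerically:

```python
def shifted_propensity(pi, delta):
    """Shifted propensity score q(x; delta, pi): the treatment
    probability after multiplying the odds of treatment by delta."""
    return delta * pi / (delta * pi + 1 - pi)

def odds(p):
    """Odds implied by a probability p."""
    return p / (1 - p)

# Doubling the odds of treatment (delta = 2) for a subject with
# propensity score 0.25:
q = shifted_propensity(0.25, 2.0)  # 0.5 / (0.5 + 0.75) = 0.4
assert abs(q - 0.4) < 1e-12

# The shifted odds are exactly delta times the original odds:
assert abs(odds(q) - 2.0 * odds(0.25)) < 1e-12
```

Note that the intervention is anchored at the observed propensity score: subjects with π(X) = 0 or π(X) = 1 are left unchanged for any δ, which is why positivity is not needed.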
While this intervention is not prescriptive, since it is unlikely a hospital would specifically admit patients to the ICU based on draws from a Bernoulli distribution, it can be useful for describing interventions that might be implemented in practice. For example, if ICU capacity increased by some number of beds, it is plausible that every patient's odds of ICU admission might increase by a factor δ. Such an intervention cannot be described by the CATE. Meanwhile, even though such an intervention would likely not correspond to Bernoulli draws, an incremental intervention with δ > 1 could nonetheless appropriately describe this counterfactual question. Furthermore, a spectrum of δ could appropriately describe the range of admission criteria changes that a hospital may implement.
The incremental intervention is also dynamic in the sense that the intervention changes with X if π(X) changes with X. This occurs because the intervention is constant on the odds ratio scale rather than the unit scale. For example, if δ = 2 and the propensity scores for two covariate values are {π(x₁), π(x₂)} = {0.25, 0.5}, then the intervention propensity scores are {q(x₁; 2, π), q(x₂; 2, π)} = {0.4, 0.66}. However, although the interventions are dynamic in the sense just outlined, they are not dynamic in the sense that the user-specified parameter δ changes with X. We leave this as an avenue for future exploration. If δ were allowed to vary with X, a natural question then might be: what is the "optimal" choice of δ at a particular value X = x? As in the deterministic intervention literature, finding an optimal intervention could fruitfully build on the conditional effect estimators proposed in this work [25,26].
Other interventions have also been considered in the literature, such as modified treatment policies, which shift a continuous treatment by a specified amount [10,14,27]; stochastic interventions that shift a continuous treatment distribution [13]; dynamic interventions that depend on some time-varying information about subjects [14,28]; and stochastic interventions that shift a discrete but possibly multi-valued treatment distribution [29], including exponential tilts [9]. The incremental effect can also be interpreted as an exponential tilt. Wen et al. [17] recently proposed an intervention similar to the incremental intervention, but theirs is parameterized as a shift on the risk ratio scale, q(X; δ, π) = δπ(X), rather than the odds ratio scale.

CIEs
Now we will consider CIEs. We denote V ⊆ X as either one covariate or a set of covariates, and define the CIE as the counterfactual mean under an incremental intervention conditional on covariates V:

τ_cie(v; δ) = 𝔼(Y^{Q(δ)} | V = v).
In the ICU application previously discussed, τ_cie(v; δ) is the counterfactual average 28-day mortality rate among emergency room patients with covariates v if their odds of ICU admission were multiplied by δ. Although effects often refer to contrasts in the causal inference literature, for simplicity we use the general term "effect" to refer to a causal estimand involving potential outcomes. Therefore, we refer to the CIE as an effect. When the effect of interest is a contrast, we explicitly describe it as a contrast, as in the CICE.
The following proposition establishes that the CIE is identifiable as a function of the observed data distribution.

Proposition 1. Let Q(δ) denote the incremental intervention defined in equation (4). Under Assumptions 1-2, the mean counterfactual outcome given covariates V = v is identified by

τ_cie(v; δ) = 𝔼[ q(X; δ, π) μ(1, X) + {1 − q(X; δ, π)} μ(0, X) | V = v ].    (6)
All proofs are given in the Appendix. Proposition 1 is a straightforward corollary of Corollary 1 in Kennedy [11], which shows that the CIE is identified by a linear combination of the regression functions μ(a, X) = 𝔼(Y | A = a, X), where the weights depend on the probabilities of receiving treatment and control under the incremental intervention.
The CIE does not consider a contrast between two interventions, and so it does not immediately describe treatment effect heterogeneity. In this sense, it is similar to the conditional counterfactual mean under treatment, 𝔼(Y¹ | V = v). As a first approach to understanding treatment effect heterogeneity, we define a second estimand, the CICE, which considers the difference between two incremental effects:

τ_cice(v; δ_u, δ_l) = 𝔼(Y^{Q(δ_u)} − Y^{Q(δ_l)} | V = v).
The CICE is the difference, conditional at V = v, between the average outcomes if we multiply the odds of treatment by δ_u and if we multiply the odds of treatment by δ_l. In the ICU application previously discussed, τ_cice(v; δ_u, δ_l) is the difference in counterfactual average 28-day mortality among emergency room patients with covariates v if their odds of ICU admission were multiplied by δ_u vs if they were multiplied by δ_l. Like the CIE, the CICE is a descriptive rather than prescriptive estimand; it is unlikely that a hospital would consider interventions where patients are admitted to the ICU by draws from a Bernoulli distribution, where the odds of admission depend on the patient's natural probability of admission multiplied by the incremental parameter. Nonetheless, the CICE may describe interventions that can be implemented in practice (e.g., making it more or less likely that emergency room entrants are admitted to the ICU), and thus we can understand treatment effect heterogeneity by looking at how the CICE changes with V. The CICE is readily comparable to the CATE, since both consider contrasts between two interventions. In fact, if positivity (Assumption 3) is satisfied, then the CICE approaches the CATE as δ_u → ∞ and δ_l → 0.

Derivative effects
A limitation of the CICE is that it requires specifying two parameters, δ_u and δ_l, and it may not immediately be clear which parameter values would be of most interest in a particular application. Instead, we can consider a derivative effect, which describes the change in counterfactual outcomes under an infinitesimally small change in the treatment distribution. To ease exposition, we re-parameterize the average incremental effect with t instead of δ, and define the average derivative effect with respect to t, evaluated at δ, as

∂/∂t 𝔼(Y^{Q(t)}) |_{t=δ},

and the associated CIDE as

τ_cide(v; δ) = ∂/∂t 𝔼(Y^{Q(t)} | V = v) |_{t=δ}.

In the ICU application previously discussed, τ_cide(v; δ) is the change in counterfactual average 28-day mortality among emergency room patients with covariates v if their odds of ICU admission were infinitesimally increased from δ. The CIDE demonstrates treatment effect heterogeneity if it varies across v. Thus, it can illustrate effect heterogeneity across a continuum of policies if it is evaluated at several values of δ. For example, as we investigate in Section 5, the CIDE can illustrate whether the effect of ICU admission on mortality varies across ICNARC scores.
If τ_cie(x; δ) is continuous in x and δ with absolutely integrable partial derivative with respect to δ, allowing differentiation under the integral, the CIDE is identified according to the following result.

Proposition 2. Let Q(δ) denote the incremental intervention defined in equation (4). Under Assumptions 1 and 2, and if τ_cie(x; δ) is continuous in x and δ and its partial derivative with respect to δ is absolutely integrable, then the CIDE is identified by

τ_cide(v; δ) = 𝔼[ ω(X; δ) {μ(1, X) − μ(0, X)} | V = v ],    (9)

where

ω(x; δ) = π(x){1 − π(x)} / {δπ(x) + 1 − π(x)}².    (10)

Proposition 2 shows that the CIDE is a weighted average of the difference in mean outcomes under treatment and control, where the weights depend on the propensity scores and the incremental propensity scores.
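Since ω(x; δ) arises by differentiating the shifted propensity score q(x; δ, π) with respect to δ, it can be sanity-checked against a numerical derivative. This is an illustrative sketch (ours), not the paper's code:

```python
def shifted_propensity(pi, delta):
    # q(x; delta, pi) = delta*pi / (delta*pi + 1 - pi)
    return delta * pi / (delta * pi + 1 - pi)

def omega(pi, delta):
    # omega(x; delta) = pi(1 - pi) / (delta*pi + 1 - pi)^2
    return pi * (1 - pi) / (delta * pi + 1 - pi) ** 2

# omega equals d/d_delta of q(x; delta, pi), here verified with a
# central finite difference at pi = 0.3, delta = 1.5:
pi, delta, h = 0.3, 1.5, 1e-6
num_deriv = (shifted_propensity(pi, delta + h)
             - shifted_propensity(pi, delta - h)) / (2 * h)
assert abs(num_deriv - omega(pi, delta)) < 1e-8
```

At δ = 1 (no intervention), ω(x; 1) reduces to π(x){1 − π(x)}, so the derivative effect up-weights subjects whose treatment status is most uncertain.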
Remark 2. When conditioning on all the covariates (i.e., V = X), the right-hand side of the identification result in equation (9) is monotonic in δ. This is because ω(x; δ) is always non-negative, while μ(1, x) − μ(0, x) does not change with δ. However, when V ⊂ X, the right-hand side of the identification result in equation (9) need not be monotonic across δ; if it is not monotonic, this indicates that μ(1, x) − μ(0, x) changes sign as x varies, holding v fixed.
We also propose a one-dimensional functional to assess treatment effect heterogeneity. We consider the variance of the CIDE (the V-CIDE), defined as

𝕍{τ_cide(V; δ)}.

When this variance equals zero, the CIDE is constant over V, and thus there is no treatment effect heterogeneity. As before, the V-CIDE depends on δ, so it can be estimated over a grid of δ to evaluate treatment effect heterogeneity over a continuum of policies. By Proposition 2, the V-CIDE is identified by

𝕍( 𝔼[ ω(X; δ){μ(1, X) − μ(0, X)} | V ] ),

and when V = X this simplifies to

𝕍( ω(X; δ){μ(1, X) − μ(0, X)} ).

In Section 4, we will derive an efficient estimator for the V-CIDE and propose a test for whether there is any effect heterogeneity at all. In Section 3, efficient estimators for the CIE, the CICE, and the CIDE are derived.
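As a toy illustration (ours) of the V = X simplification: with known nuisance functions, the V-CIDE is the variance of ω(X; δ){μ(1, X) − μ(0, X)} across the covariate distribution, which is zero exactly when that product is constant. The data-generating choices below are hypothetical:

```python
import random

def omega(pi, delta):
    # omega(x; delta) = pi(1 - pi) / (delta*pi + 1 - pi)^2
    return pi * (1 - pi) / (delta * pi + 1 - pi) ** 2

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
X = [random.uniform(0, 1) for _ in range(10000)]
delta = 1.0

# Heterogeneous effect: mu(1,x) - mu(0,x) = x with a constant
# propensity score 0.5, so the CIDE varies with x and V-CIDE > 0.
hetero = [omega(0.5, delta) * x for x in X]
assert variance(hetero) > 0.001

# Homogeneous effect with a constant propensity score: the CIDE is
# constant in x, so the plug-in V-CIDE is exactly zero.
homo = [omega(0.5, delta) * 1.0 for x in X]
assert variance(homo) == 0.0
```

Note that even a constant outcome effect μ(1, x) − μ(0, x) can yield a nonzero V-CIDE when π(x) varies, since ω(x; δ) then varies with x.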

Estimating conditional incremental effects
The identification results in Propositions 1 and 2 suggest straightforward "plug-in" estimators for the conditional effects. Given estimates of π(X), μ(1, X), and μ(0, X), an estimator can be constructed by plugging these estimates into the identification formulae in equations (6) and (9). If models for the nuisance functions are parametric and correctly specified, this approach can be optimal, as the plug-in estimator will converge to a normal distribution at an n^{−1/2} rate. However, if the parametric models are misspecified, then the plug-in estimator will be biased [30]. Given this, it is tempting to use flexible nonparametric models to estimate the nuisance functions, in order to alleviate issues of model misspecification. However, in this case, the plug-in estimator will typically inherit the slow rate of convergence of the nonparametric models.
This motivates estimators based on semiparametric efficiency theory [31][32][33][34][35]. The first-order bias of the nonparametric plug-in can be characterized by the efficient influence function of the estimand, which can be thought of as the first derivative in a von Mises expansion of the estimand [36]. Thus, a natural approach is to estimate the efficient influence function and subtract this estimate from the nonparametric plug-in estimate in order to "de-bias" the plug-in. A benefit of estimators based on the efficient influence function is that their bias is a second-order product of errors of the nuisance function estimators, such that the estimator can achieve n^{−1/2} efficiency even when the nuisance functions are estimated at slower nonparametric rates [33,37,38]. We consider two estimators that utilize efficient influence functions; as a result, they both exhibit double robust style error guarantees.
Our first estimator, the Projection-Learner, targets the projection of the true conditional effect onto a finite dimensional working model. Projection approaches have a long history in statistics [39][40][41][42][43] and causal inference [6,44-46], and our approach is closely related to working marginal structural models [47] and assumption-lean inference [48], in the sense that they also leverage finite-dimensional models without invoking parametric assumptions. This added structure allows us to re-frame the estimator as the solution to a moment condition and derive an efficient influence function. We show that the Projection-Learner exhibits a version of double robustness, and attains parametric efficiency under weak model-agnostic n^{−1/4} conditions on the nuisance function estimators, which are achievable for nonparametric estimators under suitable smoothness or sparsity.
Our second estimator, the I-DR-Learner (inspired by the "DR-Learner" in Kennedy [3]), instead targets the true conditional effect. The I-DR-Learner is an estimation procedure that, like many recent CATE estimation approaches, tries to estimate the true conditional effect as flexibly as possible [1-5,7,49,50]. Without any further assumptions, no efficient influence function exists for the true conditional effect because it is not pathwise differentiable [51]. So, it is not possible to construct an estimator directly from an efficient influence function for the conditional effect. Instead, the I-DR-Learner is a two-stage meta-learner, which estimates the efficient influence function values for the relevant average effect (e.g., the average incremental effect for the CIE) in the first stage, and then regresses these values against the conditioning covariates in the second stage. We show that if the second stage regression satisfies a generalization of the classic stochastic equicontinuity-type condition, the I-DR-Learner exhibits a form of double robustness and achieves oracle efficiency under weak model-agnostic conditions (n^{−1/4} or slower convergence rates) on the nuisance function estimators.
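The two-stage structure can be sketched generically. This is a simplified illustration (ours) in which the first-stage pseudo-outcomes (estimated influence-function values) are taken as given and the second stage is a simple linear regression, whereas the I-DR-Learner allows any sufficiently stable regression estimator in the second stage:

```python
def second_stage_ols(v, pseudo):
    """Stage 2 of a DR-Learner-style procedure: regress first-stage
    pseudo-outcomes on the conditioning covariate v by least squares,
    returning (intercept, slope)."""
    n = len(v)
    vbar = sum(v) / n
    pbar = sum(pseudo) / n
    sxx = sum((vi - vbar) ** 2 for vi in v)
    sxy = sum((vi - vbar) * (yi - pbar) for vi, yi in zip(v, pseudo))
    slope = sxy / sxx
    return pbar - slope * vbar, slope

# Oracle-style check: if the pseudo-outcomes equal 2 + 3v exactly,
# the second stage recovers that conditional effect curve.
v = [0.1 * i for i in range(11)]
pseudo = [2 + 3 * vi for vi in v]
b0, b1 = second_stage_ols(v, pseudo)
assert abs(b0 - 2) < 1e-9 and abs(b1 - 3) < 1e-9
```

In practice, the pseudo-outcomes are noisy but (approximately) unbiased for the conditional effect, which is what lets the second-stage regression behave like an oracle regression on the true effect.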

The Projection-Learner
In this subsection, we illustrate the Projection-Learner. We first define a finite dimensional working model, g(v; δ, β), indexed by the incremental intervention parameter δ and a model parameter β. This model could be for the CIE, the CICE, or the CIDE; we would use δ for the CIDE and CIE, and δ_u and δ_l for the CICE. For ease of exposition, we suppress the dependence of g(v; δ, β) on δ (or δ_u and δ_l) and write g(v; β). In what follows, we present results in terms of the CIDE, but the results also apply to the CIE and the CICE.
Before developing the theory and methods for the Projection-Learner, we first discuss criteria for choosing the working model. There are at least two: (1) scientific context and (2) model interpretability vs misspecification. Ideally, researchers incorporate subject-specific knowledge to inform the choice of model based on the application at hand. If that does not decide what working model to use, then researchers should balance the tradeoff between model simplicity and model relevance. For instance, a researcher might choose a simple model, e.g., a model linear in v. Because this is a simple model, one can easily interpret its parameter estimates. However, if the model is severely misspecified, these parameters may have little bearing on the true underlying conditional effect, and so the parameter estimates may be less scientifically relevant. By contrast, a researcher might choose a complex model and believe this better captures the underlying data generating process, but estimates from such a model may be hard to interpret.
We define the projection of the CIDE onto the working model g(v; β) via weighted ℓ₂ distance. Specifically, we define β* as the coefficients corresponding to the least-squares projection,

β* = argmin_β 𝔼[ {τ_cide(V; δ) − g(V; β)}² ].    (14)

One could also incorporate a weight function and use a different distance metric [45]. We set the weights to 1 and focus on ℓ₂ distance for ease of exposition, but all our results follow with other weights and could be extended to other distance metrics.

As long as g(v; β) is differentiable with respect to β, β* is the solution of a moment condition. The moment condition corresponds to setting the first derivative of the right-hand side of (14) with respect to β to zero,

m(β) = E[{τ_cide(V; δ) − g(V; β)} ∂g(V; β)/∂β] = 0.   (15)

Then, the solution β* in (14) satisfies m(β*) = 0 in (15), and the projection of the CIDE onto the working model is g(v; β*).

Remark 3. Our projection approach is different from a proper semiparametric approach, since the definition of β* in equation (14) does not explicitly assume anything about the true conditional effect curve. By contrast, a proper semiparametric approach assumes a finite-dimensional model is correctly specified for the conditional effect curve [52-55].
It is possible to derive an efficient influence function, and thus a semiparametrically efficient estimator, for the moment condition m(β), and by extension for β* and g(v; β*). We use this efficient influence function to construct the Projection-Learner. The primary building block for the efficient influence function of the moment condition is the un-centered efficient influence function for the relevant average effect. The efficient influence functions for the average incremental effect and the average incremental contrast effect were derived in Corollary 2 of Kennedy [11] and are stated in equations (45) and (46) in the Appendix. Meanwhile, Lemma 1 establishes the efficient influence function for the average incremental derivative effect.

Lemma 1. Under Assumptions 1 and 2, the un-centered efficient influence function for the average incremental derivative effect is ξ(Z; δ), where ω(X; δ) is defined in equation (10).
The un-centered efficient influence function, ξ(Z; δ), depends only on the nuisance functions μ(A, X) and π(X), and consists of three terms. The first term is the product of the weight term, ω(X; δ), and an inverse-weighted residual for the outcome model. The second term is the product of the difference in mean outcomes, μ(1, X) − μ(0, X), and an inverse-weighted residual for the propensity score. And the third term is the "plug-in" for the CIDE.
Remark 4. Throughout, we invoke Assumptions 1 and 2 so that the target of estimation is a counterfactual quantity (e.g., the CIDE). If these assumptions do not hold, the results still apply if the targets of estimation are the observed-data functionals on the right-hand side of the identification results in Propositions 1 and 2.
From Lemma 1 and Corollary 2 of Kennedy [11], we can derive the efficient influence function for the moment condition m(β) for estimating the projection of the CIDE, the CIE, or the CICE.
Corollary 1. Let ξ(Z; δ) denote the true un-centered influence function values of the relevant average effect, where ξ(Z; δ) is given in Lemma 1 for the CIDE and is defined analogously for the CIE and the CICE, as shown in equations (45) and (46). Under Assumptions 1 and 2, the un-centered efficient influence function for m(β) with unknown propensity scores and a uniform weight function constructed over ℓ₂ distance is {ξ(Z; δ) − g(V; β)} ∂g(V; β)/∂β, where g(v; β) is the working model.

Corollary 1 motivates estimators for β* and g(v; β*). The first step estimates the un-centered efficient influence function values for the relevant average effect; for example, when estimating the projection of the CIDE, we compute ξ̂(Z; δ), where μ̂(0, X), μ̂(1, X), and π̂(X) are (possibly nonparametric) estimates of the nuisance functions. The second step estimates the population moment condition by solving the empirical moment condition with the estimated un-centered efficient influence function values in place of the true values in m(β). We state the Projection-Learner formally in the following algorithm.
(1) Let D₁ and D₂ denote two independent samples of n observations. On the training data D₁, estimate the nuisance functions μ̂(A, X) and π̂(X).
(2) On the estimation data D₂, estimate the un-centered influence function values ξ̂(Z; δ) using the models μ̂(A, X) and π̂(X) from step 1, where ξ̂(Z; δ) is defined in (17) if the conditional effect of interest is τ_cide, and analogously for τ_cie and τ_cice in equations (45) and (46) in the Appendix.
(3) On the estimation data D₂, estimate β̂ by solving the empirical moment condition P_n[{ξ̂(Z; δ) − g(V; β)} ∂g(V; β)/∂β] = 0.

Algorithm 1 is relatively straightforward. For example, if the working model is quadratic, g(v; β) = β₀ + β₁v + β₂v², this can be achieved in the R language and environment for statistical computing and graphics [20] by running the regression model (which implicitly includes an intercept by convention)

model <- lm(formula = xihat ~ V + I(V^2))

where xihat is calculated from the estimated nuisance functions μ̂(A, X) and π̂(X).
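To make step 3 concrete, here is a minimal Python translation of the R example above. This is a sketch under simulated data: the pseudo-outcomes xihat are drawn from a hypothetical model rather than computed from estimated nuisance functions as in step 2. For a working model that is linear in β, with uniform weights and ℓ₂ distance, solving the empirical moment condition reduces to ordinary least squares of the pseudo-outcomes on the model's derivative basis (1, V, V²).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
V = rng.uniform(-1, 1, n)
# Simulated pseudo-outcomes standing in for the estimated influence function
# values xihat from step 2 (hypothetical data-generating values for display).
xihat = 1.0 + 0.5 * V - 0.25 * V**2 + rng.normal(0.0, 1.0, n)

# Working model g(v; beta) = b0 + b1*v + b2*v^2.  With uniform weights and
# l2 distance, the empirical moment condition
#   P_n[{xihat - g(V; beta)} dg(V; beta)/dbeta] = 0
# is the normal equation of ordinary least squares on the design (1, V, V^2).
X = np.column_stack([np.ones(n), V, V**2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ xihat)

# Check: the empirical moment condition is (numerically) zero at beta_hat.
moment = X.T @ (xihat - X @ beta_hat) / n
```

Because this working model is linear in β, the moment condition coincides with the least-squares normal equations; for models that are nonlinear in β, step 3 would instead call a numerical root-finder.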
Remark 5. The structure of Algorithm 1 and the example code illustrate that the Projection-Learner uses estimated un-centered efficient influence function values for τ_cide(V; δ) as pseudo-outcomes in a parametric least-squares second-stage regression. In Section 3.2, we show that the I-DR-Learner takes the same form, but with a nonparametric second-stage regression.
To guarantee the convergence rates demonstrated in Theorem 1, we could assume Donsker-type or low-entropy conditions for the nuisance functions μ(A, X) and π(X), which might restrict what types of flexible estimators we can use; e.g., this could preclude us from using the LASSO and many types of random forests and neural networks [34,56,57]. Instead, we use sample splitting in step 1 of Algorithm 1 to estimate the nuisance functions; i.e., we split our sample in two, estimate the nuisance functions on the training data D₁, and calculate ξ̂(Z; δ) and solve the empirical moment condition on the estimation data D₂. Sample splitting allows us to condition on the training sample and treat the estimated nuisance functions as fixed, which permits a large class of nuisance function estimators (including, e.g., the LASSO, random forests, and neural networks). A concern one might then have with Algorithm 1 is that it only estimates β̂ on half the sample. To utilize the whole sample for inference, we can improve on Algorithm 1 with cross-fitting: estimate the nuisance functions on both folds (D₁ and D₂), construct the ξ̂(Z; δ) values on the opposite fold (i.e., estimate ξ̂(Z; δ) in D₁ using nuisance functions fit on D₂, and vice versa), and solve the empirical moment condition on the whole dataset (D₁ and D₂ together) [37,58,59]. This cross-fitting approach is also compatible with more folds ("k-fold cross-fitting"), which can be more stable than two-fold cross-fitting.
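The cross-fitting scheme can be sketched generically in Python. The helper below is a hypothetical interface, not code from the paper: it fits the nuisance functions on all folds except one and evaluates the pseudo-outcomes on the held-out fold, so each observation's pseudo-outcome is computed from independently estimated nuisance functions.

```python
import numpy as np

def crossfit_pseudo_outcomes(Z, fit_nuisances, compute_xi, n_folds=2, seed=0):
    """Return pseudo-outcomes where each fold's values are computed using
    nuisance functions fit on the remaining folds.

    fit_nuisances(Z_train) -> fitted nuisance object
    compute_xi(Z_test, nuisances) -> pseudo-outcome values
    (Both callables are user-supplied; their interfaces are illustrative.)
    """
    n = len(Z)
    folds = np.random.default_rng(seed).integers(0, n_folds, n)
    xihat = np.empty(n)
    for k in range(n_folds):
        held_out = folds == k
        nuisances = fit_nuisances(Z[~held_out])          # train off-fold
        xihat[held_out] = compute_xi(Z[held_out], nuisances)
    return xihat

# Toy usage: the "nuisance" is an off-fold mean and xi is a centered outcome.
Z = np.arange(100, dtype=float)
xihat = crossfit_pseudo_outcomes(Z, lambda z: z.mean(), lambda z, m: z - m)
```

With `n_folds=2` this is exactly the two-fold scheme described above; increasing `n_folds` gives k-fold cross-fitting.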
The following theorem shows that the estimator β̂ for β* outlined in Algorithm 1 converges to an asymptotically linear expansion around β*, where the bias is expressed as a product of errors from estimating the nuisance functions μ̂(0, X), μ̂(1, X), and π̂(X). For this result, and the rest of Section 3.1, we let μ = μ(A, X) and π = π(X) denote the generic nuisance functions, μ̂ and π̂ denote the nuisance function estimators, and define μ* and π* as the true nuisance functions (consistent with the projection notation β*).
Theorem 1. Let φ(Z; β) denote the centered efficient influence function from Corollary 1. With Assumptions 1 and 2, also assume conditions (a)-(c) below hold, with nonsingular derivative matrix M = ∂m(β)/∂β evaluated at β*. Then β̂ admits an asymptotically linear expansion around β* whose remainder bias is a product of nuisance estimation errors.

Theorem 1 provides a convergence statement for the coefficient estimate β̂ to the true projection parameter β* under relatively weak conditions. Assumption (a) says that the CATE and the estimate of the CATE are bounded. Assumption (b) says that the derivative of the model g(v; β) with respect to β is bounded, which is quite weak and can be enforced through choice of an appropriate model. Assumption (c) ensures that the influence function φ is not too complex as a function of β, but allows for arbitrary complexity in the nuisance functions; again, this can be enforced with an appropriate choice of g(v; β).

Corollary 2 provides a way to construct an asymptotically valid Wald-style 1 − α confidence interval around g(v; β̂) for any fixed v, where Φ(⋅) is the cumulative distribution function of the standard normal and M̂ is an estimate of the derivative matrix. Furthermore, this corollary demonstrates that β̂ and g(v; β̂) converge at n^{-1/2} rates to Gaussian distributions, centered at β* and g(v; β*), respectively, under less stringent model-agnostic convergence conditions on the nuisance function estimators μ̂(0, X), μ̂(1, X), and π̂(X). Thus, β̂ and g(v; β̂) still attain n^{-1/2} convergence rates if both nuisance functions are estimated at n^{-1/4} rates, which are attainable with nonparametric estimators under relatively realistic assumptions such as smoothness or sparsity [60-63].

Remark 6. These results are doubly robust in spirit, since the remainder bias is expressed as a product of nuisance function errors. However, there is no "double robustness" in the traditional sense, which would only require that one of the two nuisance functions be estimated well. Instead, Corollary 2 additionally requires that the propensity score is estimated well enough. Intuitively, this occurs because incremental interventions shift the observed propensity scores and thus require a good estimate of the propensity score. By contrast, the intervention corresponding to the CATE does not depend on the propensity score, so the convergence rate for the propensity score estimator is less critical, depending on that of the outcome regression.
As demonstrated in Theorem 1 and Corollary 2, the Projection-Learner can attain n^{-1/2} convergence rates to the projection of the true CIDE (or CIE, or CICE) onto the chosen working model g(v; β). If, instead, we wish to target the true conditional effect curve, and that curve does not coincide with the projection, then we need a different estimator, as described in Section 3.2.

The I-DR-Learner
In this section, we outline the I-DR-Learner and illustrate its convergence properties. The I-DR-Learner targets the true conditional effects. Since the conditional effects are not pathwise differentiable, no efficient influence function exists for them. Instead, the I-DR-Learner makes use of the efficient influence function values for the relevant average effect, regressing them against the covariates of interest to estimate the conditional effect. In this way, the I-DR-Learner is a two-stage meta-learner, where the first stage estimates the efficient influence function values for the relevant average effect, and the second stage uses these values as pseudo-outcomes in a regression against the conditioning covariates. The I-DR-Learner is stated formally in the following algorithm:

(1) Let D₁ and D₂ denote two independent samples of n observations. On the training data D₁, estimate the nuisance functions μ̂(A, X) and π̂(X).
(2) On the estimation data D₂, estimate the un-centered influence function values ξ̂(Z; δ) using the models μ̂(A, X) and π̂(X) from step 1, where ξ̂(Z; δ) is defined as in (17) if the conditional effect of interest is τ_cide, and analogously for τ_cie and τ_cice in equations (45) and (46) in the Appendix.
(3) In the estimation sample D₂, regress ξ̂(Z; δ) on the conditioning covariates V to obtain the estimated conditional effect τ̂(v; δ).

Like the Projection-Learner, the I-DR-Learner uses sample splitting and estimates the nuisance functions on a separate sample to avoid imposing Donsker-type conditions on the nuisance function estimators. The I-DR-Learner is also compatible with cross-fitting.
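The two stages can be sketched in Python. Here the second-stage regression is a Nadaraya-Watson kernel smoother, which is a linear smoother, and the pseudo-outcomes are simulated around a toy conditional effect sin(2πv) purely for display, standing in for the ξ̂(Z; δ) values from step 2.

```python
import numpy as np

def nadaraya_watson(v_train, xi_train, v_eval, bandwidth):
    """Kernel regression of pseudo-outcomes on V (a linear smoother)."""
    d = (v_eval[:, None] - v_train[None, :]) / bandwidth
    w = np.exp(-0.5 * d**2)               # Gaussian kernel weights
    return (w @ xi_train) / w.sum(axis=1)

rng = np.random.default_rng(1)
n = 2000
V = rng.uniform(0.0, 1.0, n)
tau = np.sin(2 * np.pi * V)               # toy conditional effect
xihat = tau + rng.normal(0.0, 0.5, n)     # simulated pseudo-outcomes (step 2)

grid = np.linspace(0.1, 0.9, 17)
tau_hat = nadaraya_watson(V, xihat, grid, bandwidth=0.05)
```

Because the conditional mean of the pseudo-outcomes is the target conditional effect, the second-stage smoother recovers sin(2πv) up to smoothing bias and noise.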
The I-DR-Learner can estimate all three conditional effects: the CIE, the CICE, and the CIDE. Furthermore, the error of the estimator is asymptotically equal to that of an oracle estimator under certain conditions. Specifically, the second-stage regression must satisfy the stability condition in Definition 1 [3]. This is a generalization of the classic stochastic equicontinuity condition to nonparametric regression (Lemma 19.24 of [34]), and says that the second-stage regression is stable with respect to a distance metric d if the difference between second-stage regressions with estimated outcomes and true outcomes shrinks appropriately fast. For example, in our context, a stable regression of ξ̂(Z; δ) on V must approach the oracle regression of ξ(Z; δ) on V at an appropriately fast rate. We discuss Definition 1 of Kennedy [3] in more detail in the Appendix. The stability condition is satisfied by the class of linear smoothers (see Theorem 1 of [3]), which includes nearest-neighbor estimators, regression splines, kernel smoothers, series regression, and certain random forests. It is possible that other classes of estimators also satisfy the stability condition, although examining that question is beyond the scope of this work.
Under this stability condition, the error of the I-DR-Learner can be tied to the error of an oracle estimator, which would have access to the true un-centered efficient influence function values for the relevant average effect and would estimate the conditional effect merely by regressing ξ(Z; δ) against V. This approach was considered in Kennedy [3] for estimating the CATE, and their Theorem 2 showed that, under certain assumptions, the error of their DR-Learner exceeds the error of an oracle estimator by at most an amount that depends on the product of errors in estimating the nuisance functions. The same logic holds for the I-DR-Learner, and we formally state the convergence result in the following theorem, slightly amending notation from Section 3.

Theorem 2. Under Assumptions 1 and 2 and Assumption (a) from Theorem 1, also assume that the second-stage regression is stable according to Definition 1 of Kennedy [3]. Then the error of the I-DR-Learner differs from the error of the oracle estimator by terms that are asymptotically negligible compared to the error of the oracle estimator.

Thus, whether the I-DR-Learner achieves oracle efficiency is driven by the asymptotic behavior of the smoothed bias term b̂(X). This bias term is asymptotically bounded by the product of the errors for estimating μ(x) and π(x), plus the squared error for estimating π(x), ‖π̂(x) − π(x)‖². Therefore, the convergence rate of the I-DR-Learner is faster than the convergence rate of the nuisance function estimators. For example, if the nuisance functions are estimated at n^{-1/4} rates, then the bias term b̂(X) will converge to zero at an n^{-1/2} rate. Importantly, Theorem 2 does not require any assumptions about how the estimators μ̂ and π̂ are constructed, beyond the boundedness conditions from Assumption (a) from Theorem 1.
However, the performance of the I-DR-Learner is also constrained by the oracle convergence rate for the second-stage regression. For example, if the conditional effect is Hölder-smooth with smoothness s, then the minimax rate in root mean squared error is n^{-s/(2s+d)}, where d is the dimension of V. This is not surprising: the conditional effect is a regression function, and if we are only willing to assume it lies in a large nonparametric class, then the minimax rate of convergence will be slower than n^{-1/2}. One can also view the slower oracle convergence as a positive aspect of the I-DR-Learner, since it reduces how well the nuisance functions must be estimated to achieve oracle efficiency. For example, if the oracle convergence rate is "only" n^{-1/4}, then the I-DR-Learner can estimate each nuisance function at n^{-1/8} convergence rates and still attain oracle efficiency. When the nuisance functions are estimated well enough and the I-DR-Learner is oracle efficient, confidence bands can be constructed following well-known procedures for nonparametric regression [64].
Both the Projection-Learner and the I-DR-Learner can be used to estimate conditional effect curves across δ and V, thereby quantifying how causal effects vary with V. A natural question is whether there is any treatment effect heterogeneity across V at all. In Section 4, we outline how to quantify and test for treatment effect heterogeneity.

Understanding effect heterogeneity with the V-CIDE
There is a large literature on understanding treatment effect heterogeneity by summarizing the CATE (e.g., [65-68]). In this section, we demonstrate how the V-CIDE, defined in equation (11), can be used to understand effect heterogeneity. To ease exposition, we focus on the case where V = X and examine effect heterogeneity across all covariates. The case where V is a strict subset of X (i.e., V ⊂ X) is outlined in the Appendix. By Proposition 2, the V-CIDE is identified by the expression in equation (13). When the V-CIDE is zero, the derivative is constant across V, and so shifting the treatment distribution has the same effect on all subjects. If the V-CIDE is greater than zero, then there is treatment effect heterogeneity in the incremental effect.
Conditional incremental effects  15 We construct an estimator in two pieces by first noting that the V-CIDE is the difference between two effects since ) }, admits an efficient influence function by the following lemma: Lemma 2. Under Assumptions 1 and 2, the un-centered efficient influence function for τ X δ where ω X δ ; ( ) is defined in equation (10) and Lemma 2 shows that the un-centered efficient influence function for τ X δ ; cide 2 { ( ) } can be written as a weighted residual plus a plug-in.The second effect, τ X δ ; cide 2 { ( )} , is also pathwise differentiable and admits an efficient influence function.However, since it is a smooth transformation of an already pathwise differentiable function, we estimate it by squaring the estimator based on the efficient influence function for τ X δ ; cide { ( )} provided in Lemma 1.Therefore, informed by Lemmas 1 and 2, we propose the estimator where we omit δ, X , and Z arguments and let = μ μ a X , a ( ) for brevity, and where ω φ ϕ , ,    indicate the relevant formulae from (10), (18), and ( 19), but with the estimated nuisance functions (e.g., ), and π X ( ).
As before, Algorithm 3 uses sample splitting to estimate the nuisance functions, which allows for estimating them with flexible machine learning models. Again, this estimator could use cross-fitting by repeating the algorithm with D₁ and D₂ reversed and then averaging the two estimates. We establish the error guarantees of the estimator in the following result.
Theorem 3 shows that the estimator for the V-CIDE satisfies a version of double robustness under relatively weak conditions. Assumption (a) says that the efficient influence function for the average derivative effect and its estimate are bounded, which is a mild assumption. Then, if both nuisance function estimators converge at n^{-1/4} rates, the standardized difference between the estimator and the V-CIDE has a Gaussian limiting distribution. This is a slightly stronger requirement than that of Corollary 2, since both nuisance functions must be estimated at n^{-1/4} rates, not just the propensity score. This occurs due to the nonlinearity of E[τ_cide(X; δ)²] in terms of μ. Nonetheless, this result is still model-agnostic about the nuisance function estimators, and the convergence requirement can be satisfied by nonparametric estimators under suitable smoothness or sparsity. Theorem 3 suggests constructing Wald-style 1 − α confidence intervals as in equation (23), where σ̂² is the sample variance estimator for σ² defined in equation (22); i.e., the empirical variance of the estimated influence function values.
Unfortunately, the estimator in Algorithm 3 converges to a degenerate distribution when V{τ_cide(X; δ)} = 0, because the efficient influence function values are identically zero and σ² = 0 in equation (22). So, the confidence interval in (23) is not valid in that case; instead, we construct the conservative confidence interval in (25), using consistent estimators of the variances of the estimators in equations (20) and (21). This confidence interval suggests the one-sided test for treatment effect heterogeneity given in (28). This test controls Type I error at the appropriate level, as shown in the following result.
Proposition 3. Under Assumptions 1 and 2, Assumption (a) from Theorem 1, and Assumption (a) from Theorem 3, if the nuisance functions are estimated sufficiently well, then the asymptotic Type I error rate of the test in (28) is less than or equal to α.
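The logic of the one-sided test can be illustrated with a toy Python sketch. This is not the exact construction in equations (23)-(28); it is a hypothetical conservative test that combines delta-method standard errors for the two component estimators, with phi1 and phi2 simulated influence-function values targeting E[τ] and E[τ²].

```python
import numpy as np

def heterogeneity_test(phi1, phi2, alpha=0.05):
    """One-sided test of H0: no effect heterogeneity (V-CIDE = 0).

    Because the efficient influence function is degenerate under H0, this
    sketch combines separate standard errors for the two component
    estimators into a conservative standard error (illustrative only)."""
    n = len(phi1)
    est = phi2.mean() - phi1.mean()**2
    se2 = phi2.std(ddof=1) / np.sqrt(n)                         # for E[tau^2]
    se1 = 2 * abs(phi1.mean()) * phi1.std(ddof=1) / np.sqrt(n)  # delta method for (E[tau])^2
    z = {0.05: 1.6449, 0.01: 2.3263}[alpha]   # standard normal quantiles
    return bool(est > z * (se1 + se2))        # conservative combination

rng = np.random.default_rng(3)
n = 5000
tau_het = rng.normal(1.0, 0.5, n)   # heterogeneous effects, V-CIDE = 0.25
tau_hom = np.full(n, 1.0)           # homogeneous effects, V-CIDE = 0

def simulate_phis(tau):
    # Hypothetical influence-function values around tau and tau^2.
    return tau + rng.normal(0.0, 0.1, n), tau**2 + rng.normal(0.0, 0.1, n)

p1_het, p2_het = simulate_phis(tau_het)
p1_hom, p2_hom = simulate_phis(tau_hom)
```

Under the heterogeneous scenario the test rejects; under the homogeneous scenario it does not.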
Remark 7. In the causal inference literature, at least two other solutions have been proposed for constructing confidence intervals when an estimator converges to a degenerate distribution. Our approach is similar to that of Williamson et al. [69], who focus on testing variable importance. Luedtke et al. [68] proposed a different approach: they derived the higher-order influence function for their parameter and constructed an associated estimator that achieves n^{-1} convergence under n^{-1/4} conditions on the nuisance function estimators.
Remark 8. When we do not know the true parameter value and want to construct a valid confidence interval (rather than conduct a test), we can combine the confidence intervals in (23) and (25) to construct a conservative confidence interval. A benefit of the V-CIDE is that it does not require positivity, unlike standard tests for effect heterogeneity, such as those based on the variance of the CATE (e.g., tests that use the statistic V{τ_cate(X)} [68]). However, a drawback of our approach is that the test based on the V-CIDE may be less powerful than those based directly on the CATE, depending on δ and the distribution of the propensity score, which determine the magnitude of the weight multiplying the CATE in the identified formula for the V-CIDE on the right-hand side of equation (13).
In the Appendix, we present several simulations that demonstrate the properties of the Projection-Learner and the I-DR-Learner. In short, the Projection-Learner achieves correct coverage for the projection parameter, and the I-DR-Learner achieves oracle efficiency when the nuisance functions are estimated well enough. In Section 5, we apply these estimators to real ICU data and demonstrate how they can uncover interesting phenomena that would be obscured by looking at effects under deterministic interventions, like the ATE and the CATE.

Data analysis of the effect of ICU admission on mortality
In this section, we illustrate the I-DR-Learner and the estimator for the V-CIDE by analyzing data from the (SPOT)light prospective cohort study, in which investigators collected data on ICU transfers and mortality. The data comprise a cohort of 13,011 patients with deteriorating health who were assessed for critical care unit admission across 49 National Health Service hospitals in the UK between November 1, 2010 and December 31, 2011 [8,70].
Previous literature has considered whether admission to the ICU reduces mortality [71,72], where the relevant exposure of interest is a binary indicator of whether someone was admitted to the ICU. Recent analyses have estimated the ATE or used ICU bed availability as an instrumental variable to estimate the local average treatment effect (LATE) [8]. Flexible estimation of the ATE finds that the ICU is harmful, whereas estimates of the LATE find a null effect, albeit with wide confidence intervals. However, there are two limitations to focusing on the ATE or LATE in this application. First, the relevant counterfactual interventions where everyone is sent to the ICU or no one is sent to the ICU may not be feasible (e.g., the ICU might not have the capacity to admit everyone), but an intervention that makes it more or less likely that people are sent to the ICU could be feasible. Second, one might expect a priori that the positivity assumption is violated, in the sense that some patients, depending on their condition, may be almost certain to be admitted or never admitted to the ICU. Indeed, this is validated by the data, as shown in Figure 2; thus, an intervention that does not require positivity is desirable for this application. Finally, understanding effect heterogeneity would be of great interest in this application, since it may be the case that the ICU is helpful for some patients while unhelpful or even harmful for others. While there are other interventions one might consider for describing effect heterogeneity under realistic interventions that are robust to positivity violations [9,10,13,17], incremental interventions are a natural candidate because they take an intuitive parameterization for a binary treatment and are robust to positivity violations when the propensity scores equal zero or one.

Data
The data contain 28-day mortality as the outcome variable and a binary indicator for whether someone was admitted to the ICU. The data also contain detailed demographic, physiological, comorbidity, and mortality information for all patients. In terms of demographic information, the data include age, sex, septic diagnosis (0/1), and peri-arrest status (0/1). In terms of physiology data, there are three risk scores: the ICNARC physiology score [73], the NHS National Early Warning score [74], and the Sepsis-related Organ Failure Assessment score [75]. Finally, the data also record the patient's existing level of care at assessment and recommended level of care after assessment, which were defined using the UK Critical Care Minimum Dataset levels of care. We used all these covariates in our analysis and also included ICU bed availability, which is a binary measure of whether fewer than four ICU beds were available at the time of assessment. We use a publicly available version of the dataset, which contains the same number of observations sampled with replacement from the original dataset; we provide this dataset with our code at https://github.com/alecmcclean/NPCIE.

Method
We consider the counterfactual 28-day mortality rate if we increased or decreased the odds of ICU admission according to an incremental intervention. We use the I-DR-Learner to nonparametrically estimate the CIE and the CIDE over the ICNARC physiology score. We focus on the ICNARC score because it is a measure of the health risk of the patient (higher being riskier), and a natural question is whether the ICU affects healthier and sicker patients differently. Then, we estimate the V-CIDE to test for treatment effect heterogeneity across a continuum of policies. The nuisance functions π̂ and μ̂ (and τ̂_cide(δ) when estimating the V-CIDE) were estimated with random forests via the ranger package, and the efficient influence function pseudo-outcomes for the CIE were constructed using the npcausal package in R [21,77]. The I-DR-Learner second-stage regression was estimated with a smoothing spline via the gam function in the mgcv package in R [78]. R code demonstrating how our analyses were implemented is provided at https://github.com/alecmcclean/NPCIE, and functions for estimating the average incremental derivative effect and the V-CIDE are available in the npcausal package [21].

Results
Figure 2 shows estimated propensity scores by ICNARC score, which confirms the prior intuition that positivity might be violated in these data: for most ICNARC scores, there are estimated propensity scores very near 0 and 1. Figure 3(a) shows that the CIE varies across δ for all ICNARC scores. Estimated counterfactual 28-day mortality is lowest under the observed treatment process (when δ = 1) and increases when the odds of ICU admission increase (δ > 1) or decrease (δ < 1). We also see strong evidence that the CIE varies across ICNARC score, and mortality increases as the ICNARC score increases. This agrees with what one might expect, since the ICNARC score is a risk measure where a higher score denotes a patient with a higher risk of death. However, this does not necessarily indicate treatment effect heterogeneity, since one would need to consider a contrast between two levels of the CIE to understand effect heterogeneity.
To ease presentation, Figure 3(b) shows the CIE across δ for only four ICNARC scores (0, 15, 30, and 40, chosen because they are roughly evenly spaced across the range of ICNARC scores). Examining only four curves shows more clearly that the shape of the CIE at each ICNARC score is very similar, suggesting that perhaps there is little heterogeneity. Figure 3(b) also illustrates a further nuance. Previous work estimated the ATE and found that mortality rates would be higher if everyone were admitted to the ICU than if no one were admitted [8]. Taken at face value, this suggests that hospitals ought to send fewer people to the ICU; however, due to positivity issues in the data, ATE estimates are likely invalid. The difference between the endpoints of the curves in Figure 3(b) is consistent with those estimates, since the mortality rate at δ = 5 is higher than at δ = 0.2. But, by examining the curve across the spectrum of interventions, one would also observe that mild interventions correspond to small changes in mortality rates. Therefore, our analysis validates previous research, in the sense that it estimates mortality to be lower when no one is admitted to the ICU compared to when everyone is admitted, but it also suggests a different practical implication, since one would conclude from our analysis that realistic interventions might only lead to small changes in mortality rates. This highlights how examining a spectrum of interventions can be more informative than examining a single contrast like the ATE.
Meanwhile, Figure 4 shows the CIDE across ICNARC score for five δ values; the CIDE is generally very near zero and is only significantly different from zero at a few points across δ and ICNARC score. Figure 5 shows that there is significant treatment effect heterogeneity across ICNARC score with 95% confidence intervals, but that the magnitude of the effect is very small, since the estimate for the V-CIDE is very close to zero for all δ values. Taken together, Figures 4 and 5 demonstrate that there is little effect heterogeneity across ICNARC scores: increasing the odds of ICU admission affects subjects with different ICNARC scores similarly.

Discussion
In this work, we introduced three conditional effects based on incremental propensity score interventions: the conditional incremental effect (CIE), the conditional incremental contrast effect (CICE), and the conditional incremental derivative effect (CIDE). We proposed two estimators, the Projection-Learner and the I-DR-Learner, which can be used to estimate any of the three conditional effects. We showed that the Projection-Learner, a projection approach, achieves parametric efficiency under weak n^{-1/4} conditions on the nuisance function estimators, and that the I-DR-Learner, a nonparametric estimator, achieves oracle efficiency under similarly weak conditions. We also proposed a fourth effect, the variance of the CIDE (V-CIDE), which is a one-dimensional summary of effect heterogeneity. For the V-CIDE, we proposed a new estimator, also with doubly robust-style properties, and outlined methods for inference and for testing for treatment effect heterogeneity.
Finally, we illustrated our methods with a real data analysis of the effect of ICU admission on mortality conditional on a patient's risk score. This analysis demonstrated that estimating counterfactual mean outcomes across a spectrum of incremental interventions can be more informative than estimating the ATE alone. We found evidence that the ATE is positive, suggesting that sending no one to the ICU is better than sending everyone to the ICU in terms of average mortality rates. However, by examining the spectrum of incremental interventions, we estimated that average mortality changes little with mild changes in ICU admission rates. Further, we found that there is indeed statistically significant treatment effect heterogeneity across patient risk scores, but the magnitude of the heterogeneity is small. A limitation of this analysis is that it assumed there were no unmeasured confounders, which might be implausible. The sensitivity of the results to possible unmeasured confounders could be examined in future work; to our knowledge, sensitivity analyses for incremental propensity score interventions have not yet been developed.
There are other interesting avenues for future investigation. Here we proposed CIE estimators in the simplest data-generating setup: one time point and a binary treatment. Several natural extensions of this work to more complex frameworks are (i) time-varying data, (ii) incremental parameters that can depend on covariate data or past data, and (iii) multi-valued or continuous treatments with different stochastic interventions. Since positivity violations are almost guaranteed with time-varying data or multi-valued or continuous treatments, it would also be important to understand how nonparametric estimators behave and how projection approaches might be utilized to approximate ATE-style effects when positivity is violated.

Appendix A Stability condition for Theorem 2
In this section, we state the stability condition invoked in Section 3.2 and Theorem 2. This stability condition is described in detail in Section 3 of Kennedy [3], and can be viewed as a form of stochastic equicontinuity for nonparametric regression.

Let $D^n = \{Z_1, \ldots, Z_n\}$ and $\widetilde{D}^n = \{\widetilde{Z}_1, \ldots, \widetilde{Z}_n\}$ be independent training and estimation samples of $n$ observations, where $X \subset Z$ are covariates (e.g., $Z = (X, A, Y)$). Let $\hat{f}(z)$ be an estimate of some function of the data, $f(z)$, constructed from the training data, and let $\hat{\mathbb{E}}_n$ denote a generic regression estimator that regresses outcomes on covariates in the estimation sample. The regression estimator $\hat{\mathbb{E}}_n$ is defined as stable (with respect to a distance metric $d$) if the condition in Definition 1 holds. Informally, Definition 1 says that the difference between the regression of the estimated outcomes $\{\hat{f}(\widetilde{Z}_i)\}$ and the regression of the true outcomes $\{f(\widetilde{Z}_i)\}$ converges to zero appropriately fast. This definition can be viewed as a generalization of the classic stochastic equicontinuity condition, in which the fixed function class is replaced by the estimated outcomes and the denominator $1/\sqrt{n}$ is replaced by the pointwise root mean squared error of the oracle estimator. The stability condition is satisfied by linear smoothers, as demonstrated in Theorem 1 of Kennedy [3], and may be satisfied by broader classes of estimators.
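For intuition, the classic stochastic equicontinuity condition that Definition 1 generalizes can be written as follows (our rendering, with $\mathbb{P}_n$ the empirical measure and $P$ the true distribution):

```latex
% Classic stochastic equicontinuity: for an estimated function \hat{f}
% with \| \hat{f} - f \| \to_p 0, the centered empirical process is
% asymptotically insensitive to replacing f by \hat{f}:
\frac{(\mathbb{P}_n - P)(\hat{f} - f)}{1/\sqrt{n}}
  \;=\; \sqrt{n}\,(\mathbb{P}_n - P)(\hat{f} - f)
  \;\overset{p}{\longrightarrow}\; 0 .
```

Stability replaces the centered empirical process with the second-stage regression estimator evaluated at estimated versus true outcomes, and replaces the $1/\sqrt{n}$ denominator with the oracle estimator's pointwise root mean squared error.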
Appendix B Estimator for the variance of the conditional incremental derivative effect (V-CIDE) when the conditioning covariate is a strict subset of all covariates

In this section, we briefly outline an estimator for the V-CIDE when the conditioning covariates $V$ are a strict subset of all covariates $X$ (i.e., $V \subset X$) and discuss the convergence properties of the associated estimator. As a reminder, the variance of the CIDE is identified as in Section 4. By the definition of the variance and iterated expectation, the estimand decomposes into an expected square term and a squared expectation term. The squared expectation term, $\mathbb{E}\{\tau_{\mathrm{cide}}(X; \delta)\}^2$, is the same as the one that appears when $V = X$, and therefore can be estimated as in equation (21) in Section 4. The expected square term is new, and we derive its efficient influence function in the following result.

Lemma 3. Under Assumptions 1 and 2, the un-centered efficient influence function for the expected square term depends on $\tau(V; \delta)$, $\omega = \omega(X; \delta)$, $\varphi = \varphi(Z; \delta)$, and $\phi = \phi(Z; \delta)$, as defined in equations (10), (18), and (19).

This result suggests a natural one-step estimator for the expected square term.
Combined with the results in Sections 3.2 and 4, this suggests the following estimator for the V-CIDE:

(1) On the training data $D_1$, estimate the nuisance functions $\mu(a, X)$ and $\pi(X)$.
(2) On the estimation data $D_2$, estimate the un-centered influence function values $\hat{\xi}(Z; \delta)$ using the models $\hat{\mu}$ and $\hat{\pi}$ from step 1, where $\xi(Z; \delta)$ is defined in (17) if the conditional effect of interest is $\tau_{\mathrm{cide}}$, and analogously for $\tau_{\mathrm{cie}}$ and $\tau_{\mathrm{cice}}$ in equations (45) and (46).
(3) In the estimation sample $D_2$, regress $\hat{\xi}(Z; \delta)$ on the conditioning covariates $V$ to obtain the estimated conditional effect $\hat{\tau}(V; \delta)$.
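The split-sample, pseudo-outcome construction in the steps above mirrors the DR-Learner pattern. Below is a minimal Python sketch of that pattern (the paper's own code is in R). As a stand-in for the incremental influence function values $\hat{\xi}(Z;\delta)$ of equation (17), which we do not reproduce here, the sketch uses the familiar AIPW-style pseudo-outcome; all function and variable names, and the toy data generating process, are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Toy data: two covariates, logistic treatment, outcome depends on A * X_1.
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-4, 4, size=(n, 2))
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
Y = rng.normal(A * X[:, 0], 1.0)

# Step 1: on the training half D1, estimate pi(x) and mu(a, x).
train = np.arange(n) < n // 2
est = ~train
pi_hat = RandomForestClassifier(random_state=0).fit(X[train], A[train])
mu_hat = RandomForestRegressor(random_state=0).fit(
    np.column_stack([X[train], A[train]]), Y[train])

# Step 2: on the estimation half D2, form pseudo-outcomes.
# PLACEHOLDER: an AIPW pseudo-outcome, standing in for xi_hat(Z; delta).
pi2 = np.clip(pi_hat.predict_proba(X[est])[:, 1], 0.01, 0.99)
mu1 = mu_hat.predict(np.column_stack([X[est], np.ones(est.sum())]))
mu0 = mu_hat.predict(np.column_stack([X[est], np.zeros(est.sum())]))
A2, Y2 = A[est], Y[est]
xi_hat = (mu1 - mu0
          + A2 * (Y2 - mu1) / pi2
          - (1 - A2) * (Y2 - mu0) / (1.0 - pi2))

# Step 3: second-stage regression of pseudo-outcomes on V = X_1 only.
second_stage = RandomForestRegressor(random_state=0).fit(X[est, :1], xi_hat)
tau_hat = second_stage.predict(X[est, :1])
print(tau_hat[:3])
```

In practice, the roles of the two folds would also be swapped and the estimates averaged (cross-fitting), as is standard for these two-stage estimators.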
This estimator satisfies a double robustness condition similar to that of the estimator outlined in Section 4, but includes a dependence on $\tau(V; \delta)$, with $\omega$, $\varphi$, and $\phi$ as defined in equations (10), (18), and (19).
The theorem shows that the estimator for the V-CIDE satisfies a version of double robustness under relatively weak conditions. The result shows that our estimator attains $n^{-1/2}$ convergence to the V-CIDE under model-agnostic $n^{-1/4}$ convergence rates for the nuisance function estimators and the I-DR-Learner. This differs from Theorem 2: here it is required that the CIDE is estimated at an $n^{-1/4}$ rate, but it is no longer required that $\hat{\mu}$ is estimated at an $n^{-1/4}$ rate. As discussed in the body of the article, $n^{-1/4}$ rates are achievable with nonparametric estimators under suitable smoothness or sparsity.

Like the estimator from Algorithm 3, the estimator in Algorithm 4 converges to a degenerate distribution when the V-CIDE equals zero. As discussed at the end of Section 4, we can construct a valid test for any treatment effect heterogeneity by overestimating the variance of the estimator $\hat{\psi}_n$. When $V \subset X$, an analogous asymptotically valid level $1 - \alpha$ test can be constructed.

Appendix C Simulations for the Projection-Learner and I-DR-Learner

In this section, we study the performance of the Projection-Learner and the I-DR-Learner for estimating the conditional incremental contrast effect (CICE) with $\delta_u = 5$ and $\delta_l = 0.2$; R code for these simulations is provided at https://github.com/alecmcclean/NPCIE. As a reminder, the CICE corresponds to the difference between the counterfactual mean outcome when the odds of treatment are multiplied by 5 and the counterfactual mean outcome when the odds of treatment are divided by 5. For all the analyses, we simulate 1,000 datasets of each size $n \in \{1{,}000,\ 10{,}000\}$. In each dataset, we generate covariates $X$, a binary treatment $A \sim \mathrm{Bernoulli}\{\pi(X)\}$, and an outcome $Y$. For each dataset, we specify a quadratic CICE, and the CATE is defined implicitly, which follows from the identification result in Proposition 1. The data generating process is illustrated in Figure A1. We simulated estimates of the propensity scores and regression functions by adding noise, parameterized by $\alpha$, to the true nuisance functions.
The $\alpha$ parameter allows us to control how well the nuisance functions are "estimated." For example, $\alpha = 0.1$ corresponds to estimating a nuisance function with error converging at rate $n^{-1/10}$. We scale the error for the regression functions $\mu$ by the range of regression function values; this is purely a computational trick so that neither nuisance function's error dominates the other, and it does not affect the convergence rates of the estimators.
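The noise-injection device can be mimicked in a few lines. The sketch below is a hedged illustration of the idea (Gaussian perturbations whose standard deviation scales as $n^{-\alpha}$), not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_nuisance(truth, alpha, n, rng):
    """Perturb true nuisance values with noise whose root mean squared
    error scales as n^(-alpha), mimicking an estimator that converges
    at that rate."""
    return truth + rng.normal(0.0, n ** (-alpha), size=truth.shape)

n = 10_000
X = rng.uniform(-4, 4, n)
pi_true = 1.0 / (1.0 + np.exp(-0.5 * X))  # a logistic propensity score

for alpha in (0.1, 0.25, 0.5):
    pi_hat = noisy_nuisance(pi_true, alpha, n, rng)
    rmse = np.sqrt(np.mean((pi_hat - pi_true) ** 2))
    print(f"alpha = {alpha}: empirical RMSE {rmse:.4f} vs n^-alpha = {n ** -alpha:.4f}")
```

In practice, the perturbed propensity scores would also be truncated away from 0 and 1, and, as noted above, the regression-function noise would be rescaled by the range of the regression values.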
First, we compare the I-DR-Learner to the oracle estimator ("Oracle I-DR-Learner") and a baseline learner ("Baseline CICE") in terms of integrated mean squared error (MSE). The oracle estimator constructs the true influence function values from $\mu(a, X)$ and $\pi(X)$ and regresses them on $X$. Both the oracle estimator and the I-DR-Learner use the smooth.spline function in R for the second stage regression. The baseline estimator is a plug-in estimator that computes the CICE directly from the estimated nuisance functions. It is motivated by causal identification, such as the result in Proposition 1, and does not make use of the efficient influence function for the relevant average effect. The analogous baseline estimator for the CATE has been examined previously in the literature, where it is referred to as the "T-Learner" [4]. The results of these simulations are summarized in Figure A2. Each panel corresponds to a combination of the convergence rate $\alpha_\mu$ for estimating $\hat{\mu}$ and the sample size: $\alpha_\mu$ increases from left to right, while $n = 1{,}000$ is shown in the top row and $n = 10{,}000$ in the bottom row. The x-axis shows the convergence rate $\alpha_\pi$ for estimating $\hat{\pi}$, and the y-axis shows the integrated MSE for each estimator. Finally, each estimator is denoted by a different color, and the points and whiskers show the sample mean and 95% confidence interval of the MSE over 1,000 simulations.
Figure A2 illustrates the phenomenon anticipated by Theorem 2. The oracle estimator performs the best, which we would expect since it has access to the true nuisance functions. The I-DR-Learner performs next best, and its error approaches that of the oracle estimator as $\min(2\alpha_\pi,\ \alpha_\pi + \alpha_\mu)$ increases, while the baseline learner, whose error is additive in the nuisance function estimators' errors, fares the worst.

C.1 Coverage of the Projection-Learner
In this subsection, we outline results for the Projection-Learner. Specifically, we show that the Projection-Learner achieves approximately correct coverage for the true coefficients in the model. Since the true CICE is quadratic, we use a quadratic working model.

Since the working model is well-specified, the Projection-Learner estimates the true coefficients. Figure A3 shows the coverage of 95% confidence intervals constructed for each coefficient using the sandwich variance, as in Corollary 2. When the nuisance function estimators have large error, such that $\alpha_\pi + \alpha_\mu < 0.5$, coverage suffers, as discussed below.

Appendix D Identification of the CIDE

If the relevant counterfactual mean is continuous in $x$ and $\delta$ and its partial derivative with respect to $\delta$ is absolutely integrable, then the CIDE is identified as in Proposition 1.

Proof. In the identifying derivation, the first line follows by definition, the second by Assumptions 1 and 2, and the third and final line by exchanging expectation and derivative, taking the derivative with respect to $t$, rearranging, and setting $t = \delta$. □

Appendix E Proofs for results in Section 3

Lemma 1. Under Assumptions 1 and 2, the un-centered efficient influence function for the average incremental derivative effect is $\xi(Z; \delta)$, where $\omega(X; \delta)$ is defined in equation (10).

Proof. Let $\bar{P}$ and $P$ be two different distributions at which the functional $\psi$ is evaluated. Rearranging the resulting expansion, the remainder $R_2$ in (43) collects the second-order errors, where $\bar{\mathbb{E}}$ denotes expectation under the distribution $\bar{P}$. By the definition of $\xi$, $R_2$ is second order; we show this in the postscript to this proof and provide some intuition here. The first line of (43) is clearly second order. The second line of (43) is second order because $\phi$ is the un-centered efficient influence function of $\omega$, so the second multiplicand on that line is the error term for estimating $\omega$, which we would expect to be second order. The third line of (43) is second order because it is the product of the errors of two plug-in estimators, and that product can be expressed as a product of errors. Next, we relate $\xi(Z; \delta)$ back to scores of smooth parametric submodels. Recall from semiparametric efficiency theory that the nonparametric efficiency bound for a functional is given by the supremum of Cramér-Rao lower bounds for that functional across smooth parametric submodels [31,35]. The efficient influence function is the unique mean-zero function $\xi$ that is a valid submodel score satisfying pathwise differentiability, i.e., equation (44), for any smooth parametric submodel indexed by $\varepsilon$.
with $R_2$ defined in (43), where the second line follows because $\psi(\bar{P})$ does not depend on $\varepsilon$, the third by an exchange of derivative and integral, and the fourth and final line from the preceding display, which shows that $\xi$ satisfies the property in (44). The last equation involving $R_2$ follows because $R_2$ consists only of second-order products of errors between $P_\varepsilon$ and $\bar{P}$: the derivative is therefore a sum of terms, each the product of a derivative term that need not equal zero and an error term involving differences of components of $\bar{P}$ and $P_\varepsilon$, which is zero at $\varepsilon = 0$ since $P_0 = \bar{P}$. Since the model is nonparametric, the tangent space is the entire Hilbert space of mean-zero finite-variance functions, so there is only one influence function satisfying (44), and it is the efficient one [32]. It remains to verify that $R_2$ is second order, as stated above. Starting with the third line in (43), and omitting arguments, the term is a product of plug-in errors. For the second line in (43), the second-order term can likewise be expressed as a product of errors, as shown in the postscript to the proof of Lemma 1. We also assume: (c) $\varphi(Z; \beta, \eta)$ is Donsker in $\beta$ for any fixed $\mu, \pi$.
(d) The estimators are consistent, in the sense that $\|\hat{\beta} - \beta^*\| = o_P(1)$ and the estimated nuisance functions converge to the truth. (e) The map $\beta \mapsto \mathbb{P}\{\varphi(Z; \beta, \eta)\}$ is differentiable at $\beta^*$, with nonsingular derivative matrix $M$. Then the expansion stated in the theorem holds.

Proof. This proof follows closely Lemma 3 of Kennedy et al. [45]. The first term appears directly in the statement of the theorem, so we do not manipulate it; it is a sample average of a fixed function, and so by the central limit theorem it is asymptotically Gaussian. The second and third terms are empirical process terms. The fourth term can be linearized in $(\hat{\beta} - \beta^*)$, and we use this to rearrange and solve for the statement in the theorem. The fifth term captures the nuisance estimation error and appears implicitly in the statement of the theorem.

With Assumptions 1 and 2, and Assumption (a) from Theorem 1, also assume that the second stage regression is stable according to Definition 1 of Kennedy [3]. Then the oracle efficiency result follows.

Figure 1 :
Figure 1: CIE curves for select ICNARC scores. The x-axis represents the incremental intervention parameter $\delta$, where $\delta = 1$ corresponds to no intervention, and $\delta > 1$ and $\delta < 1$ correspond to increasing and decreasing the likelihood of admission to the ICU, respectively. The y-axis shows the estimated mortality rate. The curves depict the estimated CIE for different ICNARC scores, which measure mortality risk. Our analysis shows that for each ICNARC score, the mortality rate decreases when fewer ($\delta < 1$) people are sent to the ICU and increases when more ($\delta > 1$) people are sent to the ICU, but mild changes in ICU admission rates lead to minimal changes in mortality rate.

Figure 3 :
Figure 2 shows estimated propensity scores by ICNARC score, which confirms the prior intuition that positivity might be violated in these data: for most ICNARC scores, there are estimated propensity scores very near 0 and 1. Figure 3(a) shows that the CIE varies across $\delta$ for all ICNARC scores. Estimated counterfactual 28-day mortality is lowest under the observed treatment process (when $\delta = 1$) and increases when the odds of ICU admission increase ($\delta > 1$) or decrease ($\delta < 1$). We also see strong evidence that the CIE varies across ICNARC score, and mortality increases as the ICNARC score increases. This agrees with what one might expect, since the ICNARC score is a risk measure where a higher score denotes a patient with a higher risk of death. However, this does not necessarily suggest treatment effect heterogeneity, since one would need to consider a contrast between two levels of the CIE to understand effect heterogeneity. To ease presentation, Figure 3(b) shows the CIE across $\delta$ for only four ICNARC scores (0, 15, 30, and 40, chosen because they are roughly evenly spaced across the range of ICNARC scores). Examining only four curves shows more clearly that the shape of the CIE at each ICNARC score is very similar, suggesting that perhaps there is little heterogeneity. Figure 3(b) also illustrates a further nuance. Previous work estimated the ATE and found that mortality rates would be higher if everyone were admitted to the ICU than if no one were admitted [8]. Taken at face value, this suggests that hospitals ought to send fewer people to the ICU; however, due to positivity issues in the data, ATE estimates are likely invalid. The difference between the endpoints of the curves in Figure 3(b) (i.e., $\tau_{\mathrm{cice}}(v; \delta_u = 5, \delta_l = 0.2)$) suggests a conclusion similar to that implied by ATE estimates.

Theorem 4 .
Let $\hat{\psi}_n$ denote the estimator from Algorithm 4. Under Assumptions 1 and 2, Assumption (a) from Theorem 1, and Assumption (a) from Theorem 3, the stated convergence guarantee holds.

The simulated data generating process is shown in Figure A1. The covariate $X$ is one-dimensional and uniform over $[-4, 4]$. The propensity score, shown in the top panel of Figure A1, follows a logistic model and remains within reasonable bounds on the support of $X$, since $\pi(-4) \approx 0.12$ and $\pi(4) \approx 0.88$. The outcome regressions are complicated discontinuous functions, shown in the middle panel of Figure A1. The treatment $A$ and outcome $Y$ are defined implicitly from the propensity score and regression functions. The second panel also shows the CATE, which is a smooth function; this is because the CICE and the propensity scores are smooth functions. The CICE is shown in the bottom panel of Figure A1.

When the errors of the nuisance function estimators are large, the confidence intervals have poor coverage. As the errors of the nuisance function estimators decrease, the coverage of the confidence intervals for each coefficient approaches 95%.

Figure A3 :
Figure A3: Coverage of Projection-Learner confidence intervals for coefficient estimates.
By adding and subtracting terms on the right-hand side of the equation above and omitting $Z$, we obtain five terms. First, we tackle the second and third terms. Under the Donsker and consistency conditions in Assumptions (c) and (d), the second term is $o_P(1/\sqrt{n})$ by Lemma 19.24 of van der Vaart [34]. Further, under the consistency of $\varphi(\beta, \hat{\eta})$ in Assumption (d) and by sample splitting, the third term is $o_P(1/\sqrt{n})$ by Lemma 2 of Kennedy et al. [79]. The fourth term, by the differentiability of the map $\beta \mapsto \mathbb{P}\{\varphi(Z; \beta, \eta)\}$, admits a first-order Taylor expansion about $\beta^*$, whose second line follows by the consistency of $\hat{\beta}$. The final result follows by Cauchy-Schwarz and a boundedness condition, which depends on the target estimand. If the estimand is a projection of the CIDE, the result follows by logic equivalent to the proof of Lemma 1 and the first part of Assumption (a), which says that the estimated CATE is bounded. If, instead, the estimand is a projection of the CIE or the CICE, the result follows by logic equivalent to the proofs of Lemmas 5 and 6 in the appendix of Kennedy [11] and the second part of Assumption (a), which says that the true CATE is bounded. □

Theorem 2. Let $\tau_{i\text{-}dr}$ stand in for $\tau_{\mathrm{cide}}$, $\tau_{\mathrm{cie}}$, or $\tau_{\mathrm{cice}}$, and let $\xi(Z; \delta)$ denote the true influence function values of the relevant average effect. Furthermore, let $\hat{\tau}_{i\text{-}dr}$ denote the I-DR-Learner from Algorithm 2.
where the second line follows by the proof of Lemma 2. For the second term on the final line above, the fourth line follows by Assumption (a), the fifth by boundedness, and the sixth by Assumption (a), which implies that $\omega\tau$ is bounded, together with the proof of Lemma 1. The second term from equation (49), the empirical process term, is simpler to bound: by Lemma 2 of Kennedy et al. [79] and by sample splitting, it is asymptotically negligible. Throughout, we let $\xi$, $\mu$, and $\pi$ denote the true efficient influence function values and nuisance functions.
i dr  ( ) denote the I-DR-Learner from Algorithm 2. With Assumptions 1 and 2, would under-cover the true parameter.Instead, we can construct a conservative estimate of the variance by noting that the efficient influence function values of τ Following essentially the same logic as in Lemma 1, by iterated expectations and rearranging, .We can re-arrange and by the non-singularity of the derivative matrix M in Assumption (e) we can pre- multiply both sides by * * − Conditional incremental effects  37 Proof.As in Lemma 1, we prove this result by showing that the estimand admits a von Mises expansion where the second-order term is a product of errors.
Proof. This follows from Proposition 1 of Kennedy [3], the definition of $\hat{b}(x)$, and iterated expectation. □