Individualized treatment rules under stochastic treatment cost constraints

Estimation and evaluation of individualized treatment rules have been studied extensively, but real-world treatment resource constraints have received limited attention in existing methods. We investigate a setting in which treatment is intervened upon based on covariates to optimize the mean counterfactual outcome under treatment cost constraints when the treatment cost is random. In a particularly interesting special case, an instrumental variable corresponding to encouragement to treatment is intervened upon with constraints on the proportion receiving treatment. For such settings, we first develop a method to estimate optimal individualized treatment rules. We further construct an asymptotically efficient plug-in estimator of the corresponding average treatment effect relative to a given reference rule.


Introduction
The effect of a treatment often varies across subgroups of the population [38,52]. When such differences are clinically meaningful, it may be beneficial to assign treatments strategically depending on subgroup membership. Such treatment assignment mechanisms are called individualized treatment rules (ITRs). A treatment rule is commonly evaluated on the basis of the mean counterfactual outcome value it generates (often referred to as the treatment rule's value), and an ITR with an optimal value is called an optimal ITR. There is an extensive literature on estimation of optimal ITRs and their corresponding values using data from randomized trials or observational studies [6,21,25,37,54].
Most existing approaches for estimating ITRs do not incorporate real-world resource constraints.
Without such constraints, an optimal ITR would assign the treatment to members of a subgroup provided there is any benefit for such individuals, even when this benefit is minute. In contrast, under treatment resource limits, it may be more advantageous to reserve treatment for subgroups with the greatest benefit from treatment. This issue has received attention in recent work. Luedtke and van der Laan developed methods for estimation and evaluation of optimal ITRs with a constraint on the proportion receiving treatment [20]. Qiu et al. instead considered related problems in settings in which instrumental variables (IVs) are available [34]. In one of the settings they considered, the same resource constraint is imposed as in Luedtke and van der Laan [20] but a binary IV is used to identify optimal ITRs even in settings in which there may be unmeasured confounders. In another setting considered in Qiu et al. [34], the authors considered interventions on a causal IV or encouragement status, and developed methods to estimate individualized encouragement (rather than treatment) rules with a constraint on the proportion receiving both encouragement and treatment [33]. They also developed nonparametrically efficient estimators of the average causal effect of optimal rules relative to a prespecified reference rule. Sun et al. considered a setting in which the cost of treatment is random and dependent upon baseline covariates. They developed methods to estimate optimal ITRs under a constraint on the expected additional treatment cost as compared to control, though inference on the impact of implementing the optimal ITR in the population was not studied [41]. Sun [42] considered a related problem involving the development of optimal ITRs under resource constraints, and established the asymptotic properties of the estimated optimal ITR. Their method appears viable when the class of ITRs is restricted by the user a priori.
In this paper, we study estimation and inference for an optimal rule under two different cost constraints. The first is the same as appearing in Sun et al. [41]. In contrast to earlier work on this setting, we do not constrain the class of ITRs considered and provide a means to obtain inference about the optimal ITR. The second constraint we consider places a cap on the total cost under the rule rather than on the incremental cost relative to control. To our knowledge, the latter problem has not previously been considered in the literature. Both of these estimation problems mirror the intervention-on-encouragement setting considered in Qiu et al. [34] but involve different constraints and a more general cost function.
As in Qiu et al. [34], the estimators that we develop are asymptotically efficient within a nonparametric model and enable the construction of asymptotically valid confidence intervals for the impact of implementing the optimal rule. We develop our estimators using similar tools, such as semiparametric efficiency theory [31,51] and targeted minimum loss-based estimation (TMLE) [45,49], as were used to tackle the related problem studied in Qiu et al. [34]. Consequently, our proposed estimators are similar to those in Qiu et al. [34]. We therefore streamline the presentation by highlighting the key similarities and focusing on the differences between these related problems and estimation schemes.
The rest of this paper is organized as follows. In Section 2, we describe the problem setup, introduce notation, and present the causal estimands along with basic causal conditions. In Section 3, we present additional causal conditions and the corresponding nonparametric identification results. In Section 4, we present our proposed estimators and their theoretical properties. In Section 5, we present a simulation illustrating the performance of our proposed estimators. We make concluding remarks in Section 6. Proofs, technical conditions, and additional simulation results can be found in the Supplementary Material.

Setup and objectives
To facilitate comparisons with Qiu et al. [34], we adopt similar notation as in that work. Suppose that we observe independent and identically distributed data units O 1 , O 2 , . . . , O n ∼ P 0 , where P 0 is an unknown sampling distribution. A prototypical data unit O consists of the quadruplet (W, T, C, Y ), where W ∈ W ⊆ R p is the vector of baseline covariates, T ∈ {0, 1} is the treatment status, C ∈ [0, ∞) is the random treatment cost, and Y ∈ R is the outcome of interest. As a convention, we assume that larger values of Y are preferable. We use V = V (W ) ∈ V to denote a fixed transformation of W upon which we allow treatment decisions to depend. For example, V may be a subset of covariates in W or a summary of W (e.g., BMI as a summary of height and weight). In practice, V may be chosen based on prior knowledge on potential modifiers of the treatment effect as well as the cost of measuring various covariates. We distinguish between V (W ) and W because of their different roles. On the one hand, we will assume that the full covariate W contains all confounders and thus is used to identify causal effects, while V (W ) might not be sufficient for this purpose. On the other hand, some covariates in W may be expensive or difficult to measure in future applications, and thus implementing an optimal ITR based on a subset V (W ) of covariate W may be desirable. In the rest of this paper, we will use the shorthand notation V , V i and v to refer to V (W ), V (W i ) and V (w), respectively. We define an individualized (stochastic) treatment rule (ITR) to be a function ρ : V → [0, 1] that prescribes treatment with probability ρ(v) according to an exogenous source of randomness for an individual with covariate value v. Any stochastic ITR that only takes values in {0, 1} is referred to as a deterministic ITR.
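The notion of a stochastic ITR ρ : V → [0, 1] applied via an exogenous source of randomness can be illustrated with a minimal sketch. The function name `apply_itr` and the toy rules below are ours, not the authors'; the Bernoulli draw plays the role of the exogenous random mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_itr(rho, v, rng):
    """Draw treatment assignments for covariate values v under a stochastic ITR rho.

    rho maps a covariate value to a treatment probability in [0, 1];
    the Bernoulli draw is the exogenous randomness in the rule.
    """
    p = np.asarray([rho(vi) for vi in v])
    return rng.binomial(1, p)

# A deterministic ITR is the special case taking values in {0, 1}.
rho_det = lambda vi: float(vi > 0.5)
# A genuinely stochastic ITR assigning treatment with probability 0.3.
rho_stoch = lambda vi: 0.3

v = np.array([0.2, 0.7, 0.9])
print(apply_itr(rho_det, v, rng))  # [0 1 1]
```

Because the deterministic rule yields assignment probabilities in {0, 1}, its draws are fixed regardless of the random seed, whereas the stochastic rule varies across draws.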
In this work, we adopt the potential outcomes framework [27,39]. For each individual, we use C(t) and Y (t) to denote the potential treatment cost and potential outcome, respectively, corresponding to scenarios in which the individual has treatment status t. We use E to denote an expectation over the counterfactual observations and the exogenous random mechanism defining a rule, and E 0 to denote an expectation over observables alone under sampling from P 0 . We make the usual Stable Unit Treatment Value Assumption (SUTVA).
Condition A1 (Stable Unit Treatment Value Assumption). The counterfactual data unit of one individual is unaffected by the treatment assigned to other individuals, and there is only a single version of the treatment, so that T = t implies that C = C(t) and Y = Y (t).
Remark 1. The ITRs we consider are not truly individualized, because they are based on the value of covariate V rather than each individual's unique potential treatment effects Y (1) − Y (0) and C(1) − C(0).
Nevertheless, depending on the resolution of V , these ITRs can be considerably more individualized than assigning everyone to either treatment or control. In this paper, we adopt the conventional nomenclature and refer to the treatment rules we study as ITRs [see, e.g., 5,7,14,18,19,29,32,40,47,54,55,57].
We define C(ρ) and Y (ρ) to be the counterfactual treatment cost and outcome, respectively, for an ITR ρ under an exogenous random mechanism. We note that if ρ(v) ∈ (0, 1) for an individual with covariate v, an exogenous random mechanism is used to randomly assign treatment with probability ρ(v), and thus C(ρ) and Y (ρ) are random for this given individual. If ρ were implemented in the population, then the population mean outcome would be E [Y (ρ)], where we use E to denote expectation under the true data-generating mechanism involving potential outcomes (C(t), Y (t)) and exogenous randomness in ρ. We consider a generic treatment resource constraint requiring that a convex combination of the population average treatment cost and the population average additional treatment cost compared to control be no greater than a specified constant κ ∈ (0, ∞]. Consequently, an optimal ITR ρ 0 under this constraint is a solution to

maximize E [Y (ρ)] over ITRs ρ subject to αE [C(ρ)] + (1 − α)E [C(ρ) − C(0)] ≤ κ. (1)

Here, α ∈ [0, 1] is also a constant specified by the investigator. Natural choices of α are α = 0, corresponding to a constraint on the population average additional treatment cost compared to control, and α = 1, corresponding to a constraint on the population average treatment cost. The first choice may be preferred when the control treatment corresponds to the current standard of care and a limited budget is available to fund the novel treatment for some patients. The second choice may be more relevant when both treatment and control incur treatment costs.
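The convex-combination cost constraint described above can be sketched numerically. This is purely illustrative: the counterfactual costs C(ρ) and C(0) used below are not observable in practice, and the function name `constrained_cost` is ours.

```python
import numpy as np

def constrained_cost(alpha, c_rho, c0):
    """Convex combination of mean cost under rho and mean incremental cost vs control.

    c_rho: draws of the counterfactual cost C(rho);
    c0:    draws of the counterfactual control cost C(0).
    """
    return alpha * c_rho.mean() + (1 - alpha) * (c_rho - c0).mean()

c_rho = np.array([2.0, 0.5, 1.5, 1.0])
c0 = np.array([0.5, 0.5, 0.5, 0.5])
# alpha = 1: constraint on total cost; alpha = 0: constraint on incremental cost.
print(constrained_cost(1.0, c_rho, c0))  # 1.25
print(constrained_cost(0.0, c_rho, c0))  # 0.75
```

A rule ρ is feasible when this quantity is no greater than the investigator-specified budget κ.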

Remark 2. Our setup is similar to that in Qiu et al. [34] if we view T and C defined here as the instrumental variable/encouragement Z and treatment status A defined in those prior works, respectively.
However, the constraint in our setup is different from the constraint E[ρ(V )C(ρ)] ≤ κ considered previously. In IV settings, the constraint in (1) with α = 1 is useful when assigning treatment always incurs a cost, regardless of whether encouragement is applied, such as in distributing a limited supply of an expensive drug within a health system based on the results of a randomized clinical trial. It is instead useful with α = 0 when no encouragement is present under the standard of care but intervention on the encouragement is of interest when additional treatment resources are available. The constraint considered in Qiu et al. [34,33] was instead useful in cases in which treatment only incurs a cost when paired with encouragement, such as when housing vouchers are used to encourage individuals to live in a certain area.
In the general setting in which T is viewed as treatment status and C as a random treatment cost, the constraint in (1) with α = 0 is identical to that considered in Sun et al. [41]. We refer the readers to these works for a more in-depth discussion of the relation between the current problem setup and IV settings.
To evaluate an optimal ITR ρ 0 , we follow Qiu et al. [34] in considering three types of reference ITRs and develop methods for statistical inference on the difference in the mean counterfactual outcome between ρ 0 and a reference ITR ρ R 0 : V → [0, 1]. The first type of reference ITR considered, denoted by ρ FR (FR=fixed rule), is any fixed ITR that may be specified by the investigator before the study. When α = 0, it is usually most reasonable to consider the rule that always assigns control, namely v → 0, because the constraint in (1) may arise due to limited funding for implementing treatment whereas the standard of care rule is to always assign control. The second type, denoted by ρ RD 0 (RD=random), prescribes treatment completely at random to individuals regardless of their baseline covariates. The probability of prescribing treatment is chosen such that the treatment resource is saturated (i.e., all available resources are used) or all individuals receive treatment, if such a probability exists. Although this ITR has the same interpretation as the corresponding encouragement rule in Qiu et al. [34], its mathematical expression is different due to the different resource constraint. This rule may be of interest if it is known a priori that treatment is harmless. The third type, denoted by ρ TP 0 (TP=true propensity), prescribes treatment according to the true propensity of the treatment implied by the study sampling mechanism P 0 , so that ρ TP 0 equals w → P 0 (T = 1 | W = w). This ITR may be of interest in two settings. In one setting, ρ TP 0 satisfies the treatment resource constraint. The investigator may wish to determine the extent to which the implementation of an optimal ITR would improve upon the standard of care. In the other setting, the treatment resource constraint is newly introduced and the standard of care ITR may lead to overuse of treatment resources.
The investigator may then be interested in whether the implementation of an optimal constrained ITR would result, despite the new resource constraint, in a noninferior mean outcome.

Identification of causal estimands
In this section, we present nonparametric identification results. Though these results are similar to those for individualized encouragement rules in Qiu et al. [34], there are two key differences. First, the form of some of the conditions in Qiu et al. [34] need to be modified to account for the novel resource constraint considered here. Second, two additional conditions are needed to overcome challenges that arise due to this new constraint.
We first introduce notation that will be useful when presenting our identification results and our proposed estimators. For any observed-data distribution P , we define pointwise the conditional mean functions µ C P (t, w) := E P (C | T = t, W = w) and µ Y P (t, w) := E P (Y | T = t, W = w), where we use E P to denote an expectation over observables alone under sampling from P , and their corresponding contrasts due to different treatment status, ∆ C P (w) := µ C P (1, w) − µ C P (0, w) and ∆ Y P (w) := µ Y P (1, w) − µ Y P (0, w). We also define the averages of these contrasts conditional on V , δ C P (v) := E P [∆ C P (W ) | V = v] and δ Y P (v) := E P [∆ Y P (W ) | V = v], and the propensity to receive treatment µ T P (w) := P (T = 1 | W = w).
These quantities play an important role in tackling the problem at hand. Throughout the paper, for ease of notation, if f P is a quantity or operation indexed by distribution P , we may denote f P 0 by f 0 ; for example, we write µ T 0 for µ T P 0 . We next introduce additional causal conditions we will require, namely positivity and unconfoundedness. In one form or another, these conditions commonly appear in the causal inference literature [49], including in the IV literature [1,15,43,53].
Equipped with these conditions, we are able to state a theorem on the nonparametric identification of the mean counterfactual outcomes and average treatment effect (ATE) -these results can be viewed as a corollary of the well-known G-formula [36].
Theorem 1 (Identification of ATE and expected treatment resource expenditure). Provided Conditions A1-A3 hold, E [Y (ρ)] = E 0 [ρ(V )δ Y 0 (V ) + µ Y 0 (0, W )] and E [C(ρ)] = E 0 [ρ(V )δ C 0 (V ) + µ C 0 (0, W )].

In view of Theorem 1, the objective function in (1) can be identified as E 0 [ρ(V )δ Y 0 (V ) + µ Y 0 (0, W )] and, similarly, the constrained expected cost is identified as E 0 [ρ(V )δ C 0 (V )] + αφ 0 , where φ 0 := E 0 [µ C 0 (0, W )]. It follows that the optimization problem (1) is equivalent to

maximize E 0 [ρ(V )δ Y 0 (V )] over ITRs ρ subject to E 0 [ρ(V )δ C 0 (V )] ≤ κ − αφ 0 . (2)

This differs from Equation 3 defining optimal individualized encouragement rules in Qiu et al. [34]. We now present two additional conditions so that (2) is a fractional knapsack problem [9], thereby allowing us to use existing results from the optimization literature. These conditions are similar to those in Sun et al. [41].
Condition A4 (Strictly costlier treatment). There exists a constant C > 0 such that ∆ C 0 (w) > C holds for P 0 -almost every w.
Condition A4 is reasonable if treatment is more expensive than control. When applied to an IV setting as outlined in Remark 2, this condition corresponds to the assumption that the IV is indeed an encouragement to take treatment. This condition is slightly stronger than its counterpart in Sun et al. [41], which only requires that ∆ C 0 ≥ 0. This stronger condition is needed to ensure the asymptotic linearity of our proposed estimator in Section 4.

Condition A5 (Sufficient budget). αφ 0 ≤ κ.

Under Condition A4, it is evident that Condition A5 is reasonable because if αφ 0 > κ, then no ITR satisfies the treatment resource constraint in view of the identification result in Theorem 1. We define the benefit-to-cost ratio ξ 0 := δ Y 0 /δ C 0 and write η 0 := inf{η : E 0 [δ C 0 (V )1(ξ 0 (V ) > η)] ≤ κ − αφ 0 }.

Theorem 2 (Optimal ITR). Under Conditions A1-A5, a solution to (2) is explicitly given by

ρ 0 (v) = (κ − αφ 0 − E 0 [δ C 0 (V )1(ξ 0 (V ) > η 0 )]) / E 0 [δ C 0 (V )1(ξ 0 (V ) = η 0 )] if ξ 0 (v) = η 0 and η 0 > 0; ρ 0 (v) = 1 if ξ 0 (v) > max{η 0 , 0}; ρ 0 (v) = 0 otherwise.

Here, the first case is the boundary case with the randomization probability that saturates the treatment resource.
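The fractional-knapsack structure of (2) can be sketched for the simple special case of a discrete V with known nuisance quantities. The greedy solution below is a standard one; the function and variable names are ours, and we assume a strictly positive cost contrast (Condition A4) and that treatment is withheld from strata with non-positive effect (the truncation at zero).

```python
import numpy as np

def optimal_rule(delta_y, delta_c, pmass, budget):
    """Greedy fractional-knapsack rule over discrete covariate strata.

    delta_y, delta_c: per-stratum conditional effects on outcome and cost
    (delta_c > 0 under the strictly-costlier-treatment condition);
    pmass: probability mass of each stratum; budget: resources available
    for treatment (kappa minus the alpha-weighted control cost).
    Returns per-stratum treatment probabilities.
    """
    xi = delta_y / delta_c            # benefit per unit of expected cost
    order = np.argsort(-xi)           # highest benefit-to-cost first
    rho = np.zeros_like(delta_y, dtype=float)
    remaining = budget
    for j in order:
        if xi[j] <= 0:                # truncation at zero: skip harmful strata
            break
        cost_j = pmass[j] * delta_c[j]
        if cost_j <= remaining:
            rho[j] = 1.0
            remaining -= cost_j
        else:                         # boundary stratum: randomize to saturate budget
            rho[j] = remaining / cost_j
            remaining = 0.0
            break
    return rho

delta_y = np.array([0.4, 0.1, -0.2])
delta_c = np.array([1.0, 1.0, 1.0])
pmass = np.array([0.5, 0.3, 0.2])
print(optimal_rule(delta_y, delta_c, pmass, budget=0.65))  # [1.  0.5 0. ]
```

The first stratum is fully treated, the second is randomized to exhaust the budget exactly, and the harmful third stratum is never treated.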
We also note that the reference ITRs introduced in Section 2 are identified under the above conditions. In particular, it can be shown that ρ RD 0 is the constant rule v → min{(κ − αφ 0 )/E 0 [∆ C 0 (W )], 1} and that ρ TP 0 is identified as w → µ T 0 (w).

Estimating and evaluating optimal individualized treatment rules
In this section, we present an estimator of an optimal ITR ρ 0 and an inferential procedure for its ATE relative to a reference ITR ρ R 0 , where R is any of FR, RD or TP. The proposed procedure is an adaptation of the method first proposed in Qiu et al. [34,33].
We begin by introducing some notation that is useful for defining the estimands. We define the parameter Ψ ρ (P ) := E P [ρ(V )∆ Y P (W )] or Ψ ρ (P ) := E P [ρ(W )∆ Y P (W )] for each ITR ρ and distribution P ∈ M, depending on whether the domain of ρ is V or W. Here, we consider the model M to be locally nonparametric at P 0 [31]. For P ∈ M, the ATE of an optimal ITR ρ P relative to a reference ITR ρ R P equals Ψ R (P ) := Ψ ρ P (P ) − Ψ ρ R P (P ). We are interested in making inference about ψ 0 := Ψ R (P 0 ), where we have suppressed dependence on R in our shorthand notation.

Pathwise differentiability of the ATE
We first present a result regarding the pathwise differentiability of the ATE. Pathwise differentiability of the parameter of interest serves as the foundation for constructing asymptotically efficient estimators of this parameter, based on which an inferential procedure may be developed. Additional technical conditions are required and are provided in Section S1 in the Supplementary Material. For a distribution P ∈ M, a function µ C : {0, 1} × W → R, an ITR ρ, and a decision threshold τ ∈ R, we define pointwise a function o → D(P, ρ, τ, µ C )(o) and set G(P )(o) := D(P, ρ P , τ P , µ C P )(o). One key condition we rely on is the following non-exceptional law assumption.
Under this condition, the true optimal ITR ρ 0 is identical to an indicator function. If all covariates are discrete, then we can plug the empirical estimates into the identification formulae in Theorems 1-2 and show that the resulting estimators of the ATE are asymptotically normal by the delta method, even when Condition B1 does not hold. We do not further pursue this simple case in this paper, and thus rely on the non-exceptional law assumption, namely Condition B1, to account for continuous covariates. We list additional technical conditions in Supplement S1.
We can now provide a formal result describing the pathwise differentiability of the ATE parameter.
We note that the pathwise differentiability of P → Ψ ρ R P (P ) was established in Theorem 3 of Qiu et al. [34] for R ∈ {FR, TP}. The other results can be proven using similar techniques. We provide the proofs of these results in Supplement S4.2. In view of Theorem 3, it follows that the ATE parameter Ψ R is pathwise differentiable at P 0 with nonparametric canonical gradient D R (P 0 ) for R ∈ {FR, RD, TP}.
Remark 3. We have noted similar additional terms related to the resource being used in the canonical gradient of the mean counterfactual outcome or ATE of optimal ITRs under resource constraints, for example, in Luedtke and van der Laan [20] and Qiu et al. [34]. An analogous additional term appears in our problem. Such terms appear to come from solving a fractional knapsack problem with truncation at zero and take the form of a product of (i) the threshold in the solution, and (ii) a term that equals the influence function of the resource being used under the solution when the resource is saturated. We conjecture that such structures generally exist for fractional knapsack problems.

Proposed estimator and asymptotic linearity
We next present our proposed nonparametric procedure for estimating an optimal ITR ρ 0 and the corresponding ATE ψ 0 . We will generally use subscript n to denote an estimator with sample size n, and add a hat to a nuisance function estimator that is targeted toward estimating φ 0 .
1. Use the empirical distribution P W,n of W as an estimate of the true marginal distribution of W .
2. Estimate an optimal ITR: (a) Compute an estimate φ n of φ 0 with a one-step correction estimator. (b) For each candidate budget k, compute plug-in estimates ξ n , η n (k) and γ n of their population counterparts, and define the rule d n,k pointwise: d n,k (v) equals one if ξ n (v) > η n (k), equals the boundary randomization probability if ξ n (v) = η n (k) and γ n (η n (k)) > 0, and equals zero otherwise. The rule d n,k is the sample analog of an ITR that prescribes treatment to those with the highest values of ξ 0 (V ), regardless of whether treatment is harmful or not, until treatment resources run out.
(c) Compute k n , which is used to define an estimate of ρ 0 for which the plug-in estimator is asymptotically linear under conditions, as follows: • if the estimating equation (5) admits a solution in k, then take k n to be this solution; • otherwise, set k n = κ.
3. Obtain an estimate ρ R n of the reference ITR ρ R 0 as follows: • For R = FR, take ρ R n to be ρ FR . • For R = RD, obtain a targeted estimate μ C n of µ C 0 (take μ C n to be the fitted mean model) and use it to compute the sample analog of ρ RD 0 . • For R = TP, take ρ R n to be µ T n .
4. Estimate the ATE of ρ 0 relative to the reference ITR ρ R 0 with a targeted minimum loss-based estimator (TMLE) ψ n : (a) obtain a targeted estimate μ Y n of µ Y 0 : run an ordinary least-squares linear regression of the outcome with the initial estimate of µ Y 0 as an offset, and take μ Y n to be the fitted mean function.
(b) with P n being any distribution with components μ Y n and P W,n , take ψ n := Ψ R (P n ) = (1/n) Σ n i=1 {ρ n (V i ) − ρ R n,i }∆ Y n (W i ), where ρ n := d n,k n , ∆ Y n (w) := μ Y n (1, w) − μ Y n (0, w), and ρ R n,i is defined as ρ R n (W i ) or ρ R n (V i ) depending on the covariate used by the reference ITR.
The above procedure is similar to that proposed in Qiu et al. [34]. One key difference is the use of the refined estimator k n of κ obtained via the estimating equation (5), which is key to ensuring the asymptotic linearity of ψ n . Another difference is that the denominator of ξ n is now δ C n , which is consistent with our different definition of the unit value for solving the fractional knapsack problem (2). Similarly to TMLE for other problems, when C or Y has known bounds (e.g., the closed interval [0, 1]), we may use logistic regression rather than ordinary least-squares to obtain a corresponding targeted estimate that respects the known bounds [12].
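Under our reading of the final step of the procedure, the plug-in ATE estimate is a simple sample average contrasting the estimated optimal rule with the reference rule. The sketch below assumes the fitted quantities are already available as arrays; the function name `plug_in_ate` is ours.

```python
import numpy as np

def plug_in_ate(rho_opt, rho_ref, delta_y_hat):
    """Plug-in ATE of the estimated optimal rule relative to a reference rule.

    rho_opt, rho_ref: rule values evaluated at each observation's covariates;
    delta_y_hat: targeted estimate of the conditional treatment effect on Y.
    """
    return np.mean((rho_opt - rho_ref) * delta_y_hat)

rho_opt = np.array([1.0, 1.0, 0.0, 0.5])  # estimated optimal rule, one value per subject
rho_ref = np.zeros(4)                      # e.g. the fixed reference rule v -> 0
delta_y_hat = np.array([0.3, 0.2, -0.1, 0.4])
print(plug_in_ate(rho_opt, rho_ref, delta_y_hat))  # 0.175
```

With the always-control reference rule, the estimate is simply the average of the rule-weighted targeted effect estimates.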
The above procedure has both similarities and substantial differences compared to the estimation procedure proposed by Sun et al. [41]. The main difference is that our procedure is targeted towards efficient estimation of, and inference about, the ATE ψ 0 of the optimal ITR under a nonparametric model, while Sun et al. [41] focus on estimating the optimal ITR ρ 0 and do not evaluate this optimal ITR. This leads to a key difference between the two procedures when estimating the optimal ITR: we need to solve an estimating equation (5), which is crucial to ensuring that the estimator ψ n is asymptotically linear, while Sun et al. [41] do not. The requirement of solving (5) is related to the nature of the fractional knapsack problem discussed in Remark 3, and we conjecture that such a calibration of the resource used is necessary for general problems of the same nature. Our procedure is also related to methods for conditional average treatment effect estimation, such as that of Kennedy [17] or R-learning as in Nie and Wager [28], which might lead to better estimators of δ Y 0 and δ C 0 . It is also possible to develop multiply robust methods to estimate ξ 0 using influence function techniques. Such methods to estimate ξ 0 are beyond the scope of our paper, whose main focus is on the inference for the ATE. Our theoretical analysis of the estimator only applies to naïve estimators based on outcome regression, but we expect only minor modifications to be required to study these more advanced estimators once their asymptotic behavior is characterized.

Remark 5. In Step 2(a), it is also viable to use other efficient estimators of φ 0 , for example, a targeted minimum loss-based estimator (TMLE). We note that estimating φ 0 is only one component of estimating the optimal ITR ρ 0 . Methods such as TMLE can be preferable to ensure that the estimator respects known bounds on the estimand. However, in our case, such an improvement in estimating φ 0 does not necessarily lead to an improvement in the estimation of ρ 0 .
We now present results on the asymptotic linearity and efficiency of our proposed estimator. We state and discuss the technical conditions required by the theorem below in Supplement S1.
Theorem 4 (Asymptotic linearity of ATE estimator). Let R ∈ {FR, RD, TP}. Under Conditions B1-B12, with the canonical gradient D R (P 0 ) defined in (3) and (4), it holds that

ψ n − ψ 0 = (1/n) Σ n i=1 D R (P 0 )(O i ) + o p (n −1/2 ).

Since ψ n is asymptotically linear with influence function equal to the canonical gradient, ψ n is also asymptotically efficient.
To conduct inference about ψ 0 , we can plug the estimators of the nuisance functions into the expression for D R (P 0 ) to obtain a consistent estimator of D R (P 0 ), and then take the sample variance of the resulting values to obtain a consistent estimator of the asymptotic variance σ 2 0 . The proof of Theorem 4 can be found in Supplements S4.3 and S4.4.
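The Wald interval construction just described can be sketched directly: estimate the influence function values, take their sample variance, and form the interval. The function name `wald_ci` and the toy values are ours.

```python
import numpy as np

def wald_ci(psi_n, infl_values, level_z=1.96):
    """Wald CI for the ATE from estimated influence function values.

    infl_values: the estimated canonical gradient evaluated at each observation;
    their sample variance estimates the asymptotic variance sigma_0^2,
    and the standard error is sqrt(sigma_0^2 / n).
    """
    n = len(infl_values)
    se = np.std(infl_values, ddof=1) / np.sqrt(n)
    return psi_n - level_z * se, psi_n + level_z * se

# Toy influence function values and point estimate.
infl = np.array([0.2, -0.1, 0.05, -0.15, 0.3, -0.3])
lo, hi = wald_ci(0.1, infl)
print(lo, hi)
```

A one-sided 97.5% confidence lower bound, as used in the simulation study, is simply the lower endpoint of this two-sided interval.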

Remark 6. It may be desirable to use cross-fitting [26,56] to estimate an optimal ITR for better finite-sample performance. The asymptotic linearity is maintained by an argument similar to that used to prove Theorem 4. We describe this algorithm in Section S3 in the Supplementary Material.
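The cross-fitting device mentioned in Remark 6 can be sketched generically: each observation's nuisance prediction comes from a model fit on the other folds. The scheme below is a standard one, not the exact algorithm of Section S3; the function names and the toy mean-only learner are ours.

```python
import numpy as np

def cross_fit_predictions(fit, predict, X, y, n_folds=5, seed=1):
    """Generic cross-fitting: predictions for each fold use a model
    fit only on the remaining folds, avoiding own-observation overfitting."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))
    out = np.empty(len(y))
    for k in range(n_folds):
        train = folds != k
        model = fit(X[train], y[train])
        out[folds == k] = predict(model, X[folds == k])
    return out

# Toy nuisance learner: predict the training-fold mean of y.
fit = lambda X, y: y.mean()
predict = lambda model, X: np.full(len(X), model)

X = np.arange(20.0).reshape(-1, 1)
y = np.ones(20)
print(cross_fit_predictions(fit, predict, X, y))  # all ones here
```

In practice, `fit` and `predict` would wrap a flexible learner such as an ensemble, with the out-of-fold predictions fed into the downstream rule and ATE estimators.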

Remark 7. We note that, unlike in Qiu et al. [34], where the bound κ lies in (0, 1] due to the binary nature of treatment status, the methods we propose here do not require knowledge of an upper bound on treatment costs. When such a bound is indeed known (e.g., one), our methods may still be applied as long as all special cases corresponding to κ = ∞ or κ < ∞ in Section 4 are replaced by κ being equal to, or less than, the known bound, respectively.

Simulation setting
In this simulation study, we investigate the performance of our proposed estimator of the ATE of an optimal ITR relative to specified reference ITRs. We focus here on the setting α = 1. This scenario is more difficult than the case α = 0 because it requires the estimation of φ 0 .
We generate data from a model in which the treatment T is an IV and the treatment cost C and outcome Y are both binary. This data-generating mechanism satisfies all causal conditions and has an unobserved confounder between treatment cost and outcome. We first generate a trivariate covariate W . We also simulate an unobserved treatment-outcome confounder U ∼ Bernoulli(0.5) independently of W , and then simulate T , C, and Y conditionally on (W, U ). We introduce U in the data-generating mechanism to emphasize that we do not require assumptions on the joint distribution of treatment cost and outcome conditional on covariates. We consider all three reference ITRs R ∈ {FR, RD, TP}, where we set ρ FR : v → 0. We set κ = 0.68, which is an active constraint with τ 0 > 0 and ρ RD 0 < 1.
The ITRs we consider are based on all covariates; that is, we take V (W ) = W . We estimate the nuisance functions using the Super Learner [48] with a library including logistic regression, a generalized additive model with logit link [13], gradient boosting machines [10,11,23,24], support vector machines [2,8] and neural networks [3,35]. Because none of the nuisance functions follows a logistic regression model, the resulting ensemble learner is not expected to achieve the parametric convergence rate. Since both C and Y are binary, we use logistic regression rather than ordinary least-squares to obtain their corresponding targeted estimates in Section 4.2. We consider sample sizes n ∈ {500, 1000, 4000, 16000}, and run 1000 Monte Carlo repetitions for each sample size. We implement the algorithm that incorporates cross-fitting discussed in Remark 6 and described in Section S3 in the Supplementary Material.
To evaluate the performance of our proposed estimator, we investigate its bias and root mean squared error (RMSE). We also investigate the coverage probability and the width of nominal 95% Wald CIs constructed using influence function-based standard error estimates. We further investigate the probability that our confidence lower limit falls below the true ATE, that is, the coverage probability of the 97.5% Wald confidence lower bound.

Simulation results

Table 1 presents the performance of our proposed estimator in this simulation. For sample sizes 500, 1000 and 4000, the CI coverage of our proposed method is lower than the nominal coverage of 95%. When the sample size is larger (16000), the CI coverage of our proposed method increases to 90-93%. The coverage of the confidence lower bounds is much closer to nominal (97.5%) for all sample sizes considered. The bias and RMSE of our proposed estimator appear to converge to zero faster than, and at the same rate as, the square root of the sample size, respectively. All biases are negative, which is expected in view of Remark 6. All standard errors underestimate the variation of the estimator, with the extent of underestimation decreasing as sample size increases. Theorem 4 indicates that the CI width should shrink at a root-n rate, and our simulation results are consistent with this. There are some outlying cases of extremely wide or narrow CIs. This is expected for small sample sizes because the estimator of σ 2 0 in Theorem 4 resembles a sample mean and might not be close to σ 2 0 with high probability when the sample size is small. In practice, this issue might be slightly mitigated by fine-tuning the involved machine learning algorithms.
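The Monte Carlo performance metrics used here (bias, RMSE, and Wald CI coverage) can be computed with a short helper. This is a generic sketch, not the authors' evaluation code; the function name `mc_metrics` and the toy repetitions are ours.

```python
import numpy as np

def mc_metrics(estimates, ses, truth, z=1.96):
    """Bias, RMSE, and nominal 95% Wald CI coverage across Monte Carlo repetitions."""
    estimates = np.asarray(estimates)
    ses = np.asarray(ses)
    bias = np.mean(estimates - truth)
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    cover = np.mean((estimates - z * ses <= truth) & (truth <= estimates + z * ses))
    return bias, rmse, cover

# Toy repetitions: unbiased normal estimates with correctly estimated SEs.
rng = np.random.default_rng(2)
truth = 0.5
est = truth + rng.normal(0, 0.05, size=1000)
ses = np.full(1000, 0.05)
bias, rmse, cover = mc_metrics(est, ses, truth)
print(cover)  # close to 0.95
```

Underestimated standard errors, as observed in the simulation, would show up here as coverage falling below the nominal level.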
As indicated in Theorem 4, theoretical guarantees for the validity of the Wald CIs rely on the nuisance function estimators converging to the truth sufficiently quickly. It appears that the undercoverage of our Wald CI in small samples may owe, in part, to poor estimation of these nuisance functions at small sample sizes. To illustrate how our procedure may perform with improved small-sample nuisance function estimation, we present in Section S5 of the Supplementary Material a simulation under an idealized setting. This motivates seeking ways to optimize the finite-sample performance of the nuisance function estimators employed in future applications of the proposed method, possibly based on prior subject-matter expertise.
The underestimation of standard errors in this simulation also motivates future work exploring whether there are standard error estimators with better finite-sample performance, for example, estimators based on the bootstrap.

Conclusion
There is an extensive literature on estimating optimal ITRs and evaluating their performance. Among these works, only a few have incorporated treatment resource constraints. In this paper, we build upon Sun et al. [41] and study the problem of estimating optimal ITRs under treatment cost constraints when the treatment cost is random. Using techniques similar to those in Qiu et al. [34], we have proposed novel methods to estimate an optimal ITR and draw inference on the corresponding average treatment effect relative to a prespecified reference ITR, under a locally nonparametric model. Our methods may also be applied to the instrumental variable (IV) settings studied in Qiu et al. [34] when the IV is intervened on.
Supplementary Material

This Supplementary Material is organized as follows. Section S1 contains technical conditions to ensure that the statistical parameter of interest, the average treatment effect, is pathwise differentiable and that our proposed estimator is asymptotically efficient. We discuss a particular technical condition that may be difficult to verify in Section S2. In Section S3, we describe a modified version of our proposed estimator with improved performance in small to moderate samples. We present proofs of theoretical results in Section S4. In Section S5, we present the results of a simulation under an idealized setting.
These results may provide guidance on interpreting the simulation results in Section 5.
As noted in the main text, the methods proposed in this work build upon tools used in Qiu et al. [34]; as such, the technical details involved bear similarity. To orient readers and facilitate comparisons, we have organized the supplementary materials for these papers similarly and shared portions of the technical details when appropriate.

S1 Technical conditions for pathwise differentiability of parameter and asymptotic linearity of proposed estimator

In this section, we list the additional technical conditions required by Theorems 3 and 4 in Section 4 that were omitted from the main text.

Condition B2 (Nonzero continuous density of ξ 0 (V ) around η 0 ). If η 0 > −∞, then the distribution of ξ 0 (V ) has a positive, finite and continuous Lebesgue density in a neighborhood of η 0 .
Condition B2 is most plausible when covariates are continuous: in that case, it is also plausible for the distribution of ξ 0 (V ) to be continuous, so that Condition B2 holds.
Condition B3 (Smooth treatment cost function or lack of constraint). If η 0 > −∞, then the function

Condition B3 requires different conditions in separate cases. There are three cases in terms of the sufficiency of the budget to treat every individual: (i) the budget is infinite and no constraint is present (κ = ∞); (ii) the budget is insufficient (η 0 > −∞); and (iii) the budget is finite but sufficient (η 0 = −∞ and κ < ∞). Condition B3 makes no assumption in Case (i). In Case (ii), the smoothness requirement on the treatment cost function applies; in Case (iii), we require that the budget has a surplus. When it is unknown a priori whether the budget is sufficient to treat every individual, namely in Case (ii) or (iii), it is highly unlikely that the budget exactly suffices with no surplus. Therefore, Condition B3 is mild.
Condition B5 requires that, when the rule ρ RD that assigns treatment completely at random while respecting the budget constraint is the reference rule of interest, it should not coincide with the trivial rule v → 1 that assigns treatment to every individual. The rule ρ RD equals v → 1 only when the budget is sufficient to treat every individual. Since ρ RD , as a reference rule distinct from a given fixed rule ρ FR , is only interesting when the budget constraint is active, Condition B5 often holds automatically.
Condition B6 holds if all of the above nuisance estimators converge at a rate faster than n −1/4 , which may be much slower than the parametric rate n −1/2 and thus allows for the use of flexible nonparametric estimators. This condition also holds if µ Y n , μ Y n , µ C n and μ C n each converge more slowly than n −1/4 , as long as the estimated propensity score µ T n converges sufficiently fast to compensate.
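The rate trade-off just described can be illustrated with a toy exponent check: under the stylized assumption that each nuisance estimator's L2 error decays like n to the power −a, a product of two such errors is o(n −1/2 ) exactly when the exponents sum to more than 1/2. The helper below is purely illustrative and not part of the formal condition.

```python
def b6_rate_tradeoff_ok(a_first: float, a_second: float) -> bool:
    """Stylized check of the product-rate arithmetic: with L2 errors of
    order n^{-a} and n^{-b}, the product n^{-(a+b)} is o(n^{-1/2})
    if and only if a + b > 1/2. Illustration only."""
    return a_first + a_second > 0.5
```

For example, two estimators sitting exactly at the n −1/4 boundary do not quite suffice, while a slow n −1/5 regression estimator can be compensated by an n −2/5 propensity score estimator.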
Condition B7 (Consistency of estimated influence function). The following terms are all o p (1):

Condition B8 (Consistency of strong positivity). With probability tending to one over the sample used to obtain µ T n , it holds that ∫ I{ε T < µ T n (w) < 1 − ε T } dP 0 (w) = 1.
Condition B9 (Consistency of strictly more costly treatment). With probability tending to one over the sample used to obtain ∆ C n and δ C n , it holds that ∫ I(∆ C n (w) > δ C ) dP 0 (w) = 1 and

Condition B10 (Fast rate of estimated optimal ITR). As the sample size n tends to infinity, it holds that

Condition B10 may, at first sight, appear difficult to verify; it is discussed in detail in Section S2.
As shown in Theorem S1 of Section S2, Condition B10 may require faster rates on nuisance estimators than Condition B6. For example, convergence in the L 2 -sense at a rate o p (n −1/4 ) is sufficient for Condition B6, but a rate o p (n −3/8 ) is needed in order to use Theorem S1 to show that Condition B10 holds.
The Donsker condition B11 and the Glivenko-Cantelli condition B12 impose restrictions on the flexibility of the methods used to estimate nuisance functions. We refer readers to, for example, van der Vaart and Wellner [50], for a more thorough introduction to such conditions.
All of the above conditions are similar to those in Qiu et al. [34], except that Conditions B4 and B9 are new to this paper: in that earlier work, the assumption of a strictly more costly treatment was not needed, and a boundedness condition similar to B4 was automatically satisfied with a binary cost.
S2 Sufficient condition for fast convergence rate of estimated optimal rule

Condition B10, which is required by Theorem 4, may seem unintuitive and difficult to verify. In Theorem S1 below, we present sufficient conditions for Condition B10 that are similar to those in Qiu et al. [34].
Throughout the rest of the Supplement, for two quantities a, b ∈ R, we use a ≲ b to denote a ≤ Cb for some constant C > 0 that may depend on P 0 .
Further assume that each of o → I(ξ n (v) > η n ) and o → I(ξ n (v) > η n )δ C 0 (v) belongs to a (possibly different) fixed P 0 -Donsker class with probability tending to 1. Suppose also that the distribution of ξ 0 (V ) (V ∼ P 0 ) has nonzero finite continuous Lebesgue density in a neighborhood of η 0 and a neighborhood of τ 0 . Under Condition B4, the following statements hold.
The proof of Theorem S1 is very similar to that of Theorem 5 in Qiu et al. [34] and can be found in Section S4.5.

S3 Modified procedure with cross-fitting
In this section, we describe our proposed procedure to estimate the ATE with cross-fitting, which is mentioned in Remark 6. We use Λ to denote a user-specified fixed number of folds to split the data.
Common choices of Λ used in practice include 5, 10 and 20.
1. Use the empirical distributionP W,n of W as an estimate of the true marginal distribution of W .
2. Compute estimates µ Y n , µ C n and µ T n of µ Y 0 , µ C 0 and µ T 0 , respectively, using flexible regression methods.
(e) Compute k n , which is used to define an estimate of ρ 0 for which the plug-in estimator is asymptotically linear.
• If τ n (κ) > 0 and there is a solution in k ∈ [0, ∞) to (5), then take k n to be this solution.
3. Obtain an estimate ρ R n of the reference ITR ρ R 0 as follows: • For R = FR, take ρ R n to be ρ FR .
• For R = RD, (a) obtain a targeted estimate μ C n of µ C 0 : run an ordinary least-squares linear regression using observations i = 1, 2, . . . , n with outcome C i , offset µ C n (T i , W i ), no intercept and covariate . Take μ C n to be the fitted mean function; (b) take ρ R n to be the constant function o → min{1, (κ − αφ n )/P W,n ∆ C n }, where we define pointwise ∆ C n : w → μ C n (1, w) − μ C n (0, w).
• For R = TP, take ρ R n to be µ T n .
4. Estimate ATE of ρ 0 relative to the reference ITR ρ R 0 with a targeted minimum-loss based estimator (TMLE) ψ n : (a) obtain a targeted estimateμ Y n of µ Y 0 : run an ordinary least-squares linear regression using observations i = 1, 2, . . . , n with outcome Y i , offset µ Y n (T i , W i ), no intercept and covariate . Takeμ Y n to be the fitted mean function.
(b) withP n being any distribution with componentsμ Y n andP W,n , set ψ n := 1
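To make the thresholding logic behind k n and η n concrete, the sketch below finds, by bisection, the smallest treatment-benefit threshold at which the empirical expected spending of the rule v → I(ξ(v) > η) stays within a given budget. The function name and the budget bookkeeping are illustrative assumptions (the `budget` argument plays the role of κ − αφ n ), not the authors' implementation.

```python
import numpy as np

def solve_budget_threshold(xi, delta_c, budget, tol=1e-8):
    """Bisection sketch: find the smallest threshold eta such that the rule
    v -> I(xi(v) > eta) spends at most `budget` in empirical expectation,
    where delta_c holds per-unit incremental costs. Hypothetical helper."""
    def spend(eta):
        return np.mean((xi > eta) * delta_c)
    lo, hi = xi.min() - 1.0, xi.max() + 1.0
    if spend(lo) <= budget:  # budget suffices to treat everyone: threshold -inf
        return -np.inf
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spend(mid) > budget:  # still over budget: raise the threshold
            lo = mid
        else:
            hi = mid
    return hi
```

With four individuals of benefit-to-cost ratios 1 through 4 and unit incremental costs, a budget of 0.5 yields a threshold of 2, treating only the two individuals with the greatest ratios.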

S4 Proofs of theoretical results

S4.1 Identification results (Theorems 1 and 2)
Theorem 1 is a simple corollary of the standard G-formula [36]. We provide a complete proof below.
Proof of Theorem 1. Note that The results for the treatment cost can be proved similarly.
We next prove Theorem 2.

Proof of Theorem 2. Let ρ be any ITR that satisfies the constraint that E
, implying that ρ 0 is a solution to (2).
Observe that

Combining this observation with the fact that τ 0 ≥ 0, the above shows the desired inequality. Therefore, we conclude that ρ 0 is a solution to (2).

S4.2 Pathwise differentiability of ATE parameter (Theorem 3)
We follow the existing literature on semiparametric efficiency theory closely to prove pathwise differentiability of our estimands and asymptotic efficiency of our estimators under nonparametric models. We refer readers to, for example, Pfanzagl [30,31] and Bolthausen et al. [4] for a more thorough introduction to semiparametric efficiency.
To derive the canonical gradients of the ATE parameters, let H ⊆ L 2 0 (P 0 ) be the set of score functions with range contained in [−1, 1]; we study the behavior of the parameters under perturbations in an arbitrary direction H ∈ H. We note that the L 2 0 (P 0 )-closure of H is indeed L 2 0 (P 0 ).
We define the perturbed distribution P H,ε via its Radon-Nikodym derivative with respect to P 0 , for any ε in a sufficiently small neighborhood of 0 such that the right-hand side is positive for all o ∈ W × {0, 1} × {0, 1} × R. It is straightforward to verify that the score function of ε → P H,ε at ε = 0 is indeed H.
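A standard example of such a one-dimensional submodel through P 0 with score H, commonly used in this literature and consistent with the positivity requirement above (though the paper's exact choice is not shown in this excerpt), is the linear path

```latex
\frac{\mathrm{d}P_{H,\epsilon}}{\mathrm{d}P_0}(o) \;=\; 1 + \epsilon H(o),
\qquad |\epsilon| < 1,
```

which is a valid density ratio because H has range contained in [−1, 1] and mean zero under P 0 , and whose score (∂/∂ε) log dP H,ε (o) at ε = 0 equals H(o).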
For the rest of this section, we may drop H from the notation and use P ε as a shorthand for P H,ε when no confusion should arise.
We will see that each parameter evaluated at P ε depends in a clean way on the following marginal or conditional distributions: the marginal distribution P W,ε of W , the marginal distribution P T,W,ε of (T, W ), the conditional distribution P T,ε of T given W , the conditional distribution P C,ε of C given (T, W ), and the conditional distribution P Y,ε of Y given (T, W ). We now derive their closed-form expressions. Let

We finally introduce some additional notation that is used throughout the rest of this section. We use C to denote a generic positive constant that may vary line by line. Let S 0 be the survival function of the distribution of ξ 0 (V ) when V ∼ P 0 . We also use the notation defined in Section S2. For a generic function f : R → R, we will use the big- and little-oh notations, namely O(f (ε)) and o(f (ε)), respectively, to denote the behavior of f (ε) as ε → 0. Finally, for a general function or quantity f P that depends on a distribution P , we use f ε to denote f P ε . For example, we may write µ Y,ε as a shorthand for µ Y P ε . We will also write expectations under P ε as E ε .
The derivation of the canonical gradients of P → Ψ ρ FR (P ) can be found in the Supplement of Qiu et al. [34]. We now derive the canonical gradients of P → Ψ ρ TP P (P ), P → Ψ ρ RD P (P ) and P → Ψ ρ P (P ), which are different from the parameters in Qiu et al. [34].

S4.2.1 Canonical gradient of P → Ψ ρ TP P (P )
Fix a score H ∈ H. Note that, for all P ∈ M, Ψ ρ TP P (P ) = ∫ µ T P (w)∆ Y P (w)P W (dw). Combining this, (S3) and the chain rule yields that

and E 0 [H W (W )] = 0. Therefore, the canonical gradient of P → Ψ ρ TP P (P ) at P 0 is G TP (P 0 ).

S4.2.2 Canonical gradient of P → Ψ ρ RD P (P )
Let H be a score function in H. We aim to show that

which shows that P → Ψ ρ RD P (P ) is pathwise differentiable with canonical gradient G RD (P 0 ) at P 0 .
Consequently, for each ε in this neighborhood, Ψ ρ RD (P ε ) = [(κ − αP ε µ C ε (0, ·))/(P ε ∆ C ε )] Ψ v→1 (P ε ). It follows that the derivative (d/dε) Ψ ρ RD (P ε )| ε=0 is the same as the derivative of f : ε → [(κ − αP ε µ C ε (0, ·))/(P ε ∆ C ε )] Ψ v→1 (P ε ) at ε = 0, provided this derivative exists. Noting that v → 1 is a particular instance of a fixed treatment rule, we may take ρ FR to be v → 1 in the results on pathwise differentiability of P → Ψ ρ FR (P ) and show that

As both the above derivative and the derivatives in (S5) and (S6) exist, by the chain rule, it follows that

Note that φ P = P µ C P (0, ·). Plugging (S5), (S6) and (S7) into the above, we can show that the right-hand side of the above is equal to the right-hand side of (S4).
Thus, we have shown that (S4) holds, and the desired result follows.

S4.2.3 Canonical gradient of P → Ψ ρ P (P )
Let H be a score function in H. The argument that we use parallels that of Luedtke and van der Laan [20] and Qiu et al. [34], except that it is slightly modified to account for the fact that the resource constraint takes a different form in this paper.
We first note that all of the following hold for all ε sufficiently close to zero:

The derivations of these inequalities are straightforward and hence omitted. Under Condition A4, the above inequalities imply that

For ε sufficiently close to zero, it will be useful to define, for η ∈ [−∞, ∞),

We also define Γ ε ′ : η → (d/ds) Γ ε (s)| s=η when the derivative exists. We first show two lemmas. These two lemmas show that, under a perturbed distribution P ε with magnitude ε, the fluctuation τ ε − τ 0 in the threshold is of order ε. This result is crucial in quantifying the convergence rate of two terms in the expansion of Ψ ρ ε (P ε ) − Ψ ρ 0 (P 0 ), namely terms 1 and 3 in (S12) below. In particular, term 1 is the main challenge in the analysis, as it comes from the perturbation in the threshold and is unique to estimation problems involving the evaluation of optimal ITRs. The first lemma studies the convergence of η ε to η 0 . Because it may be the case that η 0 = −∞, the convergence stated in this result is convergence in the extended real line.
The next lemma establishes a rate of convergence of τ ε to τ 0 as ε → 0.
A similar argument, based on the observation that Γ ε (η ε + |ε|) + αφ ε ≤ κ, can be used to show that there exists an O(ε) sequence that bounds η ε − η 0 from below. Combining these two bounds shows that η ε − η 0 = O(ε), as desired. This concludes the proof.
Our derivation of the canonical gradient is based on the following decomposition:

We separately study each of the six terms on the right-hand side, which we refer to as terms 1 through 6.
Study of term 1 in (S12): We will show that this term is o(ε). By Lemma S2 and (S9),

Under Condition B2, P 0 {ξ 0 (V ) = τ 0 } = 0. We apply a similar argument as that used to prove Lemma 2 in van der Laan and Luedtke [46]: because ρ ε (v) ≠ ρ 0 (v) implies that either (i) ξ ε (v) − τ ε and ξ 0 (v) − τ 0 have different signs or (ii) only one of these quantities is zero, the display continues as

Using the facts that inf v δ C 0 (v) > 0 by Condition A4, that sup v δ C 0 (v) ≤ 1 since probabilities are no more than one, and that ξ 0 (v) := δ Y 0 (v)/δ C 0 (v), the display continues as

Leveraging the bound on |ξ 0 (v) − τ 0 | that appears in the indicator function, we see that

where the final equality holds by Condition B1. The integral in the final expression is o(1), and so this expression is o(ε).
Study of term 2 in (S12): By the result on the pathwise differentiability of P → Ψ ρ FR (P ), setting ρ FR to be ρ 0 , we see that the second term satisfies

Study of term 3 in (S12): We will show that the third term is identically zero for any ε that is sufficiently close to zero. If τ 0 = 0, then this term is trivially zero. Otherwise, τ 0 = η 0 > 0. Lemma S1 shows that, in this case, η ε > 0 for ε sufficiently close to zero. Hence,

Consequently, term 3 equals zero for all ε sufficiently close to zero.
Study of term 4 in (S12): We will show that this term can be written as

for an appropriately defined G 4 ∈ L 2 0 (P 0 ) that does not depend on H. Note that there exists a function

The function H W can be chosen so that the above o(ε) term exhibits little-oh behavior uniformly over w and v. By the definition of H C from (S3), we see that

where the little-oh terms are uniform over w and v. Hence,
As the relevant equality holds P 0 -almost surely, the display continues as

As a consequence, term 4 satisfies

Study of term 5 in (S12): By (S3) and the fact that P 0 {δ C 0 ρ 0 } = κ − αφ 0 whenever τ 0 > 0, we see that

It also holds that −τ 0 (P ε − P 0 ){δ C 0 ρ 0 } =

Study of term 6 in (S12): We have shown that

Conclusion of the derivation of the canonical gradient of P → Ψ ρ P (P ): Combining our results regarding the six terms in (S12), we see that

Dividing both sides by ε ≠ 0 and taking the limit as ε → 0, we see that G 2 + G 4 + G 5 + G 6 = G(P 0 ) is the canonical gradient of P → Ψ ρ P (P ) at P 0 .

S4.3 Expansions based on gradients or pseudo-gradients
In this section, we present (approximate) first-order expansions of the ATE parameters, based on which we construct our proposed targeted minimum-loss based estimators (TMLE) and prove their asymptotic linearity. We refer readers to Supplement S5 of Qiu et al. [34] for an overview of TMLE based on gradients and pseudo-gradients. The overall idea behind TMLE based on gradients is the following: the empirical mean of the gradient at the estimated distribution can be viewed as the first-order bias of the plug-in estimator; this bias can be removed by solving the estimating equation that equates the first-order bias to zero. The idea behind pseudo-gradients is similar, except that the gradient is replaced by an approximation, which we term a pseudo-gradient, so that the corresponding estimating equation is easy to solve with a single regression step.
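The single regression step described above can be sketched numerically: a no-intercept least-squares fluctuation with the initial fit as an offset solves the estimating equation that sets the empirical mean of the relevant gradient component to zero. This mirrors the OLS targeting in Step 4a of Section S3, but the helper below is a generic sketch, not the authors' exact specification; in particular, the "clever covariate" is left abstract.

```python
import numpy as np

def target_offset_ols(y, offset, clever):
    """One-dimensional targeting (fluctuation) step: no-intercept least
    squares of y on the clever covariate with the initial fit as offset.
    Generic sketch; the clever covariate is an unspecified placeholder."""
    resid = y - offset
    eps = np.sum(clever * resid) / np.sum(clever ** 2)  # closed-form OLS slope
    fitted = offset + eps * clever
    # By construction, sum(clever * (y - fitted)) = 0: the empirical mean of
    # the targeted gradient component is set to zero.
    return fitted, eps
```

After this step, the plug-in estimator built from the fluctuated fit no longer carries the first-order bias associated with this gradient component.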
For Ψ ρ FR and Ψ ρ TP P , it is straightforward to show that the following expansions hold:

For P → Ψ ρ RD P (P ), we expand this parameter sequentially as follows:

For P → Ψ ρ P (P ), straightforward but tedious calculation shows that the following expansion holds:

S4.4 Asymptotic linearity of proposed estimator (Theorem 4)
For convenience, we set P n to have components µ T n and μ C n , even though the plug-in estimator does not explicitly involve these functions. We start with some lemmas that facilitate the proof of the main theorem. In this section, we define η n := η n (k n ) and τ n := τ n (k n ) to simplify notation.
Our proof is centered around the expansions in Supplement S4.3. We first prove a few lemmas.
Lemma S3 is a standard asymptotic linearity result for the estimators φ n and P n ∆ C n of the treatment resources used by the constant ITRs v → 0 and v → 1, respectively; Lemma S4 is a technically convenient tool to convert the conditions on norms in Condition B6 between functions; Lemmas S5-S7 are analysis results for estimators, analogous to Lemmas S1-S2 for deterministic perturbations of P 0 , and they lead to the crucial Lemma S8 on the negligibility of the remainder R ρ (P n , P 0 ) for an arbitrary ITR ρ.
Lemma S3 (Asymptotic linearity of φ n and P n ∆ C n ). Under the conditions of Theorem 4,

This result follows from the facts that (i) φ n is a one-step correction estimator of φ 0 [30], and (ii) an analogous property holds for P n ∆ C n [44,49]. Therefore the proof is omitted.
The following Lemmas S5-S7 prove consistency of the estimated thresholds used to define the estimated optimal ITR ρ n .
Lemma S6 (Consistency of η n (κ)). Under Conditions B2, B3 and B12, η n (κ) → p η 0 .

This lemma is a stochastic variant of the deterministic result in Lemma S1 and has a similar proof. Therefore, the arguments are slightly abbreviated here.
First consider the case where η 0 > −∞. We start by showing that, for any η sufficiently close to η 0 , it holds that Γ n (η) − Γ 0 (η) = o p (1). Fix an η in a neighborhood of η 0 . By the triangle inequality,

We will show that the right-hand side is o p (1). By Condition B12, the third term on the right is o p (1) for η sufficiently close to η 0 . Moreover, because the second term is no greater than ‖∆ C n − ∆ C 0 ‖ 1,P 0 , Condition B12 implies that this second term is also o p (1). We will now argue that the first term is o p (1). By Lemma S5 and Condition B4, for any ε > 0,

where the final relation follows from Markov's inequality. We next show that the last line is o p (1).
Proof of i from Lemma S7. Our strategy for showing the existence of a solution to (5) is as follows. First, we show that the left-hand side of (5) consistently estimates the treatment resource being used uniformly over rules {d n,k : k ∈ [0, ∞]}. Next, we show that the left-hand side of (5) is a continuous function in k that takes different signs at k = 0 and k = ∞ with probability tending to one.
We rely on the fact that, for fixed d n,k , P n f n,k is a one-step estimator of P 0 {d n,k ∆ C 0 }. Note that Conditions B7 and B11 along with Lemma S4 imply that the first term on the right-hand side is O p (n −1/2 ).
For the second term, we note the fact that d n,k (V (w)) ∈ [0, 1]. Hence, the second term is o p (n −1/2 ) by Condition B6. The third term is also o p (n −1/2 ) by an almost identical argument. Combining the previous two displays shows that (S14) holds.
Therefore, P n f n,0 + αφ n < κ with probability tending to one. Applying this result at k = ∞ shows that

Combining this fact with the fact that P 0 ∆ C 0 + αφ 0 > κ whenever η 0 > −∞ shows that P n f n,1 + αφ n > κ with probability tending to one. Combining these results at k = 0 and k = ∞ with the fact that k → P n f n,k is a continuous function shows that, with probability tending to one, there exists a k n ∈ [0, ∞) such that P n f n,k n = κ − αφ n .
Lemma S6 then implies η n = η n (k n ) with probability tending to 1.
Proof of iii from Lemma S7 when η 0 > 0. In this proof, we use P n 0 to denote probability statements over the draws of O 1 , . . . , O n . Fix ε > 0. We will argue by contradiction to show that P n 0 {η n ≥ η 0 + ε} → 0 and P n 0 {η n ≤ η 0 − ε} → 0 as n → ∞, implying the consistency of η n . The consistency of τ n then follows.
We study these two events separately. First, suppose that lim sup n P n 0 {η n ≥ η 0 + ε} > 0. (S15) Then there exists δ > 0 such that, for all n in an infinite sequence N ⊆ N, the probability P n 0 {η n ≥ η 0 + ε} is at least δ. Consequently, for any n ∈ N , the following holds with probability at least δ: (S16)

We now show that the first term is o p (1). For any x > 0 and n ∈ N, by Lemma S5 and Condition B4,

Similarly to the proof of Lemma S6, the fact that ‖ξ n − ξ 0 ‖ 1,P 0 = o p (1) (Condition B12) ensures that the remaining term is bounded by a constant. Because (S16) holds with probability at least δ > 0 for infinitely many n, this yields a contradiction with our result from part ii of this lemma. Therefore, (S15) is false, that is, lim sup n P n 0 {η n ≥ η 0 + ε} = 0. Now assume that, for some ε > 0, lim sup n P n 0 {η n ≤ η 0 − ε} > 0. Then there exists δ > 0 such that, for all n in an infinite sequence N ⊆ N, P n 0 {η n ≤ η 0 − ε} ≥ δ. Now, for any n ∈ N , the following holds with probability at least δ:

The rest of the argument is almost identical to the contradiction argument for the previous event, and is therefore omitted.
Proof of iii from Lemma S7 when η 0 = 0. If η 0 = 0, then the construction of η n implies that η n takes values from two sequences: η n (κ) and η n (k n ) where k n is a solution to (5). By Lemma S6, η n (κ) is consistent for η 0 . When a solution to (5) exists and equals k n , the proof of part iii from Lemma S7 when η 0 > 0 shows that η n (k n ) is consistent for η 0 and the desired result follows.
Proof of Lemma S8. By the boundedness of the range of ρ, we see that

Using Condition B8 and Lemma S4, the display continues as

The right-hand side is o p (n −1/2 ) by Condition B6.
We next prove Theorem 4.
Proof of Theorem 4. By the expansion of P → Ψ ρ P (P ) presented in Section S4.3,

= (P n − P 0 )D(P 0 , ρ 0 , τ 0 , µ C 0 ) − P n D(P n , ρ n , τ 0 , µ C n ) + (P n − P 0 ){D(P n , ρ n , τ 0 , µ C n ) − D(P 0 , ρ 0 , τ 0 , µ C 0 )}

Similarly,

First, we note the following facts, which will be sufficient to ensure that the remainders and empirical process terms in all of the first-order expansions given above are o p (n −1/2 ). By Condition B6, Lemmas S4 and S8, the Cauchy-Schwarz inequality, and boundedness of the range of an ITR, the following terms are all o p (n −1/2 ):

Moreover, by Condition B10,

Therefore, all relevant remainders and empirical process terms are o p (n −1/2 ).
We separately study the three cases where R = FR, R = RD, and R = TP.
Case I: R = FR. It holds that

where the last step follows from the TMLE construction of P n (Step 4a of our estimator), which implies

We now show that the second term on the right-hand side is zero with probability tending to one. If τ 0 = 0, then this term is zero. Otherwise, τ 0 = η 0 > 0. By Lemma S7, the following holds with probability tending to one:

and hence the second term is zero with probability tending to one, as desired. Therefore, ψ n − ψ 0 = (P n − P 0 )D FR (P 0 ) + o p (n −1/2 ).
Case II: R = RD. It holds that

where we have used ρ RD n and ρ RD 0 to denote the (constant) values that the two functions take, respectively. The TMLE construction of P n (Step 4a of our estimator) implies that

and hence

which is zero with probability tending to one, as proved above. By Condition B5, Lemma S3 and the delta method for influence functions, the value that ρ RD n takes is an asymptotically linear estimator of the value that ρ RD 0 takes. A straightforward application of the delta method for influence functions then implies that

Case III: R = TP. It holds that ψ n − ψ 0 = (P n − P 0 )D TP (P 0 ) − P n D n,TP + o p (n −1/2 ).
The TMLE construction of P n (Step 4a of our estimator) implies that

which is zero with probability tending to one, as proved above. Therefore, ψ n − ψ 0 = (P n − P 0 )D TP (P 0 ) + o p (n −1/2 ).

Conclusion:
The asymptotic linearity result for ψ n follows from the above results. The asymptotic normality result for ψ n then holds by the central limit theorem and Slutsky's theorem.

S4.5 Proof of Theorem S1
In this section, we prove Theorem S1. The arguments are almost identical to those in Supplement S9 of Qiu et al. [34], with adaptations to the different treatment resource constraint.
The condition that P 0 I(ξ n = η n ) = O p (n −1/2 ) is reasonable if ξ n (V ) has a continuous distribution when V ∼ P 0 , in which case P 0 I(ξ n = η n ) = 0.
We first study the case where η 0 > 0. By Lemma S7, with probability tending to one, η n = η n (k n ), where k n is a solution to (5), and

We argue conditionally on the event that k n is a solution to (5).
By a Taylor expansion of Γ 0 under Conditions B2, B3 and A4, the left-hand side is equal to −C(η n − η 0 ) + o p (η n − η 0 ) for some C > 0, yielding that

which immediately implies that

The rest of the proof for this case and the proof for the other two cases are identical to the proof of Lemma S14 in Qiu et al. [34]. We present the argument below for completeness. By Lemma S5 and Condition B4, for any ε n > 0 it holds that

Using the continuity of the Lebesgue density of the distribution of ξ 0 (V ), V ∼ P 0 , around η 0 , which is valid under Condition B2 provided ε n is sufficiently small, it follows that

Here we recall that (S 0 )′(η 0 ) is finite by Condition B2. Returning to (S17),

If ‖ξ n − ξ 0 ‖ q,P 0 = o p (1) for some 0 < q < ∞, by Markov's inequality, P 0 I(|ξ n − ξ 0 | > ε n ) ≤ ‖ξ n − ξ 0 ‖ q q,P 0 / ε n q .
We finally study the case where η 0 = 0. We argue conditionally on the event that a solution k n to (5) exists, which happens with probability tending to one by Lemma S7. Recall that for convenience we let k n = k n when η n (κ) > 0. Then exactly one of the following two events happens: (i) η n (κ) ≤ 0 or η n (k n ) ≤ 0, in which case τ n = 0 = τ 0 ; (ii) η n (κ) > 0 and η n (k n ) > 0, in which case a similar argument as in the above proof for the case where η 0 > 0 shows that the distance between τ n = η n (k n ) and τ 0 has the desired bound. The desired result holds conditionally on either event, so it also holds unconditionally.
We finally prove Theorem S1.
Proof of Theorem S1. Observe that

(S18)

Starting from this inequality, the rest of the proof is identical to that of Theorem 5 in Qiu et al. [34].
We present the argument below for completeness. Let {ε n } ∞ n=1 be a positive sequence, where each ε n is random through the observations O 1 , . . . , O n , such that ε n → p 0 as n → ∞.
We denote the three terms on the right-hand side by terms 1, 2, and 3, and study these terms separately. It is useful to note that τ n − τ 0 = o p (1), so the Lebesgue density of the distribution of ξ 0 (V ), V ∼ P 0 , is finite in a neighborhood of τ n with probability tending to one.
Therefore, term 2 is bounded as

where the last step holds with probability tending to one by the assumption that the distribution of ξ 0 (V ), V ∼ P 0 , has a continuous finite Lebesgue density in a neighborhood of τ 0 , and by Lemma S7. If η 0 > −∞, then, by Lemma S9, with probability tending to one,
Otherwise, by Lemma S6, with probability tending to one, τ n = 0 = τ 0 and the above result still holds.

S5.1 Results of simulation with nuisance functions being truth
In this section, we present the results of a simulation with a setting identical to that in Section 5 of the main text, except that the nuisance functions are taken to be the truth rather than estimated via machine learning. The purpose of this simulation is to show that the performance of our proposed estimator may be significantly improved by using machine learning estimators of the nuisance functions that outperform those used in the simulation study reported in the main text. Table S1 presents the performance of our proposed estimator in this simulation. The Wald CI coverage is close to 95% for sample sizes of 1000 or more. The coverage of the confidence lower bounds is also close to the nominal level of 97.5%. Therefore, our proposed procedure appears to have the potential to be significantly improved when using improved estimators of the nuisance functions. Figure S1 presents the width of our 95% Wald CI scaled by the square root of the sample size n. For each estimand, the scaled width appears to stabilize as n grows and to be similar to the scaled width observed in the simulation reported in Section 5, where nuisance functions are estimated from data.

S5.2 Simulation under a low dimension and a parametric model
In this section, we describe the additional simulation in a setting with a low dimension and a parametric model as well as the simulation results.
The data are generated as follows. We first generate a univariate covariate W ∼ Unif(−1, 1). We then generate T , C and Y as follows: T | W ∼ Bernoulli (expit(W )), C | T, W ∼ Bernoulli (expit(2T − 1 + W )), where C and Y are independent conditional on (W, T ). We set ρ FR : v → 0, V = W , and κ = 0.35, which corresponds to an active constraint with τ 0 > 0 and ρ RD 0 < 1. We use logistic regression to estimate the functions µ T 0 , µ C 0 and µ Y 0 . All other simulation settings are identical to those in Section 5.
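For reference, the covariate, treatment and cost mechanisms just described can be simulated as below. Since the outcome model for Y is not fully shown in this excerpt, a placeholder Gaussian outcome is used and labeled as such.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n, rng):
    """Sketch of the data-generating mechanism described above. The outcome
    model for Y is not fully specified in this excerpt, so a placeholder
    Gaussian outcome is used purely for illustration."""
    w = rng.uniform(-1.0, 1.0, size=n)          # W ~ Unif(-1, 1)
    t = rng.binomial(1, expit(w))               # T | W ~ Bernoulli(expit(W))
    c = rng.binomial(1, expit(2 * t - 1 + w))   # C | T, W ~ Bernoulli(expit(2T - 1 + W))
    y = rng.normal(size=n)                      # placeholder outcome (assumption)
    return w, t, c, y
```

Note that the cost C is binary here, so the expected additional cost of treatment is governed by the difference of the two conditional cost probabilities.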
The simulation results are presented in Table S2 and Figure S2. The performance generally falls between that in the nonparametric setting of Section 5 and the oracle setting of Section S5.1. The CI coverage is much better than in the nonparametric case, suggesting that our method might perform better with improved estimators of the nuisance functions µ T 0 , µ C 0 and µ Y 0 .

Figure S2: Boxplot of √n × CI width for the ATE relative to each reference ITR in the simulation with nuisance functions in a parametric model.