# Optimal Individualized Treatments in Resource-Limited Settings

Alexander R. Luedtke and Mark J. van der Laan

# Abstract

An individualized treatment rule (ITR) is a treatment rule which assigns treatments to individuals based on (a subset of) their measured covariates. An optimal ITR is the ITR which maximizes the population mean outcome. Previous works in this area have assumed that treatment is an unlimited resource so that the entire population can be treated if this strategy maximizes the population mean outcome. We consider optimal ITRs in settings where the treatment resource is limited so that there is a maximum proportion of the population which can be treated. We give a general closed-form expression for an optimal stochastic ITR in this resource-limited setting, and a closed-form expression for the optimal deterministic ITR under an additional assumption. We also present an estimator of the mean outcome under the optimal stochastic ITR in a large semiparametric model that at most places restrictions on the probability of treatment assignment given covariates. We give conditions under which our estimator is efficient among all regular and asymptotically linear estimators. All of our results are supported by simulations.

## 1 Introduction

Suppose one wishes to maximize the population mean of some outcome using some binary point treatment, where for each individual clinicians have access to (some subset of) measured baseline covariates. Such a treatment strategy is termed an individualized treatment regime (ITR), and the (counterfactual) population mean outcome under an ITR is referred to as the value of an ITR. The ITR which maximizes the value is referred to as the optimal ITR or the optimal rule. There has been much recent work on this problem in the case where treatment is an unlimited resource (see Murphy [1] and Robins [2] for early works on the topic, and Chakraborty and Moodie [3] for a recent overview). It has been shown that the optimal treatment in this context is given by checking the sign of the average treatment effect conditional on (some subset of) the baseline covariates, also known as the blip function [2].

The optimal ITR assigns treatment to people from a given strata of covariates for which treatment is on average beneficial, and does not assign treatment to this strata otherwise. If treatment is even slightly beneficial to all subsets of the population, then such a treatment strategy would suggest treating the entire population. There are many realistic situations in which such a treatment strategy, or any strategy that treats a large proportion of the population, is not feasible due to limitations on the total amount of the treatment resource. In a discussion of Murphy [1], Arjas observed that resource constraints may render optimal ITRs of little practical use when the treatment of interest is a social or educational program, though no solution to the constrained problem was given [4].

The mathematical modeling literature has considered the resource allocation problem to a greater extent. Lasry et al. [5] developed a model to allocate the annual CDC budget for HIV prevention programs to subpopulations which would benefit most from such an intervention. Tao et al. [6] consider a mathematical model to optimally allocate screening procedures for sexually tranmitted diseases subject to a cost constraint. Though Tao et al. do not frame the problem as a statistical estimation problem, they end up confronting similar optimization challenges to those that we will face. In particular, they confront the (weakly) NP-hard knapsack problem from the combinatorial optimization literature [7, 8]. We will end up avoiding most of the challenges associated with this problem by primarily focusing on stochastic treatment rules, which will reduce to the easier fractional knapsack problem [9, 8]. Stochastic ITRs allow the treatment to rely on some external stochastic mechanism for individuals in a particular strata of covariates.

We consider a resource constraint under which there is a maximum proportion of the population which can be treated. We primarily focus on evaluating the public health impact of an optimal resource-constrained (R-C) ITR via its value. The value function has been shown to be of interest in several previous works (see, e.g., Zhang et al. [10], van der Laan and Luedtke [11], Goldberg et al. [12]. Despite the general interest of this quantity, estimating this quantity is challenging for unconstrained deterministic regimes at so-called exceptional laws, i.e. probability distributions at which the blip function is zero in some positive probability strata of covariates [2]; a slightly more general assumption is given in Luedtke and van der Laan [13]. Chakraborty et al. [14] showed that one can develop confidence intervals for this parameter using m-out-of-n bootstrap, though these confidence intervals shrink at a slower than root-n rate. Luedtke and van der Laan [13] showed that root-n rate confidence intervals can be developed for this quantity under reasonable conditions in the large semiparametric model which at most places restrictions on the treatment mechanism.

We develop a root-n rate estimator for the optimal R-C value and corresponding confidence intervals in this same large semiparametric model. We show that our estimator is efficient among all regular and asymptotically linear estimators under conditions. When the baseline covariates are continuous and the resource constraint is active, i.e. when the optimal R-C value is less than the optimal unconstrained value, these conditions are far more reasonable than the non-exceptional law assumption needed for regular estimation of the optimal unconstrained value.

We now give a brief outline of the paper. Section 2 defines the statistical estimation problem of interest, gives an expression for the optimal deterministic rule under a condition, and gives a general expression for the optimal stochastic rule. Section 3 presents our estimator of the optimal R-C value. Section 4 presents conditions under which the optimal R-C value is pathwise differentiable, and gives an explicit expression for the canonical gradient under these conditions. Section 5 describes the properties of our estimator, including how to develop confidence intervals for the optimal R-C value. Section 6 presents our simulation methods. Section 7 presents our simulation results. Section 8 closes with a discussion and areas of future research. All proofs are given in the Appendix.

## 2 Optimal R-C rule and value

Suppose we observe n independent and identically distributed (i.i.d.) draws from a single time point data structure (W,A,Y)P0, where the vector of covariates W has support W, the treatment A has support {0,1}, and the outcome Y has support in the closed unit interval. Our statistical model is nonparametric, beyond possible knowledge of the treatment mechanism, i.e. the probability of treatment given covariates. Little generality is lost with the bound on Y, given that any continuous outcome bounded in [b, c] can be rescaled to the unit interval with the linear transformation (yb)/(cb). Suppose that treatment are resources are limited so that at most a κ(0,1) proportion of the population can receive the treatment A=1. Let V be some function of W, and denote the support of V with V. A deterministic treatment rule d˜ takes as input a function of the covariates vV and outputs a binary treatment decision d˜(v). The stochastic treatment rules considered in this work are maps from UV to {0,1}, where U is the support of some random variable UPU. If d is a stochastic rule and uU is fixed, then d(u,) represents a deterministic treatment rule. Throughout this work we will let U be drawn independently of all draws from P0.

For a distribution P, let QˉP(a,w)=ΔEP[Y|A=a,W=w]. For notational convenience, we let Qˉ0QˉP0. Let d˜ be a deterministic treatment regime. For a distribution P, let Ψ˜d˜=ΔEP0[QˉP(d˜(V),W)] represent the value of d˜. Under causal assumptions, this quantity is equal to the counterfactual mean outcome if, possibly contrary to fact, the rule d˜ were implemented in the population [15, 16]. The optimal R-C deterministic regime at P is defined as the deterministic regime d˜ which solves the optimization problem

(1)MaximizeΨ˜d˜PsubjecttoEP0d˜Vκ.

For a stochastic regime d, let ΨdP=ΔEPUΨ˜dU,.P represent the value of d. Under causal assumptions, this quantity is equal to the counterfactual mean outcome if, possibly contrary to fact, the stochastic rule d were implemented in the population (see Ref. [17] for a similar identification result). The optimal R-C stochastic regime at P is defined as the stochastic treatment regime d which solves the optimization problem

(2)MaximizeΨdPsubjecttoEPU×PdU,Vκ.

We call the optimal value under a R-C stochastic regime Ψ(P). Because any deterministic regime can be written as a stochastic regime which does not rely on the stochastic mechanism U, we have that Ψ(P)Ψ˜(P). The constraint EPU×P[d(U,V)]κ above is primarily meant to represent a clinical setting where each patient arrives at the clinic with covariate summary measure V, a value of U is drawn from PU for this patient, and treatment is then assigned according to d(U,V). By Fubini’s theorem, this is like rewriting the above constraint as EPEPU[d(U,V)]κ. Nonetheless, this constraint also represents the case where a single value of U=u is drawn for the entire population, and each individual is treated according to the deterministic regime d(u,), i.e. EPUEP[d(U,V)]κ. This case appears less interesting because, for a fixed u, there is no guarantee that EP[d(u,V)]κ.

For a distribution P, define the blip function as

Qˉb,P(v)=ΔEP|QˉP(1,W)QˉP(0,W)V=v.

Let SP represent the survival function of Qˉb,P, i.e. τPrP(Qˉb,P>τ). Let

ηP=Δinfτ:SP(τ)κ
(3)τP=ΔmaxηP,0.

For notational convenience we let Qˉb,0=ΔQˉb,P0, S0=ΔSP0, η0=ΔηP0, and τ0=ΔτP0.

Define the deterministic treatment rule d˜P as vI(Qˉb,P(v)>τP), and for notational convenience let d˜0=Δd˜P0. We have the following result.

Theorem 1

IfPrP(Qˉb,P(V)=τP)=0, then thed˜Pis an optimal deterministic rule satisfying the resource constraint, i.e. Ψ˜d˜P(P)attains the maximum described in eq. (1).

One can in fact show that d˜P is the P almost surely unique optimal deterministic regime under the stated condition. We do not treat the case where PrP(Qˉb,P(V)=τP)>0 for deterministic regimes, since in this case (1) is a more challenging problem: for discrete V with positive treatment effect in all strata, eq. (1) is a special case of the 0–1 knapsack problem, which is NP-hard, though is considered one of the easier problems in this class [7, 8]. In the knapsack problem, one has a collection of items, each with a value and a weight. Given a knapsack which can only carry a limited weight, the objective is to choose which items to bring so as to maximize the value of the items in the knapsack while respecting the weight restriction. Considering the optimization problem over stochastic rather than deterministic regimes yields a fractional knapsack problem, which is known to be solvable in polynomial time [9, 8]. The fractional knapsack problem differs from the 0–1 knapsack problem in that one can pack partial items, with the value of the partial items proportional to the fraction of the item packed.

Define the stochastic treatment rule dP by its distribution with respect to a random variable drawn from PU:

PrPUdP(U,v)=1={κSP(τP),ifQˉb,P(v)=τPandτP>0I(Qˉb,P(v)>τP),otherwise.

We will let d0=ΔdP0. Note that d˜P(V) and dP(U,V) are PU×P almost surely equal if PrP(Qˉb,P(V)=τP)=0 or if τP0, and thus have the same value in these settings. It is easy to show that

(4)EPU×P[dP(U,V)]=κifτP>0.

The following theorem establishes the optimality of the stochastic rule dP in a resource-limited setting.

Theorem 2

The maximum in eq. (2) is attained at d=dP, i.e. dPis an optimal stochastic rule.

Note that the above theorem does not claim that dP is the unique optimal stochastic regime. For discrete V, the above theorem is an immediate consequence of the discussion of the knapsack problem in Dantzig [9].

In this paper we focus on the value of the optimal stochastic rule. Nonetheless, the techniques that we present in this paper will only yield valid inference in the case where the data are generated according to a distribution P0 for which Pr0(Qˉb,0(V)=τ0)=0. This is analogous to assuming a non-exceptional law in settings where resources are not limited [13, 2], though we note that for continuous covariates V this assumption is much more likely if τ0>0. It seems unlikely that the treatment effect in some positive probability strata of covariates will concentrate on some arbitrary (determined by the constraint κ) value τ0. Nonetheless, one could deal with situations where Pr0(Qˉb,0(V)=τ0)>0 using similar martingale-based online estimation techniques to those presented in Luedtke and van der Laan [13].

## 3 Estimating the optimal optimal R-C value

We now present an estimation strategy for the optimal R-C rule. The upcoming sections justify this strategy and suggest that it will perform well for a wide variety of data generating distributions. The estimation strategy proceeds as follows:

1. 1.

Obtain estimates Qˉn, Qˉb,n, and gn of Qˉ0, Qˉb,0, and g0 using any desired estimation strategy which respects the fact that Y is bounded in the unit interval.

2. 2.

Estimate the marginal distributions of W and V with the corresponding empirical distributions.

3. 3.

Estimate S0 with the plug-in estimator Sn given by τ1ni=1nIQˉb,n(vi)>τ.

4. 4.

Estimate η0 with the plug-in estimator ηn=Δinfτ:Sn(τ)κ.

5. 5.

Estimate τ0 with the plug-in estimator given by τn=Δmax{ηn,0}.

6. 6.

Estimate d0 with the plug-in estimator dn with distribution

PrPU(dn(U,v)=1)={κSn(τn),ifQˉb,n(v)=τnandτn>0I(Qˉb,n(v)>τn),otherwise.
7. 7.

Run a TMLE for the parameter Ψdn(P0):

1. (a)

For a˜{0,1}, define H(a,w)=ΔPrPU(dn(U,v)=a)gn(a|w). Run a univariate logistic regression using:

Outcome:(yi:i=1,...,n)
Offset:logitQˉn(ai,wi):i=1,...,n
Covariate:H(ai,wi):i=1,...,n.

Let εn represent the estimate of the coefficient for the covariate, i.e.

εn=ΔargmaxεR1ni=1nQˉnε(ai,wi)logyi+1Qˉnε(ai,wi)log(1yi),

where Qˉnε(a,w)=Δlogit1logitQˉn(a,w)+εH(a,w).

2. (b)

Define Qˉn=ΔQˉnεn.

3. (c)

Estimate Ψdn(P0) using the plug-in estimator given by

Ψdn(Pn)=Δ1ni=1na=01Qˉn(a,wi)PrPU(dn(U,vi)=a).

We use Ψdn(Pn) as our estimate of Ψ(P0). We will denote this estimator Ψˆ, where we have defined Ψˆ so that Ψˆ(Pn)=Ψdn(Pn). Note that we have used a TMLE for the data dependent parameter Ψdn(P0), which represents the value under a stochastic intervention dn. Nonetheless, we assume that PrP0(Qˉb,0(V)=τ0)=0 for many of the results pertaining to our estimator Ψˆ, i.e. we assume that the optimal R-C rule is deterministic. We view estimating the value under a stochastic rather than deterministic intervention as worthwhile because one can give conditions under which the above estimator is (root-n) consistent for Ψ(P0) at all laws P0, even if non-negligible bias invalidates standard Wald-type confidence intervals for the parameter of interest at laws P0 for which PrP0(Qˉb,0(V)=τ0)>0.

We will use Pn to denote any distribution for which QˉPn=Qˉn, gPn=gn, and Pn has the marginal empirical distribution of W for the marginal distribution of W. We note that such a distribution Pn exists provided that Qˉn and gn fall in the parameter spaces of PQˉP(W) and PgP, respectively.

In practice we recommend estimating Qˉ0 and Qˉb,0 using an ensemble method such as super-learning to make an optimal bias-variance trade-off (or, more generally, minimize cross-validated risk) between a mix of parametric models and data adaptive regression algorithms [18, 19]. If the treatment mechanism g0 is unknown then we recommend using similar data adaptive approaches to obtain the estimate gn. If g0 is known (as in a randomized controlled trial without missingness), then one can either take gn=g0 or estimate g0 using a correctly specified parametric model, which we expect to increase the efficiency of estimators when the Qˉ0 part of the likelihood is misspecified [20, 21].

There is typically little downside to using data adaptive approaches to estimate the needed portions of the likelihood, though we do give a formal empirical process condition in Section 5.1 which describes exactly how data adaptive these estimators can be. If one is concerned about the data adaptivity of the estimators of the needed portions of the likelihood, then one can consider a cross-validated TMLE approach such as that presented in van der Laan and Luedtke [20] or an online one-step estimator as that presented in Luedtke and van der Laan [13]. These two approaches make no restrictions on the data adaptivity of the estimators of Qˉ0, Qˉb,0, or g0.

We now outline the main results of this paper, which hold under appropriate consistency and regularity conditions.

1. Asymptotic linearity of Ψˆ:

Ψˆ(Pn)Ψ(P0)=1ni=1nD0(Oi)+oP0(n1/2),

with D0 a known function of P0.

1. Ψˆ is an asymptotically efficient estimate of Ψ(P0).

2. One can obtain a consistent estimate σn2 for the variance of D0(O). An asymptotically valid 95% confidence intervals for Ψ(P0) given by Ψˆ(Pn)±1.96σn/n.

The upcoming sections give the consistency and regularity conditions which imply the above results.

## 4 Canonical gradient of the optimal R-C value

The pathwise derivative of Ψ will provide a key ingredient for analyzing the asymptotic properties of our estimator. We refer the reader to Pfanzagl [22] and Bickel et al. [23] for an overview of the crucial role that the pathwise derivative plays in semiparametric efficiency theory. We remind the reader that an estimator Φˆ is an asyptotically linear estimator of a parameter Φ(P0) with influence curve ICP0 provided that

Φˆ(Pn)Φ(P0)=1ni=1nICP0(Oi)+oP0(n1/2).

If Φ is pathwise differentiable with canonical gradient ICP0, then Φˆ is RAL and asymptotically efficient (minimum variance) among all such RAL estimators of Φ(P0) [23, 22].

For oO, a deterministic rule d˜, and a real number τ, define

D1(d˜,P)(o)=ΔI(a=d˜(v))gP(a|w)yQˉP(a,w)
D2(d˜,P)(o)=ΔQˉP(d˜(v),w)EPQˉP(d˜(V),W),

where gP(a|W)=ΔPrP(A=a|W). We will let g0=ΔgP0. We note that D1(d˜,P)+D2(d˜,P) is the efficient influence curve of the parameter Ψ˜d˜(P).

Let d be some stochastic rule. The canonical gradient of Ψd is given by

ICd(P)(o)=ΔEPU[D1(d(U,),P)(o)+D2(d(U,),P)(o)].

Define

D(d,τ,P)(o)=ΔICd(P)(o)τEPUd(U,v)κ.

For ease of reference, let D0=ΔD(d0,τ0,P0). The upcoming theorem makes use of the following assumptions.

• C1. g0 satisfies the positivity assumption: Pr0(0<g0(1|W)<1)=1.

• C2. Qˉb,0(W) has density f0 at η0, and 0<f0(η0)<.

• C3. S0 is continuous in a neighborhood of η0.

• C4. Pr0(Qˉb,0(V)=τ)=0 for all τ in a neighborhood of τ0.

We now present the canonical gradient of the optimal R-C value.

Theorem 3

Suppose C1 through C4. ThenΨis pathwise differentiable atP0with canonical gradientD0.

Note that C3) implies that Pr0(Qˉb,0(V)=τ0)=0. Thus d0 is (almost surely) deterministic and the expectation over PU in the definition of D0 is superfluous. Nonetheless, this representation will prove useful when we seek to show that our estimator solves the empirical estimating equation defined by an estimate of D(d0,τ0,P0).

When the resource constraint is active, i.e. τ0>0, the above theorem shows that Ψ has an additional component over the optimal value parameter when no resource constraints are present [11]. The additional component is τ0×EPU[d0(U,v)]κ, and is the portion of the derivative that relies on the fact that d0 is estimated and falls on the edge of the parameter space. We note that it is possible that the variance of D0(O) is greater than the variance of ICd0(P0)(O). If τ0=0 then these two variances are the same, so suppose τ0>0. Then, provided that Pr0(Qˉb,0(V)=τ0)=0, we have that

VarP0(D0(O))VarP0(ICd0(P0))
=τ0κ(1κ)τ02EP0|Qˉ0(1,W)d˜0(V)=1+2EP0|Qˉ0(0,W)d˜0(V)=0.

For any κ(0,1), it is possible to exhibit a distribution P0 which satisfies the conditions of Theorem 3 and for which VarP0(D0(O))>VarP0(ICd0(P0)(O)). Perhaps more surprisingly, it is also possible to exhibit a distribution P0 which satisfies the conditions of Theorem 3 and for which VarP0(D0(O))<VarP0(ICd0(P0)(O)). We omit further the discussion here because the focus of this work is on considering the estimating the value from the optimization problem (2), rather than discussing how this procedure relates to the estimation of other parameters.

## 5 Results about the proposed estimator

We now show that Ψˆ is an asymptotically linear estimator for Ψ(P0) with influence curve D0 provided our estimates of the needed parts of P0 satisfy consistency and regularity conditions. Our theoretical results are presented in Section 5.1, and the conditions of our main theorem are discussed in Section 5.2.

### 5.1 Inference for Ψ(P0)

For any distributions P and P0 satisfying positivity, stochastic intervention d, and real number τ, define the following second-order remainder terms:

R10(d,P)=ΔEPU×P01g0(d|W)g(d|W)QˉP(d,W)Qˉ0(d,W)
R20(d)=ΔEPU×P0(dd0)(Qˉb,0(V)τ0).

Above the reliance of d and d0 on (U,V) is omitted in the notation. Let R0(d,P)=ΔR10(d,P)+R20(d). The upcoming theorem will make use of the following assumptions.

1. 1.

g0 satisfies the strong positivity assumption: Pr0(δ<g0(1|W)<1δ)=1 for some δ>0.

2. 2.

gn satisfies the strong positivity assumption for a fixed δ>0 with probability approaching 1: there exists some δ>0 such that, with probability approaching 1, Pr0(δ<gn(1|W)<1δ)=1.

3. 3.

R0(dn,Pn)=oP0(n1/2).

4. 4.

EP0D(dn,τ0,Pn)(O)D0(O)2=oP0(1).

5. 5.

D(dn,τ0,Pn) belongs to a P0-Donsker class D with probability approaching 1.

6. 6.

1ni=1nD(dn,τ0,Pn)(Oi)=oP0(n1/2).

We note that the τ0 in the final condition above only enters the expression in the sum as a multiplicative constant in front of EPU[d(U,vi)]κ.

Theorem 4 (

Ψˆis asymptotically linear)Suppose C2) through 6. ThenΨˆis a RAL estimator ofΨ(P0)with influence curveD0, i.e.

Ψˆ(Pn)Ψ(P0)=1ni=1nD0(Oi)+oP0(n1/2).

Further, Ψˆ is efficient among all such RAL estimators of Ψ(P0).

Let σ02=ΔVarP0(D0). By the central limit theorem, nΨˆ(Pn)Ψ(P0) converges in distribution to a N(0,σ02) distribution. Let σn2=Δ1ni=1nD(dn,τn,Pn)(Oi)2 be an estimate of σ02. We now give the following lemma, which gives sufficient conditions for the consistency of τn for τ0.

Lemma 5

(Consistency of τn) Suppose C2) and C3). Also supposeQˉb,nis consistent forQˉb,0inL1(P0)and that the estimateQˉb,nbelongs to aP0Glivenko Cantelli class with probability approaching 1. Thenτnτ0in probability.

It is easy to verify that conditions similar to those of Theorem 4, combined with the convergence of τn to τ0 as considered in the above lemma, imply that σnσ0 in probability. Under these conditions, an asymptotically valid two-sided 1α confidence interval is given by

Ψˆ(Pn)±z1α/2σnn,

where z1α/2 denotes the 1α/2 quantile of a N(0,1) random variable.

### 5.2 Discussion of conditions of Theorem 4

Conditions C2) and C3). These are standard conditions used when attempting to estimate the κ-quantile η0, defined in eq. (3). Provided good estimation of Qˉb,0, these conditions ensure that gathering a large amount of data will enable one to get a good estimate of the κ-quantile of the random variable Qˉb,0. See Lemma 5 for an indication of what is meant by “good estimation” of Qˉb,0. It seems reasonable to expect that these conditions will hold when V contains continuous random variables and η0/=0, since we are essentially assuming that Qˉb,0 is not degenerate at the arbirtrary (determined by κ) point η0.

Condition C4). If τ0>0, then C4) is implied by C3). If τ0=0, then C4) is like assuming a non-exceptional law, i.e. that the probability of a there being no treatment effect in a strata of V is zero. Because τ0 is not known from the outset, we require something slightly stronger, namely that the probability of any specific small treatment effect is zero in a strata of V is zero. Note that this condition does not prohibit the treatment effect from being small, e.g. Pr0(|Qˉb,0(V)|<τ)>0 for all τ>0, but rather it prohibits there existing a sequence τm0 with the property that Pr0(Qˉb,0(V)=τm)>0 infinitely often. Thus this condition does not really seem any stronger than assuming a non-exceptional law. If one is concerned about such exceptional laws then we suggest adapting the methods in [13] to the R-C setting.

Condition 1. This condition assumes that people from each strata of covariates have a reasonable (at least a δ>0) probability of treatment.

Condition 2. This condition requires that our estimates of g0 respect the fact that each strata of covariates has a reasonable probability of treatment.

Condition 3. This condition is satisfied if R10(dn,Pn)=oP0(n1/2) and R20(dn)=oP0(n1/2). The term R10(dn,Pn) takes the form of a typical double robust term that is small if either or Qˉ0 is estimated well, and is second-order, i.e. one might hope that R10(dn,Pn)=oP0(n1/2), if both both g0 and Qˉ0 are estimated well. One can upper bound this remainder with a product of the L2(P0) rates of convergence of these two quantities using the Cauchy-Schwarz inequality. If g0 is known, then one can take gn=g0 and this term is zero.

Ensuring that R20(dn)=oP0(n1/2) requires a little more work but will still prove to be a reasonable condition. We will use the following margin assumption for some α>0:

(5)Pr00<|Qˉb,0τ0|ttαforallt>0,

where ‘‘’’ denotes less than or equal to up to a multiplicative constant. This margin assumption is analogous to that used in Audibert and Tsybakov [24]. The following result relates the rate of convergence of R20(dn) to the rate at which Qˉb,nτn converges to Qˉb,0τ0.

Theorem 6

If eq. (5) holds for someα>0, then

1. 1.

|R20(dn)|(Qˉb,nτn)(Qˉb,0τ0)2,P02(1+α)/(2+α)

2. 2.

|R20(dn)|(Qˉb,nτn)(Qˉb,0τ0),P01+α.

The above is similar to Lemma 5.2 in Audibert and Tsybakov [24], and a similar result was proved in the context of optimal ITRs without resource constraints in Luedtke and van der Laan [13]. If S0 has a finite derivative at τ0, as is given by C2), then one can take α=1. The above theorem then implies that R20(dn)=oP0(n1/2) if either (Qˉb,nτn)(Qˉb,0τ0)2,P0 is oP0(n3/8) or (Qˉb,nτn)(Qˉb,0τ0),P0 is oP0(n1/4).

Condition 4. This is a mild consistency condition which is implied by the L2(P0) consistency of dn, gn, and Qˉn to d0, g0, and Qˉ0. We note that the consistency of the intial (unfluctuated) estimate Qˉn for Qˉ0 will imply the consistency of Qˉn to Qˉ0 given 2, since in this case εn0 in probability, and thus QˉnQˉn2,P00 in probability.

Condition 5. This condition places restrictions on how data adaptive the estimators of d0, g0, and Qˉ0 can be. We refer the reader to Section 2.10 of van der Vaart and Wellner [25] for conditions under which the estimates of d0, g0, and Qˉ0 belonging to Donsker classes implies that D(dn,τ0,Pn) belongs to a Donsker class. We note that this condition was avoided for estimating the value function using a cross-validated TMLE in van der Laan and Luedtke [20] and using an online estimator of the value function in Luedtke and van der Laan [13], and using either technique will allow one to avoid the condition here as well.

Condition 6. Using the notation Pf=f(o)dP(o) for any distribution P and function f:OR, we have that

PnD(dn,τ0,Pn)=PnD1(dn,Pn)+PnD2(dn,Pn)
τ01ni=1nEPU[dn(U,vi)]κ.

The first term is zero by the fluctuation step of the TMLE algorithm and the second term on the right is zero because Pn uses the empirical distribution of W for the marginal distribution of W. If τ0=0 then clearly the third term is zero, so suppose τ0>0. Combining eq. (4) and the fact that dn is a substitution estimator shows that the third term is 0 with probability approaching 1 provided that τn>0 with probability approaching 1. This will of course occur if τnτ0>0 in probability, for which Lemma 5 gives sufficient conditions.

## 6 Simulation methods

We simulated i.i.d. draws from two data generating distributions at sample sizes 100, 200, and 1,000. For each sample size and distribution we considered resource constraints κ=0.1 and κ=0.9. We ran 2,000 Monte Carlo draws of each simulation setting. All simulations were run in R [26].

We first present the two data generating distributions considered, and then present the estimation strategies used.

### 6.1 Data generating distributions

#### 6.1.1 Simulation 1

Our first data generating distribution is identical to the single time point simulation considered in van der Laan and Luedtke [20] and Luedtke and van der Laan [18]. The outcome is binary and the baseline covariate vector W=(W1,...,W4) is four dimensional for this distribution, with

W1,W2,W3,W4i.i.d.N(0,1)
A|WBernoulli(1/2)
logitEP0Y|A,W,H=0=1W12+3W2+A5W324.45
logitEP0Y|A,W,H=1=0.5W3+2W1W2+A3|W2|1.5,

where H is an unobserved Bernoulli(1/2) variable independent of A,W. For this distribution EP0[Qˉ0(0,W)]EP0[Qˉ0(1,W)]0.464.

We consider two choices for V, namely V=W3, and V=W1,...,W4. We obtained estimates of the approximate optimal R-C optimal value for this data generating distribution using 107 Monte Carlo draws. When κ=0.1, Ψ(P0)0.493 for V=W3 and Ψ(P0)0.511 for V=W1,...,W4. When κ=0.9, Ψ(P0)0.536 for V=W3 and Ψ(P0)0.563 for V=W1,...,W4. We note that the resource constraint is not active (τ0=0) when κ=0.9 for either choice of V.

#### 6.1.2 Simulation 2

Our second data generating distribution is very similar to one of the distributions considered in Luedtke and van der Laan [13], though has been modified so that the treatment effect is positive for all values of the covariate. The data are generated as follows:

W~Uniform(1,1)
A|W~Bernoulli(1/2)
Y|A,W˜Bernoulli(Q0(A,W)),

where for W˜=ΔW+5/6 we define

Qˉ0(A,W)610=Δ{0,ifA=1and1/2W1/3W˜3+W˜213W˜+127,ifA=1andW<1/2W3+W213W+127,ifA=1andW>1/3310,otherwise.

For this distribution EP0[Qˉ0(0,W)]=0.3 and EP0[Qˉ0(1,W)]0.583.

We use V=W. This simulation is an example of a case where Qˉb,0(V)>0 almost surely, so any constraint on resources will reduce the optimal value from its unconstrained value of 0.583. In particular, we have that Ψ(P0)0.337 when κ=0.1 and Ψ(P0)0.572 when κ=0.9.

### 6.2 Estimating nuisance functions

We treated g0 as known in both simulations and let gn=g0. We estimated Qˉ0 using the super-learner algorithm with the quasi-log-likelihood loss function (family = binomial) and a candidate library of data adaptive (SL.gam and SL.nnet) and parametric algorithms (SL.bayesglm, SL.glm, SL.glm.interaction, SL.mean, SL.step, SL.step.interaction, and SL.step.forward). We refer the reader to table 2 in the technical report Luedtke and van der Laan [18] for a brief description of these algorithms. We estimated Qˉb,0 by running a super-learner using the squared error loss function and the same candidate algorithms and used W to predict the outcome Y˜=Δ2A1g0(A|W)(YYˉn)+Yˉn, where Yˉn represents the sample mean of Y from the n observations. See Luedtke and van der Laan [18] for a justification of this estimation scheme.

Once we had our estimates Qˉn, Qˉb,n, and gn we proceeded with the estimation strategy described in Section 3.

### 6.3 Evaluating performance

We used three methods to evaluate our proposed approach. First, we looked at the coverage of two-sided 95% confidence intervals for the optimal R-C value. Second, we report the average confidence interval widths. Finally, we looked at the power of the α=0.025 level test H0:Ψ(P0)=μ0 against H1:Ψ(P0)>μ0, where μ0=ΔE0[Qˉ0(0,W)] is treated as a known quantity. Under causal assumptions, μ0 can be identified with the counterfactual quantity representing the population mean outcome if, possibly contrary to fact, no one receives treatment. If treatment is not currently being given in the population, one could substitute the population mean outcome (if known) for μ0. Our test of significance consisted of checking of the lower bound in the two-sided 95% confidence interval is greater than μ0. If an estimator of Ψ(P0) is low-powered in testing H0 against H1 then clearly the estimator will have little practical value.

## 7 Simulation results

### Figure 1:

Coverage of two-sided 95% confidence intervals. As expected, coverage increases with sample size. The coverage tends to be better for κ=0.1 than for κ=0.9, though the estimator performed well at the largest sample size (1,000) for all simulations and choices of κ. Error bars indicate 95% confidence intervals to account for uncertainty from the finite number of Monte Carlo draws.

The proposed estimation strategy performed well overall. Figure 1 demonstrates the coverage of 95% confidence intervals for the optimal R-C value. All methods performed well at all sample sizes for the highly constrained setting where κ=0.1. The results were more mixed for the resource constraint κ=0.9. All methods performed well at the largest sample size considered. This supports our theoretical results, which were all asymptotic in nature. For Simulation 1, in which the resource constraint was not active for either choice of V, the coverage dropped off at lower sample sizes. Coverage was approximately 90% in the two smaller sample sample sizes for V=W3, which may be expected for such an asymptotic method. For the more complex problem of estimating the optimal value when V=W1,...,W4 the coverage was somewhat lower (80% when n=100 and 84% when n=200). In Simulation 2, the coverage was better (>91%) for the smaller sample sizes. We note that the resource constraint was still active (τ0>0) when κ=0.9 for this simulation, and also that the estimation problem is easier because the baseline covariate was univariate.

We report the average confidence interval widths across the 2,000 Monte Carlo draws. For n=100, average confidence interval widths were between 0.25 and 0.26 across all simulations and choices of κ. For n=200, all average confidence interval widths were between 0.17 and 0.18. For n=1000, all average confidence interval widths were approximately 0.08. We note that the usefulness of such confidence intervals varies across simulations and choices of κ. When V=W3 and κ=0.1 in Simulation 1, the optimal R-C value is approximately 0.493, versus a baseline value μ0=EP0[Qˉ0(0,W)] of approximately 0.464. Thus here the confidence interval would give the investigator little information, even at a sample size of 1,000. In Simulation 2 with κ=0.9, on the other hand, the optimal R-C value is approximately 0.572, versus a baseline value of μ00.3. Thus here all confidence intervals would likely be informative for investigators, even those made for data sets of size 100.

### Figure 2:

Power of the α=0.025 level test of H0:Ψ(P0)=μ0 against H1:Ψ(P0)>μ0, where μ0=EP0[Qˉ0(0,W)] is treated as known. Power increases with sample size and κ. Error bars indicate 95% confidence intervals to account for uncertainty from the finite number of Monte Carlo draws.

Figure 2 gives the power of the α=0.025 level test H0:Ψ(P0)=μ0 against the alternative H1:Ψ(P0)>μ0. Overall our method appears to have reasonable power in this statistical test. We see that power increases with sample size, the key property of consistent statistical tests. We also see that power increases with κ, which is unsurprising given that Y is binary and g0(a|w) is 1/2 for all a,w. We note that power will not always increase with κ, for example if P0 is such that g0(1|w) is very small for individuals with covariate w who are treated at κ=0.9 but not at κ=0.1. This observation is not meant as a criticism to the estimation scheme that we have presented because we assume that κ will be chosen to reflect real resource constraints, rather than to maximize the power for a test H0:Ψ(P0)=μ versus H1:Ψ(P0)>μ for some fixed μ.

We also implemented an estimating equation based estimator for the optimal R-C value and found the two methods performed similarly. We would recommend using the TMLE in practice because it has been shown to be robust to near positivity violations in a wide variety of settings [27]. We note that g0(1|w)=1/2 for all w in both of our simulations, so no near positivity violations occurred. We do not consider the estimation equation approach any further here because the focus of this work is on considering the optimization problem (2), rather than on comparing different estimation frameworks.

## 8 Discussion and future work

We have considered the problem of estimating the optimal resource-constrained value. Under causal assumptions, this parameter can be identified with the maximum attainable population mean outcome under individualized treatment rules which rely on some summary of measured covariates, subject to the constraint that a maximum proportion κ of the population can be treated. We also provided an explicit expression for an optimal stochastic rule under the resource constraint.

We derived the canonical gradient of the optimal R-C value under the key assumption that the treatment effect is not exactly equal to τ0 in some strata of covariates which occurs with positive probability. The canonical gradient plays a key role in developing asymptotically linear estimators. We found that the canonical gradient of the optimal R-C value has an additional component when compared to the canonical gradient of the optimal unconstrained value when the resource constraint is active, i.e. when τ0>0.

We presented a targeted minimum loss-based estimator for the optimal R-C value. This estimator was designed to solve the empirical mean of an estimate of the canonical gradient. This quickly yielded conditions under which our estimator is RAL, and efficient among all such RAL estimators. All of these results rely on the condition that the treatment effect is not exactly equal to τ0 for positive probability strata of covariates. This assumption is more plausible than the typical non-exceptional law assumption when the covariates are continuous and the constraint is active because it may be unlikely that the treatment effect concentrates on an arbitrary (determined by κ) τ0>0. We note that this pseudo-non-exceptional law assumption has implied that the optimal stochastic rule is almost surely equal to the optimal deterministic rule. Though we have not presented formal theorems here, it is not difficult to derive conditions under which our estimator of the optimal value under a R-C stochastic rule is (root-n) consistent even when the treatment effect is equal to τ0 with positive probability, though the bias will be non-negligible (converge to zero at the same root-n rate as the variance). One could use an analogue of the variance-stabilized online estimator presented in Luedtke and van der Laan [13] to get inference for the optimal R-C value in this setting.

Our simulations confirmed our theoretical findings. We found that coverage improved with sample size, with near-nominal coverage at the largest sample size considered. This is not surprising given that most of our analytic results were asymptotic, though we note that the method also performed well at the smaller sample sizes considered. The confidence intervals were informatively tight when one considered the difference between the optimal R-C value and the value under no treatment. Further simulations are needed to fully understand the behavior of this method in practice.

Some resource constraints encountered in practice may not be of the form EPU×P0[d(U,V)]κ. For example, the cost of distributing the treatment to people may vary based on the values of the covariates. For simplicity assume V=W. If c:W[0,) is a cost function, then this constraint may take the form EPU×P0[c(W)d(U,W)]κ. If τ0=0, then an optimal stochastic rule under such a constraint takes the form (u,w)I(Qˉb,0(w)>0). If τ0>0, then an optimal stochastic rule under such a constraint takes the form (u,w)I(Qˉb,0(w)>τ0c(w)) for w for which Qˉb,0(w)/=τ0c(w) or c(w)=0, and randomly distributes the remaining resources uniformly among all remaining w. We leave further consideration of this more general resource constraint problem to future work.

In this work our primary focus has been on estimating the optimal value under a resource constraint, rather than the optimal rule under a resource constraint. Nonetheless, our estimation procedure yields an estimate dn of the optimal R-C rule. It would be interesting to further analyze dn in future work to better understand how well this estimator will perform, or if there are better estimators which more directly frame the estimation challenge as a (weighted) classification problem [28, 29]. Note that we are not guaranteed that dn satisfies the constraint, i.e. it is quite possible that EPU×P0[dn(U,V)]>κ, though concentration inequalities suggest that one can give conditions under which EPU×P0[dn(U,V)]κ is small with probability approaching 1. One could also seek an optimal rule estimate dn which satisfies that, with probability at least 1δ for some user-defined δ>0, EPU×P0[dn(U,V)]κ.

Further work is needed to generalize this work to the multiple time point setting. Before generalizing the procedure, one must know exactly what form the multiple time point constraint takes. For example, it may be the case that only a κ proportion of the population can be treated at each time point, or it may be the case that treatment can only be administered at a κ proportion of patient-time point pairs. Regardless of which constraint one chooses, it seems that the nice recursive structure encountered in Q-learning may not hold for multiple time point R-C problems. While useful for computational considerations, being able to express the optimal rule using approximate dynamic programming is not necessary for the existence of a good optimal rule estimator, especially when the number of time points is small. If the computational complexity of the procedure is a major concern, it may be beneficial to frame the multiple time point learning problem as a single optimization problem, using smooth surrogates for indicator functions as Zhao et al. [30] do when they introduce simultaneous outcome weighted learning (SOWL). One would then need to appropriately account for the fact that the empirical resource constraint may only be approximately satisfied.

We have not considered the ethical considerations associated with allocating limiting resources to a population. The debate over the appropriate means to distribute limited treatment resources to a population is ongoing (see, e.g., Brock and Wikler [31], Macklin and Cowan [32], Singh [33], for examples in the treatment of HIV/AIDS). Clearly any investigator needs to consider the ethical issues associated with certain resource allocation schemes. Our method is optimal in a particular utilitarian sense (maximizing the expected population mean outcome with respect to an outcome of interest) and yields a treatment strategy which treats individuals who are expected to benefit most from treatment in terms of our outcome of interest. One must be careful to ensure that the outcome of interest truly captures the most important public health implications. Unlike in unconstrained individualized medicine, inappropriately prescribing treatment to a stratum will also have implications for individuals outside of that strata, namely for the individuals who do not receive treatment due to its lack of availability. We leave further ethical considerations to experts on the matter. It will be interesting to see if there are settings in which it is possible to transform the outcome or add constraints to the optimization problem so that the statistical problem considered in this paper adheres to the ethical guidelines in those settings.

We have looked to generalize previous works in estimating the value of an optimal individualized treatment regime to the case where the treatment resource is a limited resource, i.e. where it is not possible to treat the entire population. This work should allow for the application of optimal personalized treatment strategies to many new problems of interest.

Funding statement: Funding: This research was supported by NIH grant R01 AI074345-06. AL was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

### References

1. Murphy SA. Optimal dynamic treatment regimes. J Roy Stat Soc Ser B 2003;650:331–6. Search in Google Scholar

2. Robins JM. Optimal structural nested models for optimal sequential decisions. In: DY Lin and Heagerty P, editors. Proc. Second Seattle Symp. Biostat, 179, 2004:189–326. Search in Google Scholar

3. Chakraborty B, Moodie EE. Statistical methods for dynamic treatment regimes. Berlin, Heidelberg, New York: Springer, 2013. Search in Google Scholar

4. Arjas E, Jennison C, Dawid AP, Cox DR, Senn S, Cowell RG, et al. Optimal dynamic treatment regimes – discussion on the paper by murphy. J Roy Stat Soc Ser B 2003;65:355–66. Search in Google Scholar

5. Lasry A, Sansom SL, Hicks KA, Uzunangelov V. A model for allocating CDCs HIV prevention resources in the United States. Health Care Manag Sci 2011;140:115–24. Search in Google Scholar

6. Tao G, Zhao K, Gift T, Qiu F, Chen G. Using a resource allocation model to guide better local sexually transmitted disease control and prevention programs. Oper Res Heal Care 2012;10:23–9. Search in Google Scholar

7. Karp RM. Reducibility among combinatorial problems. New York, Berlin, Heidelberg: Springer, 1972. Search in Google Scholar

8. Korte B, Vygen J. Combinatorial optimization, 5th ed. Berlin, Heidelberg, New York: Springer, 2012. Search in Google Scholar

9. Dantzig GB. Discrete-variable extremum problems. Oper Res 1957;50:266–88. Search in Google Scholar

10. Zhang B, Tsiatis A, Davidian M, Zhang M, Laber E. A robust method for estimating optimal treatment regimes. Biometrics 2012;68:1010–18. Search in Google Scholar

11. van der Laan MJ, Luedtke AR. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. Technical report 329. Available at: http://www.Bepress.Com/Ucbbiostat/, Division of Biostatistics, University of California, Berkeley, 2014a. Search in Google Scholar

12. Goldberg Y, Song R, Zeng D, Kosorok MR. Comment on dynamic treatment regimes: technical challenges and applications. Electron J Stat 2014;8:1290–300. Search in Google Scholar

13. Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Technical Report 332. Available at: http://biostats.bepress.com/ucbbiostat/paper332/, Division of Biostatistics, University of California, Berkeley, submitted to Annals of Statistics, 2014b. Search in Google Scholar

14. Chakraborty B, Laber EB, Zhao Y-Q. Inference about the expected performance of a data-driven dynamic treatment regime. Clin Trials 2014;110:408–17. Search in Google Scholar

15. Pearl J. Causality: models, reasoning and inference, 2nd ed. New York: Cambridge University Press, 2009. Search in Google Scholar

16. Robins JM. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. Comput Math Appl 1987;140:139s–161s. ISSN 0097–4943. Search in Google Scholar

17. Daz I, van der Laan MJ. Population intervention causal effects based on stochastic interventions. Biometrics 2012;680:541–9. Search in Google Scholar

18. Luedtke AR, van der Laan MJ. Super-learning of an optimal dynamic treatment rule. Technical Report 326. Available at: http://www.bepress.com/ucbbiostat/, Division of Biostatistics, University of California, Berkeley, under Review at JCI, 2014a. Search in Google Scholar

19. van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol 2007;60. Article 25.ISSN 1. Search in Google Scholar

20. van der Laan MJ, Luedtke AR. Targeted learning of the mean outcome under an optimal dynamic treatment rule. J Causal Inference 2014b. (Ahead Of print). DOI: 10.1515/jci-2013-0022. Search in Google Scholar

21. van der laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York, Berlin, Heidelberg: Springer, 2003. Search in Google Scholar

22. Pfanzagl J. Estimation in semiparametric models. Berlin, Heidelberg, New York: Springer, 1990. Search in Google Scholar

23. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press, 1993. Search in Google Scholar

24. Audibert JY, Tsybakov AB. Fast learning rates for plug-in classifiers. Ann Statist 2007;350:608–33. Search in Google Scholar

25. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Berlin, Heidelberg, New York: Springer, 1996. Search in Google Scholar

26. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2014. Available at: http://www.r-project.org/. Search in Google Scholar

27. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011. Search in Google Scholar

28. Rubin DB, van der laan MJ. Statistical issues and limitations in personalized medicine research with clinical trials. Int J Biostat 2012;8:article 18. Search in Google Scholar

29. Zhao Y, Zeng D, Rush A, Kosorok M. Estimating individual treatment rules using outcome weighted learning. J Am Stat Assoc 2012;107:1106–18. Search in Google Scholar

30. Zhao YQ, Zeng D, Laber EB, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. J Am Stat Assoc 2014. Search in Google Scholar

31. Brock DW, Wikler D. Ethical challenges in long-term funding for HIV/AIDS. Health Aff 2009;280:1666–76. Search in Google Scholar

32. Macklin R, Cowan E. Given financial constraints, it would be unethical to divert antiretroviral drugs from treatment to prevention. Health Aff 2012;310:1537–44. Search in Google Scholar

33. Singh JA. Antiretroviral resource allocation for HIV prevention. AIDS 2013;270:863–5. Search in Google Scholar

## Appendix: Proofs

### Proofs for Section 2

We first state a simple lemma.

Lemma 6

For a distribution P and a stochastic rule d, we have the following representation forΨd:

Ψd(P)=ΔEPU×Pd(U,V)Qˉb,P(V)+EP[QˉP(0,W)].

Proof of Lemma 6. We have that

Ψd(P)=EPU×P[d(U,V)QˉP(1,W)]+EPU×P[(1d(U,V))QˉP(0,W)]
=EPU×P[d(U,V)(QˉP(1,W)QˉP(0,W))]+EP[QˉP(0,W)]
=EPU×P[d(U,V)Qˉb,P(V)]+EP[QˉP(0,W)],

where the final equality holds by the law of total expectation.□

Proof of Theorem 1. This result will be a consequence of Theorem 2. If PrP(Qˉb,0(V)=τP)=0, then dP(U,V) is PU×P almost surely equal to d˜P(V), and thus Ψ˜d˜P(P)=ΨdP(P). Thus (u,v)d˜P(v) is an optimal stochastic regime. Because the class of deterministic regimes is a subset of the class of stochastic regimes, d˜P is an optimal deterministic regime.□

Proof of Theorem 2. Let d be some stochastic treatment rule which satisfies the resource constraint. For (b,c){0,1}2, define Bbc=Δ{(u,v):dP(u,v)=b,d(u,v)=c}. Note that

ΨdP(P)Ψd(P)=EPU×PdP(U,V)d(U,V)Qˉb,0(V)
(6)=EPU×PQˉb,0(V)I((U,V)B10)EPU×PQˉb,0(V)I((U,V)B01)

The Qˉb,0(V) in the first term in 1 can be upper bounded by τP, and in the second term can be lower bounded by τP. Thus,

ΨdP(P)Ψd(P)τPPrPU×P(U,V)B10PrPU×P(U,V)B01
=τPPrPU×P(U,V)B10B11PrPU×P(U,V)B01B11
=τP(EPU×P[dP(U,V)]EPU×P[d(U,V)]).

If τP=0 then the final line is zero. Otherwise, EPU×P[dP(U,V)]=κ by eq. (4). Because d satisfies the resource constraint, EPU×Pd(U,V)κ and thus the final line above is at least zero. Thus ΨdP(P)Ψd(P)0 for all τP. Because d was arbitrary, dP is an optimal stochastic rule.□

### Proofs for Section 4

Proof of Theorem 3. The pathwise derivative of Ψ(Q) is defined as |ddεΨ(Q(ε))ε=0 along paths {Pε:ε}M. In particular, these paths are chosen so that

dQW,ε=(1+εHW(W))dQW,
whereEHW(W)=0andCW=Δsupw|HW(w)|<;
dQY,ε(Y|A,W)=(1+εHY(Y|A,W))dQY(Y|A,W),
whereE(HY|A,W)=0andsupw,a,y|HY(y|a,w)|<.

The parameter Ψ is not sensitive to fluctuations of g0(a|w)=Pr0(a|w), and thus we do not need to fluctuate this portion of the likelihood. Let Qˉb,ε=ΔQˉb,Pε, Qˉε=ΔQˉPε, dε=ΔdPε, ηε=ΔηPε, τε=ΔτPε, and Sε=ΔSPε. First note that

(7)Qˉb,ε(v)=Qˉb,0(v)+εhε(v)

for an hε with

(8)sup|ε|<1supv|hε(v)|=ΔC1<.

Note that C4) implies that d0 is (almost surely) deterministic, i.e. d0(U,) is almost surely a fixed function. Let d˜ represent the deterministic rule vI(Qˉb,0(v)>0) to which d(u,) is (almost surely) equal for all u. By Lemma 1,

(9)Ψ(Pε)Ψ(P0)=w(EPU[dε(U,V)]d˜0(V))Q¯b,εdQW,ε+wd˜0(V)(Q¯b,εdQW,εQ¯b,0dQW,0)+EPεQ¯ε(0,W)EP0Q¯0(0,W)=w(EPU[dε(U,V)]d˜0(V))(Q¯b,ετ0)dQW,ε+τ0w(EPU[dε(U,V)]dQW,εd˜0(V)dQW,0)τ0wd˜0(V)(dQW,εdQW,0)+Ψd0(Pε)Ψd0(P0).

Dividing the fourth term by ε and taking the limit as ε0 gives the pathwise derivative of the mean outcome under the rule that treats d0 as known. The third term can be written as ετ0wd˜0(V)HWdQW,0, and thus the pathwise derivative of this term is wτ0d˜0(V)HWdQW,0. If τ0>0, then EPU×P0[d˜0(V)]=κ. The pathwise derivative of this term is zero if τ0=0. Thus, for all τ0,

limε01ετ0wd˜0(V)dQW,εdQW,0=wτ0(d˜0(v)κ)HW(w)dQW,0(w).

Thus the third term in eq. (9) generates the vτ0(d˜0(v)κ) portion of the canonical gradient, or equivalently vτ0(EPU[d0(U,v)]κ). The remainder of this proof is used to show that the first two terms in eq. (9) are o(ε).

Step 1: ηεη0.

We refer the reader to eq. (3) for a definition of the quantile PηP. This is a consequence of the continuity of S0 in a neighborhood of η0. For γ>0,

(10)|ηεη0|>γimpliesthatSε(η0γ)κorSε(η0+γ)>κ.

For positive constants C1 and CW,

Sε(η0γ)(1CW|ε|)Pr0Qˉb,ε>η0γ(1CW|ε|)S0(η0γ+C1|ε|).

Fix γ>0 small enough so that S0 is continuous at η0γ. In this case we have that S0(η0γ+C1|ε|)S0(η0γ) as ε0. By the infimum in the definition of η0, we know that S0(η0γ)>κ. Thus Sε(η0γ)>κ for all |ε| small enough.

Similarly, Sε(η0+γ)(1+CW|ε|)S0(η0+γC1|ε|). Fix γ>0 small enough so that S0 is continuous at η0+γ. Then S0(η0+γC1|ε|)S0(η0+γ) as ε0. Condition C2) implies the uniqueness of the κ-quantile of Qˉb,0, and thus that S0(η0+γ)<κ. It follows that Sε(η0+γ)<κ for all |ε| small enough.

Combining Sε(η0γ)>κ and Sε(η0+γ)<κ for all ε close to zero with eq. (10) shows that ηεη0 as ε0.

Step 2: Second term of eq. (9) is 0 eventually.

If τ0=0 then the result is immediate, so suppose τ0>0. By the previous step, ηεη0, which implies that τετ0>0 by the continuity of the max function. It follows that τε>0 for ε large enough. By eq. (4), PrPU×Pε(dε(U,V)=1)=κ for all sufficiently small |ε| and Pr0(d˜0(V)=1)=κ. Thus the second term of eq. (9) is 0 for all |ε| small enough.

Step 3: τετ0=O(ε).

Note that κ<Sε(ηε|ε|)(1+CW|ε|)S0(ηε(1+C1)|ε|). A Taylor expansion of S0 about η0 shows that

κ<1+CW|ε|S0(η0)+(ηεη0(1+C1)|ε|)(f0(η0)+o(1))
=κ+(ηεη0(1+C1)|ε|)(f0(η0)+o(1))+O(ε)
(11)=κ(ηεη0)f0(η0)+o(ηεη0)+O(ε).

The fact that f0(η0)(0,) shows that ηεη0 is bounded above by some O(ε) sequence. Similarly, κSε(ηε+|ε|)(1CW|ε|)S0(ηε+(1+C1)|ε|). Hence,

κ1CW|ε|S0(η0)+(ηεη0+(1+C1)|ε|)(f0(η0)+o(1))
=κ(ηεη0)f0(η0)+o(ηεη0)+O(ε).

It follows that ηεη0 is bounded below by some O(ε) sequence. Combining these two bounds shows that ηεη0=O(ε), which immediately implies that τετ0=max{O(ε),0}=O(ε).

Step 4: First term of eq. (9) iso(ε).

We know that

Qˉb,0(V)τ0+O(ε)Qˉb,ε(V)τεQˉb,0(V)τ0+O(ε).

By C4), it follows that there exists some δ>0 such that sup|ε|<δPr0(Qˉb,ε(V)=τε)=0. By the absolute continuity of QW,ε with respect to QW,0, sup|ε|<δPrPε(Qˉb,ε(V)=τε)=0. It follows that, for all small enough |ε| and almost all u, dε(u,v)=I(Qˉb,ε(v)>τε). Hence,

wEPU[dε(U,V)]d0(V)Qˉb,ετ0dQW,ε=wI(Qˉb,ε>τε)I(Qˉb,0>τ0)Qˉb,ετ0dQW,εwI(Qˉb,ε>τε)I(Qˉb,0>τ0)Qˉb,0τ0+C1|ε|dQW,εwI(|Qˉb,0τ0||Qˉb,0τ0Qˉb,ε+τε|)Qˉb,0τ0+C1|ε|dQW,ε=wI(0<|Qˉb,0τ0||Qˉb,0τ0Qˉb,ε+τε|)Qˉb,0τ0+C1|ε|dQW,εO(ε)wI(0<|Qˉb,0τ0|O(ε))dQW,εO(ε)(1+CW|ε|)Pr00<|Qˉb,0τ0|O(ε),

where the penultimate inequality holds by Step 3 and eq. (7). The last line above is o(ε) because Pr(0<Xε)0 as ε0 for any random variable X. Thus dividing the left-hand side above by ε and taking the limit as ε0 yields zero.□

### Proofs for Section 5

We give the following lemma before proving Theorem 4.

Lemma 7

LetP0and P be distributions which satisfy the positivity assumption and for which Y is bounded in probability. Let d be some stochastic treatment rule andτbe some real number. We have thatΨd(P)Ψ(P0)=EP0[D(d,τ0,P)(O)]+R0(d,P).

Proof of Lemma 12. Note that

Ψd(P)Ψ(P0)+EP0[D(d,τ0,P)(O)]
=Ψd(P)Ψd(P0)+j=12EPU×P0[Dj(d(U,),P)(O)]
+Ψd(P0)Ψd0(P0)τ0EPU×P0[d(U,V)κ].

Standard calculations show that the first term on the right is equal to R10(d,P) [21]. If τ0>0, then eq. (4) shows that τ0EPU×P0[dκ]=τ0EPU×P0[dd0]. If τ0=0, then obviously τ0EPU×P0[dκ]=τ0EPU×P0[dd0]. Lemma 1 shows that Ψd(P0)Ψd0(P0)=EPU×P0[(dd0)Qˉb,0]. Thus the second line above is equal to R20(d).□

Proof of Theorem 4. We make use of empirical process theory notation in this proof so that Pf=EP[f(O)] for a distribution P and function f. We have that

Ψˆ(Pn)Ψ(P0)
(12)=P0D(dn,τ0,Pn)+R0(dn,Pn)
(13)=(PnP0)D(dn,τ0,Pn)+R0(dn,Pn)+oP0(n1/2)
=(PnP0)D0+(PnP0)(D(dn,τ0,Pn)D0)+R0(dn,Pn).

The middle term on the last line is oP0(n1/2) by 1, 2, 4, and 5 [25], and the third term is oP0(n1/2) by 3. This yields the asymptotic linearity result. Proposition 1 in Section 3.3 of Bickel et al. [23] yields the claim about regularity and asymptotic efficiency when conditions C2), C3), C4), and 1 hold (see Theorem 3).□

Proof of Lemma 5. We will show that ηnη0 in probability, and then the consistency of τn follows by the continuous mapping theorem. By C3), there exists an open interval N containing η0 on which S0 is continuous. Fix ηN. Because Qˉb,n belongs to a Glivenko-Cantelli class with probability approaching 1, we have that

Sn(η)S0(η)=PnI(Qˉb,n>η)P0I(Qˉb,0>η)
P0I(Qˉb,n>η)I(Qˉb,0>η)+(PnP0)I(Qˉb,n>η)
(14)|Sn(η)S0(η)|=|PnI(Q¯b,n>η)P0I(Q¯b,0>η)||P0(I(Q¯b,n>η)I(Q¯b,0>η))|+|(PnP0)I(Q¯b,n>η)||P0(I(Q¯b,n>η)I(Q¯b,0>η))|Tn(η)+oP0(1),

where we use the notation Pf=EP[f(O)] for any distribution P and function f. Let Zn(η)(w)=ΔI(Qˉb,n(w)>η)I(Qˉb,0(w)>η)2. The following display holds for all q>0:

Tn(η)P0Zn(η)
=P0Zn(η)I(|Qˉb,0η|>q)+P0Zn(η)I(|Qˉb,0η|q)
(15)=P0Zn(η)I(|Qˉb,0η|>q)+P0Zn(η)I(0<|Qˉb,0η|q)
(16)P0Zn(η)I(|Qˉb,nQˉb,0|>q)+P0Zn(η)I(0<|Qˉb,0η|q)
Pr0|Qˉb,nQˉb,0|>q+Pr00<|Qˉb,0η|q
P0|Qˉb,nQˉb,0|q+Pr00<|Qˉb,0η|q.

Above eq. (15) holds because C3) implies that Pr0(Qˉb,0=η)=0, eq. (16) holds because Zn(η)=1 implies that |Qˉb,nQˉb,0||Qˉb,0η|, and the final inequality holds by Markov’s inequality. The lemma assumes that EP0|Qˉb,nQˉb,0|=oP0(1), and thus we can choose a sequence qn0 such that

Tn(η)Pr00<|Qˉb,0η|qn+oP0(1).

To see that the first term on the right is o(1), note that Pr0(Qˉb,0=η)=0 combined with the continuity of S0 on N yield that, for n large enough,

Pr00<|Qˉb,0η|qn=S0(qn+η)S0(qn+η).

The right-hand side is 0(1), and thus Tn(η)=oP0(1). Plugging this into eq. (14) shows that Sn(η)S0(η) in probability. Recall that ηN was arbitrary.

Fix γ>0. For γ small enough, η0γ and η0+γ are contained in N. Thus Sn(η0γ)S0(η0γ) and Sn(η0+γ)S0(η0+γ) in probability. Further, S0(η0γ)>κ by the definition of η0 and S0(η0+γ)<κ by Condition C2). It follows that, with probability approaching 1, Sn(η0γ)>κ and Sn(η0+γ)<κ. But |ηnη0|>γ implies that Sn(η0γ)κ or Sn(η0+γ)>κ, and thus |ηnη0|γ with probability approaching 1. Thus ηnη0 in probability, and τnτ0 by the continuous mapping theorem.□

Proof of Theorem 6. This proof mirrors the proof of Lemma 5.2 in Audibert and Tsybakov [24]. It is also quite similar to the proof of Theorem 7 in Luedtke and van der Laan [13], though the proof given in that working technical report is for optimal rules without any resource constraints, and also contains several typographical errors which will be corrected in the final version.

Define Bn to be the function vQˉb,n(v)τn and B0 to be the function vQˉb,0(v)τ0. Below we omit the dependence of Bn, B0 on V in the notation and of dn, d0 on U and V. For any t>0, we have that

|R20(dn)|EPU×P0(dnd0)B0
=EPU×P0Idn/=d0B0
=EPU×P0Idn/=d0B0I(0<|B0|t)
+EPU×P0Idn/=d0B0I(|B0|>t)
EP0|BnB0|I(0<|B0|t)
+EP0|BnB0|I(|BnB0|>t)
BnB02,P0Pr0(0<|B0|t)1/2+BnB02,P02t
BnB02,P0C01/2tα/2+BnB02,P02t,

where the second inequality holds because dn/=d0 implies that |BnB0||B0| when |B0|>0, the third inequality holds by the Cauchy-Schwarz and Markov inequalities, and the C0 on the final line is the constant implied by eq. (5). The first result follows by plugging t=BnB02,P02/(2+α) into the upper bound above. We also have that

|R20(dn)|EPU×P0I(dn/=d0)|B0|
EP0I(0<|B0||BnB0|)|B0|
EP0I(0<|B0|BnB0,P0)|B0|
BnB0,P0Pr00<|B0|BnB0,P0.

By eq. (5), it follows that |R20(dn)|BnB0,P01+α.□

Published Online: 2016-5-26
Published in Print: 2016-5-1