 # Estimating Mann–Whitney-Type Causal Effects for Right-Censored Survival Outcomes

• , Chunling Liu , Shujie Ma and Min Zhang

## Abstract

Mann–Whitney-type causal effects are clinically relevant, easy to interpret, and readily applicable to a wide range of study settings. This article considers estimation of such effects when the outcome variable is a survival time subject to right censoring. We derive and discuss several methods: an outcome regression method based on a regression model for the survival outcome, an inverse probability weighting method based on models for treatment assignment and censoring, and two doubly robust methods that involve both types of models and that remain valid under correct specification of the outcome model or the other two models. The methods are compared in a simulation study and applied to an observational study of hospitalized pneumonia.

## 1 Introduction

Comparative effectiveness research frequently involves comparing one treatment (a=1) with another (a=0) with respect to a clinical outcome of interest. For a generic subject in the target population, let T(a), a{0,1}, denote the potential outcome under treatment a . The causal effect of a=1 versus a=0 is commonly assessed by comparing the means of T(1) and T(0), if they are defined. Recently, there has been growing interest in a class of Mann–Whitney-type effect measures defined as θ=E{h(Ti(1),Tj(0))}, where h(·,·) is a specified function and the subscripts i and j denote two independent subjects. A simple example of h is given by

(1)h(t1,t0)=I(t1>t0)I(t1<t0),

where I(·) is the indicator function; see, for example, Agresti [2, page 58]. Related definitions include h(t1,t0)=I(t1>t0)+0.5I(t1=t0), which forms the basis for the rank sum test  and the Mann–Whitney statistic . For an ordered categorical outcome, which does not have a mean, the Mann–Whitney-type effect based on (1) provides a natural overall effect measure. Such effect measures have been recommended for clinical trials with arbitrary (ordinal or higher level) outcomes because of their clinical relevance and interpretability [e. g., 5], [e. g., 6], [e. g., 7]. They are particularly useful in analyzing an ordinal composite outcome that combines a quantitative outcome with death and possibly treatment discontinuation due to adverse events or lack of efficacy , . As a possible drawback of such effect measures, Lumley  points out a lack of transitivity when θ based on (1) is used to order multiple treatments. On the other hand, for comparing two treatments, θ does provide unique insights that are not available from a comparison of means (if feasible at all).

In this article, we consider estimation of θ when the outcome of interest is a survival time subject to right censoring, which has not been considered in previous studies of Mann–Whitney-type causal effects. If T is a survival time, θ remains well defined and is just as meaningful as it is for another continuous outcome. In fact, the class of Mann–Whitney-type causal effects includes some well known and commonly used effect measures for survival outcomes. For example, if h(t1,t0)=I(t1t)I(t0t), then θ=F1(t)F0(t), where Fa(t)=P{T(a)t}, a=0,1. As another example, if h(t1,t0)=t1τt0τ, where ∧ denotes minimum and τ is a positive constant chosen to ensure identifiability, then θ is the difference in mean restricted survival time. A new effect measure for survival outcomes is obtained by taking h to be a truncated version of (1):

(2)h(t1,t0)=I(t1τ>t0τ)I(t1τ<t0τ),

for which θ=P{Ti(1)τ>Tj(0)τ}P{Ti(1)τ<Tj(0)τ} can be interpreted as a win-lose probability difference in comparing the restricted survival times of two randomly chosen subjects who are randomly assigned to treatments 1 and 0.

Estimation of θ for a fully observed outcome has been considered by Chen et al. , Vermeulen et al.  and Zhang et al. , with focus on adjusting for confounders in observational studies and using auxiliary information to improve efficiency in randomized clinical trials. These methods cannot be used to estimate θ for a survival outcome subject to right censoring, which represents an additional challenge. We are not aware of any existing method for estimating θ for a right-censored survival outcome, except in the aforementioned special cases where θ is a difference in distribution function , ,  or a difference in mean restricted survival time , , . In both of these special cases, h(t1,t0) is additive in the sense that it can be written as a function of t1 minus a function of t0. Here, we consider estimation of θ for a general function h, such as (2), that is not assumed to be additive. We derive and compare several methods: an outcome regression (OR) method based on a regression model for the survival outcome, an inverse probability weighting (IPW) method based on models for treatment assignment and censoring, and two doubly robust (DR) methods that involve both types of models and that remain valid under correct specification of the outcome model or the other two models. One of the DR methods is a straightforward extension of Zhang and Schaubel , and the other one is a new method based directly on the efficient influence function ,  for estimating θ.

In the next section, we describe these methods and discuss their asymptotic behavior. In Section 3, we report a simulation study and present a real application. Concluding remarks are given in Section 4. Technical details are provided in appendices.

## 2 Methodology

### 2.1 Notation and assumptions

Let A denote the actual treatment received by an individual subject in a study, which may be a randomized clinical trial or an observational study. Assuming consistency or stable unit treatment value, the actual survival time is then T=T(A)=AT(1)+(1A)T(0). Let W be a vector of relevant covariates measured at baseline (before treatment). We assume that treatment assignment is ignorable  given W in the sense that

(3)P{A=1|W,T(0),T(1)}=P(A=1|W)=:p(W).

The ignorability assumption is trivially true in a randomized clinical trial. In an observational study, the assumption requires that W contain enough information to explain any association between A and the potential outcomes. We also assume positivity:

(4)0<p(W)<1with probability 1,

which is trivially true in a clinical trial but not trivial in an observational study.

The actual survival time T may be right-censored by a censoring time C. We assume independent censoring:

(5)CT|(A,W),

where ⊥ denotes independence. The observed data for an individual subject can be represented as O=(W,A,X,Δ), where X=TC and Δ=I(TC). The observed data for the whole study will be conceptualized as independent copies of O and denoted by Oi=(Wi,Ai,Xi,Δi), i=1,,n.

Our goal is to use the observed data to estimate θ for a general (specified) function h. For identifiability, we assume that h(t1,t0) is determined by t1τ and t0τ for some τ>0 such that P(C>τ|A,W)>0 almost surely. Accordingly, we will restrict attention to the interval (0,τ] in discussions of distribution, survival and hazard functions. For theoretical reasons, we also assume that E{h(Ti(1),Tj(0))2}<, which is satisfied by all examples mentioned earlier. The efficient influence function in this estimation problem is given in Appendix A. In the rest of this section, we propose several estimators of θ, whose asymptotic properties are studied in Appendix B. Asymptotic variance formulas can be derived but may be cumbersome to use in variance estimation, depending on the specific forms of the working models involved. For ease of implementation, we use bootstrap standard errors and confidence intervals for inference.

### 2.2 Outcome regression (OR) estimator

The OR method is based on the fact that

θ=h(F1,F0)=E{h(F1(·|Wi),F0(·|Wj))},

where ij and Fa(t|W)=P{T(a)t|W} for a=0,1. Here and in the sequel, we write h(ν1,ν0)=h(t1,t0)dν1(t1)dν0(t0), h(ν1,t0)=h(t1,t0)dν1(t1) and h(t1,ν0)=h(t1,t0)dν0(t0) for any measures (ν1,ν0). Assumption (3) implies that Fa(t|W)=P(Tt|A=a,W), which, by assumptions (4) and (5), can be identified from the observed data.

To deal with the curse of dimensionality, we will assume a parametric or semiparametric model for Fa(t|W), say Fa(t|W;α), where α may be finite- or infinite-dimensional. This will be referred to as the OR model. A prominent example of a semiparametric model for Fa(t|W) is the proportional hazards model , :

λ(t|A,W;α)=λ0(t)exp(ηV),

where λ(·|A,W;α) is the conditional hazard function of T given (A,W), α=(η,λ0), λ0 is the baseline hazard function, and V is a vector-valued function of (A,W). Under this model,

Fa(t|W)=1exp{Λ(t|a,W;α)}=1exp{Λ0(t)exp(ηV)},

where Λ and Λ0 are cumulative versions of λ and λ0, respectively.

Whether Fa(t|W;α) is parametric or semiparametric, it is usually convenient to work with the corresponding hazard function λ(t|A,W;α). The likelihood for α based on the observed data may be written as

i=1nλ(Xi|Ai,Wi;α)Δiexp{Λ(Xi|Ai,Wi;α)}.

At least for parametric models and the proportional hazards model, α can be estimated by maximizing the above likelihood. Let αˆ be an estimate of α, which may be obtained by maximum likelihood or other means. Then θ can be estimated by

θˆor=1n(n1)ijh(F1(·|Wi;αˆ),F0(·|Wj;αˆ)).

If the model Fa(t|W;α) is correct and αˆ is consistent for α (in a suitable sense), then θˆor is consistent for θ under mild regularity conditions. In Appendix B, we show that n(θˆorθ) converges to a zero-mean normal distribution under general conditions. A key condition we assume is that αˆ is n-consistent and asymptotically linear in a suitable sense.

### 2.3 Inverse probability weighted (IPW) estimator

It follows from assumptions (3)–(5) and a conditioning argument that

θ=EAi(1Aj)ΔiΔjh(Xi,Xj)p(Wi){1p(Wj)}Sc(Xi|Ai,Wi)Sc(Xj|Aj,Wj)(ij),

where Sc(t|A,W)=P(C>t|A,W). If Sc(t|A,W) and p(W) are known, then the above identity suggests that θ can be estimated by

1n(n1)ijAi(1Aj)ΔiΔjh(Xi,Xj)p(Wi){1p(Wj)}Sc(Xi|Ai,Wi)Sc(Xj|Aj,Wj).

In reality, although p(W) is known by design in a clinical trial, Sc(t|A,W) is generally unknown and must be estimated. We assume a model for Sc(t|A,W), say Sc(t|A,W;β), which may be parametric or semiparametric so β may be finite- or infinite-dimensional. The model Sc(t|A,W;β) may be specified using the same considerations for specifying Fa(t|W;α) except that we are now modeling the censoring time instead of the survival time. Let βˆ be an estimate of β, which may be obtained by maximizing the likelihood:

i=1nλc(Xi|Ai,Wi;β)1Δiexp{Λc(Xi|Ai,Wi;β)},

where λc(·|A,W;β) is the conditional hazard function of C given (A,W) under the model Sc(t|A,W;β), and Λc(·|A,W;β) is the cumulative version of λc(·|A,W;β).

In an observational study, we also need to estimate p(W). In fact, even if p(W) is known, using an estimate of p(W) instead of the known value usually leads to better efficiency in the IPW approach . Let p(W) be modeled as p(W;γ), where γ is a finite- or infinite-dimensional parameter. Because A is binary, a typical choice for p(W;γ) would be a logistic regression model. Let γˆ be an estimate of γ, which may be obtained by maximizing the likelihood:

i=1np(Wi;γ)Ai{1p(Wi;γ)}1Ai.

Once βˆ and γˆ are obtained, the IPW estimator of θ is readily available as a weighted average:

θˆipw=ijAi(1Aj)ΔiΔjh(Xi,Xj)p(Wi;γˆ){1p(Wj;γˆ)}Sc(Xi|Ai,Wi;βˆ)Sc(Xj|Aj,Wj;βˆ)ijAi(1Aj)ΔiΔjp(Wi;γˆ){1p(Wj;γˆ)}Sc(Xi|Ai,Wi;βˆ)Sc(Xj|Aj,Wj;βˆ).

The denominator here serves to normalize the inverse probability weights. If the models Sc(t|A,W;β) and p(W;γ) are correct and (βˆ,γˆ) estimate (β,γ) consistently, then θˆipw is consistent for θ and asymptotically linear under appropriate conditions (see Appendix B).

### 2.4 First doubly robust (DR1) estimator

The DR1 method is adapted from the DR method of Zhang and Schaubel  for estimating the difference in mean restricted survival time. As an important by-product, Zhang and Schaubel  propose a DR estimator of Fa(t), which can be described as follows. Let us define

Ni(t)=ΔiI(Xit),Yi(t)=I(Xit),Λˆai(t)=Λ(t|a,Wi;αˆ),Λˆaic(t)=Λc(t|a,Wi;βˆ),Mˆaic(t)=I(Ai=a){(1Δi)I(Xit)Λˆaic(Xit)},Gˆai(t)=10teΛˆai(u)+Λˆaic(u)dMˆaic(u),ωai=I(Ai=a)p(Wi;γˆ)a{1p(Wi;γˆ)}a1.

Then the Zhang-Schaubel estimator of Λa(t), the cumulative hazard function of T(a), is given by

and the corresponding estimator of Fa(t) is Fˆadr(t)=1exp{Λˆadr(t)}. The superscript “dr” in these estimators indicates that the estimators are doubly robust in the sense of being consistent and asymptotically normal if (i) Fa(t|W;α) is correctly specified, (ii) Sc(t|A,W;β) and p(W;γ) are both correctly specified, or (iii) both (i) and (ii) hold.

Now θ can be estimated by θˆdr1=h(Fˆ1dr,Fˆ0dr), and it is easy to see that θˆdr1 is doubly robust in the same sense described above. Because the Fˆadr (a=0,1) are not guaranteed to be probability measures, θˆdr1 is not guaranteed to lie in the range of h. When θˆdr1 falls outside the range of h, it can be truncated into the range of h without changing its consistency or asymptotic normality. It is not immediately clear whether θˆdr1 is locally efficient, that is, whether it attains the nonparametric information bound when all three working models are correct. This is the main motivation for our developing a second DR estimator of θ based directly on the efficient influence function for estimating θ.

### 2.5 Second doubly robust (DR2) estimator

In Appendix A, we show that the efficient influence function for estimating θ is

(6)ϕeff(O)=AΔh(X,F0)p(W)Sc(X|A,W)+(1A)Δh(F1,X){1p(W)}Sc(X|A,W)2θ+{Ap(W)}{h(F1,F0(·|W))1p(W)h(F1(·|W),F0)p(W)}+Ap(W)1(t,F0|W)Sc(t|A,W)dMc(t)+1A1p(W)0(F1,t|W)Sc(t|A,W)dMc(t),

where 1(t,F0|W)=E{h(T,F0)|A=1,W,Tt}, 0(F1,t|W)=E{h(F1,T)|A=0,W,Tt}, and Mc(t)=(1Δ)I(Xt)Λc(Xt|A,W). Motivated by this result, we propose to estimate θ by setting the sample average of an empirical version of ϕeff(O) to 0. The resulting estimator is

θˆdr2=12ni=1n[AiΔih(Xi,Fˆ0dr)p(Wi;γˆ)Sc(Xi|Ai,Wi;βˆ)+(1Ai)Δih(Fˆ1dr,Xi){1p(Wi;γˆ)}Sc(Xi|Ai,Wi;βˆ)+{Aip(Wi;γˆ)}{h(Fˆ1dr,F0(·|Wi;αˆ))1p(Wi;γˆ)h(F1(·|Wi;αˆ),Fˆ0dr)p(Wi;γˆ)}+Aip(Wi;γˆ)1(t,Fˆ0dr|Wi;αˆ)Sc(t|Ai,Wi;βˆ)dMˆ1ic(t)+1Ai1p(Wi;γˆ)0(Fˆ1dr,t|Wi;αˆ)Sc(t|Ai,Wi;βˆ)dMˆ0ic(t)],

where

1(t,Fˆ0dr|Wi;αˆ)=th(u,Fˆ0dr)dF1(u|Wi;αˆ)1F1(t|Wi;αˆ),0(Fˆ1dr,t|Wi;αˆ)=th(Fˆ0dr,u)dF0(u|Wi;αˆ)1F0(t|Wi;αˆ).

The use of Fˆadr in θˆdr2 to replace Fa in (6) is an attempt to provide maximal protection against model misspecification. In Appendix B, we show that θˆdr2 is indeed doubly robust and, furthermore, locally efficient. Like θˆdr1, θˆdr2 may fall outside the range of h but can be truncated back into the range of h.

## 3 Numerical results

### 3.1 Simulation

Here we report a simulation study of the finite-sample performance of the methods described in Section 2. Data are generated according to the following mechanism:

W=(W1,W2)N(0,I),p(W)=expit(W1W2),λ(t|A,W)=exp(ηAAW1+W2),λc(t|A,W)=exp(W1+W2),

where 0=(0,0), I is the 2-by-2 identity matrix, expit(x)=1/{1+exp(x)}, and ηA is either 0 (no treatment effect) or −0.5 (protective treatment effect). The function h in the definition of θ is take to be (2) with τ=2. The true value of θ is 0 when ηA=0 and approximately 0.15 when ηA=0.5. We consider two sample sizes: n=200,500. In each case, 1,000 replicate samples are generated.

Each simulated sample is analyzed using the four methods (OR, IPW, DR1 and DR2) described in Section 2 as well as a naive method that ignores censoring and possible confounding and estimates θ with the U-statistic

θˆnv=1n0n1i=1nj=1nAi(1Aj)h(Xi,Xj),

where n1=i=1nAi and n0=nn1. The OR, IPW, DR1 and DR2 methods are implemented with correct and incorrect working models. The correct models are:

λ(t|A=a,W;αa)=λa0(t)exp((W1,W2)ηa),λc(t|A=a,W;βa)=λa0c(t)exp((W1,W2)ηac),p(W;γ)=expit((1,W1,W2)γ),

where αa=(ηa,λa0), βa=(ηac,λa0c), and λa and λac are unspecified baseline hazard functions. The incorrect models result from replacing (W1,W2) with (I(W1>0),I(W2>0)) in the correct models. All models are estimated using the maximum likelihood approach.

Table 1

Simulation results: empirical bias and standard deviation (SD) for the naive, OR, IPW, DR1 and DR2 estimators with correct and incorrect working models (see Section 3.1 for details).

 Method Models θ=0 θ≈0.15 (T|A,W) (C|A,W) (A|W) Bias SD Bias SD n=200 naive 0.255 0.077 0.173 0.078 OR correct 0.001 0.080 −0.011 0.087 OR incorrect 0.157 0.097 0.141 0.103 IPW correct correct 0.048 0.174 0.014 0.177 IPW incorrect incorrect 0.180 0.190 0.142 0.197 DR1 correct correct correct 0.008 0.138 0.001 0.146 DR1 correct incorrect incorrect −0.004 0.100 −0.013 0.106 DR1 incorrect correct correct 0.017 0.143 0.010 0.136 DR1 incorrect incorrect incorrect 0.155 0.120 0.146 0.112 DR2 correct correct correct 0.008 0.118 −0.045 0.116 DR2 correct incorrect incorrect −0.018 0.105 −0.056 0.117 DR2 incorrect correct correct 0.014 0.120 −0.038 0.108 DR2 incorrect incorrect incorrect 0.138 0.122 0.096 0.121 n=500 naive 0.256 0.050 0.172 0.048 OR correct 0.003 0.051 −0.009 0.055 OR incorrect 0.162 0.061 0.147 0.064 IPW correct correct 0.044 0.141 0.011 0.147 IPW incorrect incorrect 0.163 0.159 0.137 0.173 DR1 correct correct correct 0.014 0.082 0.002 0.104 DR1 correct incorrect incorrect −0.008 0.068 −0.019 0.077 DR1 incorrect correct correct 0.018 0.077 0.007 0.088 DR1 incorrect incorrect incorrect 0.157 0.080 0.146 0.078 DR2 correct correct correct 0.017 0.077 −0.032 0.107 DR2 correct incorrect incorrect −0.020 0.099 −0.046 0.140 DR2 incorrect correct correct 0.020 0.078 −0.028 0.091 DR2 incorrect incorrect incorrect 0.146 0.110 0.117 0.141

Table 1 shows a summary of the simulation results: empirical bias and standard deviation for estimating θ. As expected, the naive method is severely biased. The OR estimator becomes biased when the model λ(t|A,W;α) is misspecified, as does the IPW estimator when the models λc(t|A,W;β) and p(W;γ) are misspecified. In contrast, the two DR estimators are nearly unbiased unless all models are misspecified, demonstrating double robustness. Regardless of model (in)correctness, the estimators that adjust for confounding and censoring generally follow the pattern OR<DR<IPW in terms of variability. The efficiency comparison of the two DR estimators does not seem to follow a clear pattern. At n=200, DR2 appears more efficient than DR1 when all models are correct, but this difference appears to diminish with increasing sample size.

### 3.2 Application

We now apply the methods compared in Section 3.1 to a study of hospitalized pneumonia in young children, described in Section 1.13 of Klein and Moeschberger . The study is based on 3,470 newborn children from the National Longitudinal Survey of Youth . The research question is whether the mother’s feeding choice (breast feeding or not) protects the infant against hospitalized pneumonia in the first year of life. The outcome variable of interest is the time to hospitalized pneumonia, which is restricted to one year and possibly censored before one year. The available baseline covariates are infant race (black, white or other), indicators of normal birthweight (at least 5.5 pounds) and having at least one sibling, and some maternal and family characteristics: age, years of schooling, alcohol use, cigarette use, region of the country (Northeast, Northcentral, South and West), poverty, and urban environment.

We are interested in the causal effect of breast feeding on hospitalized pneumonia as measured by θ with h defined by (2). This effect measure can be interpreted as a win-lose probability difference in a hypothetical comparison of the restricted times to hospitalized pneumonia of two randomly chosen infants, who are randomly assigned to breast feeding and no breast feeding as in a clinical trial. The propensity score model in our analysis is a logistic regression model based on all infant characteristics as well as maternal age, cigarette use, and years of schooling. The OR model is specified as a proportional hazards model based on infant characteristics, maternal cigarette use, and urban environment, with separate parameters for each treatment group. The model for censoring is also a separate proportional hazards model for each treatment group, with infant race and maternal age as covariates.

Table 2

Analysis of pneumonia data: point estimates (PE) of θ and bootstrap standard errors (SE) obtained using different methods (see Section 3.2 for details).

 Method PE SE naive −0.0098 0.0169 OR −0.0020 0.0015 IPW 0.1373 0.1902 DR1 −0.0028 0.0018 DR2 0.0076 0.0023

Table 2 shows the point estimates of θ obtained using the five methods compared in Section 3.1 together with nonparametric bootstrap standard errors based on 1,000 bootstrap samples. The IPW method clearly stands out with a large positive point estimate and a large standard error, both of which are likely due to the large variability of the IPW estimator. Considering the large standard error, the IPW estimate of θ may well be due to random variation and does not establish a positive effect of breast feeding. The naive method, on the other hand, has the smallest (i. e., most negative) point estimate and also a relatively large standard error. The results are somewhat similar for the other three methods (OR, DR1 and DR2). DR2 is the only method that reaches statistical significance (with or without adjusting for multiplicity); however, the estimated effect is rather small in magnitude under the DR2 method. Taken together, the point estimates and standard errors in Table 2 provide no clear evidence for a substantial effect of breast feeding on hospitalized pneumonia.

## 4 Discussion

Mann–Whitney-type causal effects are clinically relevant, easy to interpret, and readily applicable to a wide range of study settings. This article considers estimation of such effects when the outcome variable is a survival time subject to right censoring. We have derived four methods for doing this and compared them theoretically and empirically. Among these methods, the OR and DR methods have unique advantages, and their suitability to a given application will depend on the available information and the relative importance of bias versus efficiency. If one is primarily concerned about bias, then the DR methods may be preferable. The OR method is more efficient when the OR model is correctly specified. No clear advantage has been observed for the IPW method.

The relationship between the two DR estimators remains an open question. The DR1 estimator adapted from Zhang and Schaubel  is known to be doubly robust. The proposed DR2 estimator, based directly on the efficient influence function for estimating θ, is doubly robust and locally efficient. In our simulation study, the two estimators appear to perform similarly in large samples when all working models are correctly specified, which suggests that the DR1 estimator might be locally efficient as well. Further research is warranted to clarify how the two DR estimators might relate to each other.

All of these methods assume that treatment assignment is ignorable in the sense of (3). The ignorability assumption cannot be validated with observed data and must be based on background knowledge. If there is insufficient background knowledge to justify the ignorability assumption, it is important to assess the robustness of study results in the presence of unmeasured confounders. Methods for conducting such a sensitivity analysis have been developed for some causal effects [e. g., 28], [e. g., 29] but not for the Mann–Whitney-type effects considered here. Further research is needed to close this methodological gap.

Award Identifier / Grant number: G-YBCU

Funding statement: Liu’s research was supported by Hong Kong Polytechnic University (grant # G-YBCU) and General Research Funding from Research Grants Council of Hong Kong (grant # 15327216).

## Acknowledgment

We thank the Associate Editor and the Reviewer for insightful and constructive comments on an earlier version of this paper. A part of this research was conducted while the first author was visiting the Hong Kong Polytechnic University.

## Appendix A Semiparametric theory

Here we derive the efficient influence function for estimating θ using the monotone coarsening argument of Tsiatis . Recall that the observed data for an individual subject is O=(W,A,X,Δ), which is considered a coarsened version of the “full data” Z=(W,A,T). Unless otherwise stated, the distributions of Z and O are restricted by assumptions (3)–(5) only. The full-data Hilbert space HF is the space of functions of Z with zero mean and finite variance, equipped with the covariance inner product. The observed-data Hilbert space H is the space of functions of O with zero mean and finite variance, equipped with the covariance inner product.

First, we characterize the influence function of any regular, asymptotically linear (RAL) estimator of θ based on the full data. For argument’s sake, we assume for a moment that p(W)=P(A=1|W) is known, as in a randomized clinical trial. In this special case, a nonparametric RAL estimator of θ based on the full data is given by

1n(n1)ijAi(1Aj)h(Ti,Tj)p(Wi){1p(Wj)}.

Using the theory of U-statistics [e. g., 30, Chapter 12], it is easy to see that the above estimator is RAL with influence function

ϕ0F(Z)=Ah(T,F0)p(W)+(1A)h(F1,T)1p(W)2θ.

With p(W) known, the orthogonal complement of the full-data tangent space is

Ψ1={Ap(W)}b(W):E{b(W)2}<,

and the influence function of any full-data RAL estimator of θ takes the form

ϕbF(Z)=ϕ0F(Z)+{Ap(W)}b(W)

for some function b such that E{b(W)2}<. Now we remove the assumption that p(W) is known. An RAL estimator of θ in this larger model is certainly an RAL estimator of θ in the smaller model with p(W) known. Therefore, the influence function of an RAL estimator of θ in the larger model must also belong to ϕ0F(Z)+Ψ1.

Next, we characterize the influence function of any RAL estimator of θ based on the observed data. According to Tsiatis [21, Chapter 9], such an influence function can be represented as ϕb(O)+ψ(O), where ϕb(O) satisfies

(A.1)E{ϕb(O)|Z}=ϕbF(Z)

and ψ(O) is an element of Ψ2, the augmentation space for censoring. It is easy to verify that equation (A.1) is satisfied by ϕb(O)=ϕ0(O)+{Ap(W)}b(W), where

ϕ0(O)=AΔh(X,F0)p(W)Sc(X|A,W)+(1A)Δh(F1,X){1p(W)}Sc(X|A,W)2θ.

As shown in Tsiatis [21, page 217], the augmentation space Ψ2 consists of martingale integrals of the form

q(t|A,W)Sc(t|A,W)dMc(t),

where q is an arbitrary function, Mc(t)=(1Δ)I(Xt)Λc(Xt|A,W), and Sc(·|A,W) and Λc(·|A,W) are, respectively, the conditional survival and cumulative hazard functions of C given (A,W). Thus, the influence function of any RAL estimator of θ based on the observed data must be an element of ϕ0(O)+Ψ1+Ψ2, which can be written as

ϕb,q(O)=ϕ0(O)+{Ap(W)}b(W)+q(t|A,W)Sc(t|A,W)dMc(t)

for some functions b and q.

To find the efficient influence function, the influence function with the smallest possible variance, we need to minimize the variance of ϕb,q(O) with respect to b and q, which amounts to projecting ϕ0(O) into the orthogonal complement of Ψ1+Ψ2. To this end, we note that Ψ1 and Ψ2 are orthogonal to each other, so it suffices to project ϕ0(O) into Ψ1 and Ψ2 separately. The projection of ϕ0(O) into Ψ1 is easily seen to be

Π(ϕ0(O)|Ψ1)=E(ϕ0(O)|A,W)E(ϕ0(O)|W)={Ap(W)}{h(F1(·|W),F0)p(W)h(F1,F0(·|W))1p(W)}.

Section 10.4 of Tsiatis  indicates that the projection of ϕ0(O) into Ψ2 is

Π(ϕ0(O)|Ψ2)=Ap(W)E{h(T,F0)|A,W,Tt}Sc(t|A,W)dMc(t)1A1p(W)E{h(F1,T)|A,W,Tt}Sc(t|A,W)dMc(t).

The efficient influence function is therefore

ϕeff(O)=ϕ0(O)Π(ϕ0(O)|Ψ1)Π(ϕ0(O)|Ψ2),

which is equivalent to equation (6) in the main paper.

## Appendix B Asymptotic theory

Standard regularity conditions in the M-estimation theory [e. g., 30, Chapter 5] are assumed. These include identifiability and smoothness (in parameters) of working models, existence of integrable envelopes that permit use of the dominated convergence theorem, and certain Donsker properties that help deal with random functions. Techniques for verifying the Donsker property can be found in van der Vaart and Wellner .

Let P0 denote the true distribution of O=(W,A,X,Δ), and let Pn denote the empirical distribution of Oi=(Wi,Ai,Xi,Δi), i=1,,n. Write Gn=n(PnP0) for the empirical processes. For any measures (ν1,ν0), we write

h(ν1,y0)=h(y1,y0)dν1(y1),h(y1,ν0)=h(y1,y0)dν0(y0),h(ν1,ν0)=h(y1,y0)dν1(y1)dν0(y0).

### B.1 Asymptotics for θˆor

Here we assume that the model λ(t|A,W;α) is correct and that α and αˆ take values in a suitable Banach space with

n(αˆα)=Gnϕα(O)+op(1).

For any a in the space for α, we define

kor(Wi,Wj;a)={h(F1(·|Wi;a),F0(·|Wj;a))+h(F1(·|Wj;a),F0(·|Wi;a))}/2,Uor(a)=2n(n1)i<jkor(Wi,Wj;a),uor(a)=E{Uor(a)}=E{kor(Wi,Wj;a)},ij.

Then θˆor=Uor(αˆ), θ=uor(α), and

(B.1)n(θˆorθ)=n{Uor(αˆ)uor(α)}=n{Uor(αˆ)uor(αˆ)}+n{uor(αˆ)uor(α)}.

It follows from the theory of U-statistics [e. g., 30, Theorem 12.3] that, for any fixed a,

(B.2)n{Uor(a)uor(a)}=Gn{2kor(W;a)2θ}+op(1),

where kor(w;a)=E{kor(w,W;a)}. It can be argued as in Nolan and Pollard [32, proof of Theorem 5] that the op(1) term in (B.2) is uniformly negligible for a in a neighborhood of α. This, together with Theorem 19.24 of van der Vaart , implies that

(B.3)n{Uor(αˆ)uor(αˆ)}=Gn{2kor(W;αˆ)2θ}+op(1)=Gn{2kor(W;α)2θ}+op(1).

We assume that the map auor(a) is differentiable at α with derivative Dαor. By the delta method,

(B.4)n{uor(αˆ)uor(α)}=DαorGnϕα(O)+op(1).

Substituting (B.3) and (B.4) into (B.1) yields

n(θˆorθ)=Gn2kor(W;α)2θ+Dαorϕα(O)+op(1).

### B.2 Asymptotics for θˆipw

Here we assume that the models λc(t|W;β) and p(W;γ) are correct and that β and γ take values in suitable Banach spaces with

n(βˆβ)=Gnϕβ(O)+op(1),n(γˆγ)=Gnϕγ(O)+op(1).

Let us define

k1(Oi,Oj;b,g)=Ai(1Aj)ΔiΔjh(Xi,Xj)p(Wi;g){1p(Wj;g)}Sc(Xi|Ai,Wi;b)Sc(Xj|Aj,Wj;b),k2(Oi,Oj;b,g)=Ai(1Aj)ΔiΔjp(Wi;g){1p(Wj;g)}Sc(Xi|Ai,Wi;b)Sc(Xj|Aj,Wj;b),

and, for r=1,2,

kr(Oi,Oj;b,g)={kr(Oi,Oj;b,g)+kr(Oj,Oi;b,g)}/2,Ur(b,g)=2n(n1)i<jkr(Oi,Oj;b,g),ur(b,g)=E{Ur(b,g)}=E{kr(Oi,Oj;b,g)},ij.

Then θˆipw=U1(βˆ,γˆ)/U2(βˆ,γˆ), u1(β,γ)=θ, and u2(β,γ)=1. For r=1,2, it can be argued as before that

n{Ur(βˆ,γˆ)ur(β,γ)}=n{Ur(βˆ,γˆ)ur(βˆ,γˆ)}+n{ur(βˆ,γˆ)ur(β,γ)}=n{Ur(β,γ)ur(β,γ)}+n{ur(βˆ,γˆ)ur(β,γ)}+op(1)=Gnkr(O;β,γ)+n{ur(βˆ,γˆ)ur(β,γ)}+op(1),

where kr(o;b,g)=E{kr(o,O;b,g)}. It is straightforward to see that

k1(O;β,γ)=ΔSc(X|A,W;β){Ah(X,F0)p(W;γ)+(1A)h(F1,X)1p(W;γ)},k2(O;β,γ)=ΔSc(X|A,W;β){Ap(W;γ)+1A1p(W;γ)}.

For r=1,2, we assume that the map (b,g)ur(b,g) is differentiable at (β,γ) with derivative Dripw=(Dr,βipw,Dr,γipw), and use the delta method to deduce that

n{ur(βˆ,γˆ)ur(β,γ)}=Dr,βipwGnϕβ(O)+Dr,γipwGnϕγ(O)+op(1),

which implies that

n{Ur(βˆ,γˆ)ur(β,γ)}=Gn{kr(O;β,γ)+Dr,βipwϕβ(O)+Dr,γipwϕγ(O)}+op(1).

Applying the delta method once again to θˆipw=U1(βˆ,γˆ)/U2(βˆ,γˆ), we see that

n(θˆipwθ)=Gn[k1(O;β,γ)θk2(O;β,γ)+(D1,βipwθD2,βipw)ϕβ(O)+(D1,γipwθD2,γipw)ϕγ(O)]+op(1).

### B.3 Asymptotics for θˆdr1

Here we outline a general approach to the analysis of θˆdr1=h(Fˆ1dr,Fˆ0dr) using the functional delta method. We identify F1, F0 and their estimates as elements of BV, the space of real-valued functions of bounded variation equipped with the total variation norm. Assuming that h(t1,t0) is bounded, we have

h(F1+δF1,F0+δF0)h(F1,F0)=h(δF1,F0)+h(F1,δF0)+h(δF1,δF0)=h(δF1,F0)+h(F1,δF0)+o(δF1+δF0).

Therefore, the mapping of (ν1,ν0)BV2 to h(ν1,ν0)R is Frechet-differentiable at (F1,F0) with derivative

(δF1,δF0)h(δF1,F0)+h(F1,δF0).

For specific working models, Zhang and Schaubel  show that Fˆadr(t) (a=0,1) converges in probability to Fa(t) uniformly in t(0,τ] if (i) λ(t|A,W;α) is correctly specified, (ii) λc(t|A,W;β) and p(W;γ) are both correctly specified, or (iii) both (i) and (ii) hold. For general working models that satisfy (i) or (ii), we assume that n(FˆadrFa) converges weakly to a random element in BV with

for some influence function ϕa(O,t). This is generally true for parametric working models. With λ(t|A,W;α) and λc(t|A,W;β) specified as semiparametric proportional hazards models, an explicit expression for ϕa(O,t) is given in Zhang and Schaubel [19, Web-Based Supplementary Materials]. Now it follows from Theorem 20.8 of van der Vaart  and simple algebra that

n(θˆθ)=Gn{h(ϕ1(O,·),F0)+h(F1,ϕ0(O,·))}+op(1).

### B.4 Asymptotics for θˆdr2

Here we assume that (i) λ(t|A,W;α) is correctly specified, (ii) λc(t|A,W;β) and p(W;γ) are both correctly specified, or (iii) both (i) and (ii) hold. With possibly misspecified models, we assume that (αˆ,βˆ,γˆ) converges in probability to some (α,β,γ) with

n(αˆα)=Gnϕα(O)+op(1),n(βˆβ)=Gnϕβ(O)+op(1),n(γˆγ)=Gnϕγ(O)+op(1).

We first demonstrate the consistency of θˆdr2 under condition (i) or (ii). Write Fˆadr=Fadr(·;αˆ,βˆ,γˆ). Under mild regularity conditions, θˆdr2 converges in probability to ϑ(α,β,γ), where

ϑ(a,b,g)=12E[AΔh(X,F0dr(·;a,b,g))p(W;g)Sc(X|A,W;b)+(1A)Δh(F1dr(·;a,b,g),X){1p(W;g)}Sc(X|A,W;b)+{Ap(W;g)}{h(F1dr(·;a,b,g),F0(·|W;a))1p(W;g)h(F1(·|W;a),F0)p(W;g)}+Ap(W;g)1(t,F0dr(·;a,b,g)|W;a)Sc(t|A,W;b)dMc(t;b)+1A1p(W;g)0(F1dr(·;a,b,g),t|W;a)Sc(t|A,W;b)dMc(t;b)],

with Mc(t;b)=(1Δ)I(Xt)Λc(Xt|A,W;b). Zhang and Schaubel  have shown that Fadr(·;α,β,γ)=Fa under condition (i) or (ii). Under condition (ii), (β,γ)=(β,γ), p(W;γ)=p(W), Sc(X|A,W;β)=Sc(X|A,W), Mc(t;β)=Mc(t), and θˆdr2 converges to

(B.5)ϑ(α,β,γ)=12E[AΔh(X,F0)p(W)Sc(X|A,W)+(1A)Δh(F1,X){1p(W)}Sc(X|A,W)]
(B.6)+12E[{Ap(W)}{h(F1,F0(·|W;α))1p(W)}]
(B.7)12E[{Ap(W)}{h(F1(·|W;α),F0)p(W)}]
(B.8)+12E[Ap(W)1(t,F0|W;α)Sc(t|A,W)dMc(t)]
(B.9)+12E[1A1p(W)0(F1,t|W;α)Sc(t|A,W)dMc(t)],
It follows from a conditioning argument that term (B.5) is equal to θ. Terms (B.6) and (B.7) are 0 because Ap(W) times any function of W alone has mean 0. Terms (B.8) and (B.9) are 0 because Mc is a martingale. Thus, θˆdr2 is consistent under condition (ii). Next, suppose condition (i) holds, so that α=α, Fa(·|W;α)=Fa(·|W) (a=0,1), 1(t,F0|W;α)=1(t,F0|W), 0(F1,t|W;α)=0(F1,t|W), and θˆdr2 converges to

(B.10)ϑ(α,β,γ)=12E[AΔh(X,F0)p(W;γ)Sc(X|A,W;β)+(1A)Δh(F1,X){1p(W;γ)}Sc(X|A,W;β)Ah(F1(·|W),F0)p(W;γ)(1A)h(F1,F0(·|W))1p(W;γ)+h(F1(·|W),F0)+h(F1,F0(·|W))+Ap(W;γ)1(t,F0|W)Sc(t|A,W;β)dMc(t;β)+1A1p(W;γ)0(F1,t|W)Sc(t|A,W;β)dMc(t;β)].

Note that

(B.11)E{h(F1(·|W),F0)}=E{h(F1,F0(·|W))}=θ.

It follows from Tsiatis [21, Section 9.3 and Lemma 10.4] that

(B.12)ΔSc(X|A,W;β)=1dMc(t;β)Sc(t|A,W;β).

Substituting (B.11) and (B.12) into (B.10) leads to

(B.13)ϑ(α,β,γ)θ=12E[A{h(X,F0)h(F1(·|W),F0)}p(W;γ)]
(B.14)+12E[(1A){h(F1,X)h(F1,F0(·|W))}1p(W;γ)]
(B.15)+12E[Ap(W;γ)1(t,F0|W)h(X,F0)Sc(t|A,W;β)dMc(t;β)]
(B.16)+12E[1A1p(W;γ)0(F1,t|W)h(F1,X)Sc(t|A,W;β)dMc(t;β)].
It follows from a conditioning argument that terms (B.13) and (B.14) are 0. Terms (B.15) and (B.16) can be shown to be 0 using the arguments of Zhang and Schaubel [19, Web-Based Supplementary Materials]. Thus, θˆdr2 is consistent under condition (i).

Next, we show that θˆdr2 is asymptotically normal. Define

φ(O;a,b,g)=12[AΔh(X,F0dr(·;a,b,g))p(W;g)Sc(X|A,W;b)+(1A)Δh(F1dr(·;a,b,g),X){1p(W;g)}Sc(X|A,W;b)+{Ap(W;g)}{h(F1dr(·;a,b,g),F0(·|W;a))1p(W;g)h(F1(·|W;a),F0dr(·;a,b,g))p(W;g)}+Ap(W;g)1(t,F0|W;a)Sc(t|A,W;b)dMc(t;b)+1A1p(W;g)0(F1,t|W;a)Sc(t|A,W;b)dMc(t;b)].

Then θˆdr2=Pnφ(O;αˆ,βˆ,γˆ), θ=P0φ(O;α,β,γ), and it can be argued as before that

n(θˆdr2θ)=Gnφ(O;αˆ,βˆ,γˆ)+n{P0φ(O;αˆ,βˆ,γˆ)P0φ(O;α,β,γ)}=Gnφ(O;α,β,γ)+n{ϑ(αˆ,βˆ,γˆ)ϑ(α,β,γ)}+op(1).

We assume that ϑ is differentiable at (α,β,γ) with derivative Ddr2=(Dαdr2,Dβdr2,Dγdr2), which implies that

n{ϑ(αˆ,βˆ,γˆ)ϑ(α,β,γ)}=Ddr2n(αˆα,βˆβ,γˆγ)+op(1)=Gn{Dαdr2ϕα(O)+Dβdr2ϕβ(O)+Dγdr2ϕγ(O)}+op(1).

It follows that

(B.17)n(θˆdr2θ)=Gn{φ(O;α,β,γ)+Dαdr2ϕα(O)+Dβdr2ϕβ(O)+Dγdr2ϕγ(O)}+op(1).

Finally, we show that θˆdr2 is locally efficient, that is, the influence function of θˆdr2 is equal to ϕeff(O)=φ(O;α,β,γ) when all three models are correct. We do so by showing that Ddr2=0 under conditions (i) and (ii). Under condition (ii), ϑ(a,β,γ)=θ for any a, and its partial derivative, Dαdr2, must be zero. Similarly, we can show that Dβdr2 and Dγdr2 are both zero under condition (i). Therefore, when all three models are correct, Ddr2=0 and

n(θˆdr2θ)=Gnφ(O;α,β,γ)+op(1)=Gnϕeff(O)+op(1).

## References

1. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.10.1037/h0037350Search in Google Scholar

2. Agresti A. Categorical data analysis. 3rd ed. Hoboken, NJ: John Wiley and Sons; 2013.Search in Google Scholar

3. Wilcoxon F. Individual comparisons by ranking methods. Biometrics. 1945;1:80–3.10.1007/978-1-4612-4380-9_16Search in Google Scholar

4. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18:50–60.10.1214/aoms/1177730491Search in Google Scholar

5. Acion L, Peterson JJ, Temple S, Arndt S. Probabilistic index: an intuitive non-parametric approach to measuring the size of treatment effects. Stat Med. 2006;25:591–602.10.1002/sim.2256Search in Google Scholar PubMed

6. Brumback LC, Pepe MS, Alonzo TA. Using the ROC curve for gauging treatment effect in clinical trials. Stat Med. 2006;25:575–90.10.1002/sim.2345Search in Google Scholar PubMed

7. Newcombe RG. Confidence intervals for an effect size measure based on the Mann–Whitney statistic. Part 1: General issues and tail-area-based methods. Stat Med. 2006;25:543–57.10.1002/sim.2323Search in Google Scholar PubMed

8. Wang C, Scharfstein DO, Colantuoni E, Girard TD, Yan Y. Inference in randomized trials with death and missingness. Biometrics. 2017;73:431–40.10.1111/biom.12594Search in Google Scholar PubMed PubMed Central

9. Liu W, Zhang Z, Nie L, Soon G. A case study in personalized medicine: rilpivirine versus efavirenz for treatment-naive HIV patients. J Am Stat Assoc. 2017;112:1381–92.10.1080/01621459.2017.1280404Search in Google Scholar

10. Lumley T. Good, better, worst: what do rank tests really test? Presented at Canterbury Statistics Day, 2012. Available online at http://www.math.canterbury.ac.nz/canterbury-tails/download/68/Lumley—Plant-and-Food.pdf. 2012.Search in Google Scholar

11. Chen SX, Qin J, Tang CY. Mann-Whitney test with adjustments to pretreatment variables for missing values and observational study. J R Stat Soc, Ser B, Stat Methodol. 2013;75:81–102.10.1111/j.1467-9868.2012.01036.xSearch in Google Scholar

12. Vermeulen K, Thas O, Vansteelandt S. Increasing the power of the Mann–Whitney test in randomized experiments through flexible covariate adjustment. Stat Med. 2015;34:1012–30.10.1002/sim.6386Search in Google Scholar PubMed

13. Zhang Z, Ma S, Shen C, Liu C. Estimating Mann–Whitney-type causal effects. 2018. Under review.10.1515/jci-2018-0010Search in Google Scholar

14. Hubbard A, van der Laan MJ, Robins JM. Nonparametric locally efficient estimation of the treatment specific survival distribution with right censored data and covariates in observational studies. In: Halloran E, Berry D, editors. Statistical models in epidemiology: the environment and clinical trials. New York: Springer; 1999. p. 135–78.10.1007/978-1-4612-1284-3_3Search in Google Scholar

15. Zhang M, Schaubel DE. Contrasting treatment-specific survival using double-robust estimators. Stat Med. 2012;31:4255–68.10.1002/sim.5511Search in Google Scholar PubMed PubMed Central

16. Zhang M. Robust methods to improve efficiency and reduce bias in estimating survival curves in randomized clinical trials. Lifetime Data Anal. 2015;21:119–37.10.1007/s10985-014-9291-ySearch in Google Scholar PubMed

17. Chen P, Tsiatis AA. Causal inference on the difference of the restricted mean life between two groups. Biometrics. 2001;57:1030–8.10.1111/j.0006-341X.2001.01030.xSearch in Google Scholar

18. Zhang M, Schaubel DE. Estimating differences in restricted mean lifetime using observational data subject to dependent censoring. Biometrics. 2011;67:740–9.10.1111/j.1541-0420.2010.01503.xSearch in Google Scholar PubMed PubMed Central

19. Zhang M, Schaubel DE. Double-robust semiparametric estimator for differences in restricted mean lifetimes using observational data. Biometrics. 2012;68:999–1009.10.1111/j.1541-0420.2012.01759.xSearch in Google Scholar PubMed PubMed Central

20. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press; 1993.Search in Google Scholar

21. Tsiatis AA. Semiparametric theory and missing data. New York: Springer; 2006.Search in Google Scholar

22. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.10.21236/ADA114514Search in Google Scholar

23. Cox DR. Regression models and life tables (with discussion). J R Stat Soc, Ser B, Stat Methodol. 1972;34:187–200.Search in Google Scholar

24. Cox DR. Partial likelihood. Biometrika. 1975;62:269–75.10.1093/biomet/62.2.269Search in Google Scholar

25. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–66.10.1080/01621459.1994.10476818Search in Google Scholar

26. Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data. 2nd ed. New York: Springer; 2003.10.1007/b97377Search in Google Scholar

27. National Longitudinal Survey of Youth. NLS Handbook. Columbus, Ohio: Center for Human Resource Research, Ohio State University; 1995.Search in Google Scholar

28. Rosenbaum PR, Rubin DB. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. J R Stat Soc, Ser B, Stat Methodol. 1983;45:212–8.10.1017/CBO9780511810725.017Search in Google Scholar

29. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Ann Intern Med. 2017;167:268–74.10.7326/M16-2607Search in Google Scholar PubMed

30. van der Vaart AW. Asymptotic statistics. Cambridge, UK: Cambridge University Press; 1998.10.1017/CBO9780511802256Search in Google Scholar

31. van der Vaart AW, Wellner JA. Weak convergence and empirical processes with applications to statistics. New York: Springer; 1996.10.1007/978-1-4757-2545-2Search in Google Scholar

32. Nolan D, Pollard D. Functional limit theorems for U-processes. Ann Probab. 1988;16:1291–8.10.1214/aop/1176991691Search in Google Scholar 