Doubly weighted M-estimation for nonrandom assignment and missing outcomes

This paper proposes a new class of M-estimators that double weight for the twin problems of nonrandom treatment assignment and missing outcomes, both of which are common issues in the treatment effects literature. The proposed class is characterized by a 'robustness' property, which makes it resilient to parametric misspecification in either a conditional model of interest (for example, a mean or quantile function) or the two weighting functions. As leading applications, the paper discusses estimation of two specific causal parameters: average and quantile treatment effects (ATE, QTEs), which can be expressed as functions of the doubly weighted estimator, under misspecification of the framework's parametric components. With respect to the ATE, this paper shows that the proposed estimator is doubly robust even in the presence of missing outcomes. Finally, to demonstrate the estimator's viability in empirical settings, it is applied to Calónico and Smith's (2017) reconstructed sample from the National Supported Work training program.


Introduction
When interest lies in causal inference, the prevalence of missing data poses a major identification challenge. A common issue is that the outcome of interest is missing for some proportion of the sample. In this case, the complete data method that drops observations with missing outcomes is widely used. While dropping is practically convenient, it not only leads to substantial loss of information but more importantly creates a nonrandom sample for estimation. In turn, dropping can generally lead to inconsistent treatment effect estimates. This paper proposes an estimator that double weights for the twin problems of nonrandom treatment assignment and missing outcomes by using information on covariates.
Weighting has been used extensively in both the missing data [Horvitz and Thompson (1952), Robins et al. (1994), Robins and Rotnitzky (1995), Wooldridge (2007)] and treatment effect [Rosenbaum and Rubin (1983), Hahn (1998), Hirano and Imbens (2001), Firpo (2007), Słoczyński and Wooldridge (2018)] literatures. However, a weighting approach that corrects for general missingness in the outcome to estimate treatment effects using observational data is yet to be proposed. Previous studies have considered weighting to deal with specific missing data issues such as attrition and non-response in the presence of endogenous treatment selection [Frölich and Huber (2014), Huber (2014), Fricke et al. (2020)]. Typically, the identification argument in these papers is based on one or more instruments with discussion centered around estimation of average treatment effects.
This paper introduces inverse probability weighting alongside propensity score (PS) weighting in a general M-estimation framework to address two prevalent problems in the causal inference literature. Moreover, the objective function being solved is permitted to be non-smooth in the underlying parameters thereby covering both average and quantile treatment effects. A key feature of the proposed estimator is its robustness to parametric misspecification in either a conditional model of interest (such as mean or quantile) or the two weighting functions. In addition, the ATE estimator which uses the proposed strategy is shown to be 'doubly robust' [Słoczyński and Wooldridge (2018)] even in the presence of missing outcomes.
The key identifying assumptions for consistency of the doubly weighted estimator of a population level parameter are unconfoundedness 1 and missing at random. Put differently, the two restrictions imply that the treatment assignment and missing outcomes mechanisms are as good as randomly assigned after conditioning on covariates. With respect to missingness, the mechanism also allows sample observability to depend on the treatment status. As such it allows for differential non-response, attrition, and even non-compliance to the extent that conditioning variables predict it.
For many observational studies, unconfoundedness may be a reasonable assumption. Previous literature has found several situations where such an assumption is tenable, especially when pretreatment values of the outcome variable are available. For example, LaLonde (1986) and Hotz et al. (2006) have shown that controlling for pre-training earnings alone substantially reduces the bias between non-experimental and experimental estimates. The literature assessing teacher impact on student achievement has reported similar findings with pre-test scores [Chetty et al. (2014), Kane and Staiger (2008), and Shadish et al. (2008)], indicating the plausibility of unconfoundedness in such settings.
Estimation then follows in two steps. The first step estimates the treatment and missing outcome probabilities using binary response maximum likelihood, 2 and the second step plugs in the estimated probabilities as weights to solve a general objective function. Given the parametric nature of the first and second steps, this paper highlights a robustness property which allows the estimator to remain consistent for a parameter of interest under misspecification of either a conditional model or the two probability weights. Consequently, the asymptotic theory in this paper distinguishes between these two halves. The first half focuses on misspecification of either a conditional expectation function (CEF) or a conditional quantile function (CQF), whereas the second half considers misspecification in the weighting functions.
As illustrative examples, the paper discusses robust estimation of two specific causal parameters, namely, the ATE and QTEs, expressed as functions of the doubly weighted estimator. Consistent estimation of the ATE is achievable under both misspecification scenarios. Of particular interest is the case when the conditional mean function is misspecified. For estimation of quantile treatment effects, the paper considers three different parameters, namely, the conditional quantile treatment effect (CQTE), a linear approximation to the CQTE, and the unconditional quantile treatment effect (UQTE), each of which may be of interest to the researcher depending on whether features of the conditional or unconditional outcome distribution are being studied. Simulations show that the doubly weighted ATE and QTE estimates have the lowest finite sample bias compared to alternatives which ignore one or both problems. 3 Finally, the proposed method is applied to estimate average and distributional impacts of the National Supported Work (NSW) training program on earnings for the Aid to Families with Dependent Children (AFDC) target group. The sample is obtained from Calónico and Smith (2017), who recreate LaLonde's within-study analysis for the AFDC women. The idea behind choosing this empirical application is to utilize the presence of experimental and non-experimental comparison groups for evaluating whether the strategy of double weighting brings us close to the experimental benchmark relative to other alternatives. The paper finds that the empirical bias for the doubly weighted estimate is much smaller than that for the unweighted estimate.
The rest of this paper is structured as follows. Section 2 describes the basic potential outcomes framework and provides a short description of the population models with an introduction to the naive unweighted estimator. Section 3 discusses the treatment assignment and missing outcome mechanisms, which leads us directly to the identification lemma. Section 4 develops the first half of the asymptotic theory for the doubly weighted estimator with a focus on misspecification of a conditional feature of interest. This half also requires the weights to be correct for delivering parameter identification. In contrast, Section 5 considers the other half where a conditional model of interest is correctly specified but the weights may be misspecified. Identification here relies on the parameter solving a conditional problem. Section 6 studies the specifics of robustness for estimating the ATE and QTEs in rigorous detail. Section 7 provides supporting Monte Carlo evidence under three interesting cases of misspecification: correct conditional model with misspecified weights, misspecified conditional model with correct weights, and misspecified model and weights. Section 8 applies the proposed method to job training data from Calónico and Smith (2017) and Section 9 concludes with directions for future research.

Potential outcomes and the population models
Consider the standard Neyman-Rubin causal model. Let Y (1) and Y (0) denote potential outcomes corresponding to the treatment and control states and let W be an indicator for whether an individual received the treatment. Then the observed outcome is Y = W · Y (1) + (1 − W ) · Y (0). Also, let X be a vector of pre-treatment characteristics which includes an intercept. 4 Some feature of the distribution of (Y (g), X) is assumed to depend on a finite P g × 1 vector θ g , contained in a parameter space Θ g ⊂ R Pg . 5 Let q(Y (g), X, θ g ) be an objective function that depends on outcomes, covariates, and the parameter vector, θ g . Then, the parameter of interest is defined to be a solution to the following M-estimation problem.
Assumption 1. (Identification of θ 0 g ) The parameter vector θ 0 g ∈ Θ g is a unique solution to the population minimization problem for each g = 0, 1.
An implicit point in assumption 1 is that θ 0 g is not assumed to be correctly specified for a conditional feature like a conditional mean, variance, or even the full conditional distribution. It simply requires θ 0 g to uniquely minimize the population problem in (2). If θ 0 g is correctly specified for any of the above-mentioned quantities, then the parameter is of direct interest to researchers. However, if θ 0 g is misspecified for any of these distributional features, assumption 1 guarantees a unique pseudo true solution, θ * g [White (1982)]. In the case of misspecification, determining whether θ * g is meaningful will depend on the conditional feature being studied and the estimation method used. For example, in the case of OLS, θ 0 g will index a linear projection if one is agnostic about linearity of the CEF. Angrist et al. (2006) establish analogous approximation properties for quantiles, where a misspecified CQF can still provide the best weighted mean square approximation to the true CQF.

4 As mentioned in Negi and Wooldridge (2020), X may include functions of covariates such as levels, squares, and interactions, which will be chosen by the researcher. The dimension of the covariate vector is assumed fixed and does not grow with the sample size.
5 For generality, the dimension of θ g is allowed to be different for the treatment and control group problems and is also different than the dimension of X, where X ∈ X ⊂ R dim(X) .
6 For a random variable u, c τ (u) = (τ − 1{u < 0})u is the asymmetric loss function for estimating quantiles and 1{·} is an indicator function.
Let S be a binary indicator such that S = 1 if the outcome is observed and S = 0 otherwise. The objective of this paper is to consistently estimate θ 0 g . In the presence of missing outcomes, a common empirical strategy is to solve the following M-estimation problems for the treatment and control groups, respectively.
Let us refer to the estimator that solves (3) as the unweighted M-estimator and denote it as θ̂ u g . This estimator uses the available sample after dropping the missing data to estimate θ 0 g . Using the reverse analogy principle, θ̂ u g will be consistent for θ 0 g if it solves the population analogue of (3), which may not be true. As an example, consider a linear model for the potential outcomes, Y (g) = Xθ 0 g + U (g) with E[X U (g)] = 0. In this case, even if the treatment is randomly assigned, missingness may still be correlated with the treatment, observable factors, or both. Hence, the population first order condition for the selected sample, E[S · W · X U (g)] = 0, need not hold even though E[X U (g)] = 0. So identification of θ 0 g is now confounded on two grounds: nonrandom assignment, which renders the treatment and control groups incomparable, and missing outcomes, which violates the 'random sampling' assumption. The next section discusses the identification approach taken in this paper.
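To make the failure of the complete-case estimator concrete, the following simulation (all functional forms and coefficients are hypothetical choices for illustration, not taken from the paper) targets the population linear projection of Y (1) on (1, X) when the true CEF is quadratic. Dropping missing and control observations shifts the projection, while weighting by ω 1 = S · W/[r(X, W ) · p(X)] (known weights here) recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Quadratic CEF, so a linear model for Y(1) is misspecified by construction
x = rng.uniform(0.0, 2.0, n)
y1 = x**2 + rng.normal(0.0, 1.0, n)

# Nonrandom assignment p(X) and missingness r(X, W), both increasing in x
p = 1.0 / (1.0 + np.exp(-(x - 1.0)))
w = rng.binomial(1, p)
r = 1.0 / (1.0 + np.exp(-(1.5 * x - 1.0 + 0.5 * w)))
s = rng.binomial(1, r)

X = np.column_stack([np.ones(n), x])

def wls(Xm, y, wgt):
    """Weighted least squares: solves sum_i wgt_i * x_i (y_i - x_i'theta) = 0."""
    Xw = Xm * wgt[:, None]
    return np.linalg.solve(Xw.T @ Xm, Xw.T @ y)

# Target: population linear projection of Y(1) on (1, x), full sample
theta_target = wls(X, y1, np.ones(n))

# Complete-case (unweighted) estimator: keeps only S = 1, W = 1 observations
theta_u = wls(X, y1, (s * w).astype(float))

# Doubly weighted estimator with known weights omega_1 = S*W / [r(X,W) * p(X)]
theta_dw = wls(X, y1, s * w / (r * p))
```

Here the complete-case slope drifts away from the full-population projection because the selected sample over-represents large x, where the quadratic CEF is steeper; the doubly weighted slope does not.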

Identification of parameter of interest
Without imposing any structure on the assignment and missingness mechanisms in the population, estimating θ 0 g remains difficult. To proceed with identification, I assume that the treatment is unconfounded on covariates. 7 Formally,
Assumption 2. (Strong ignorability) Assume,
(Y (1), Y (0)) ⊥ W | X (4)
i) The vector of pre-treatment covariates, X, is always observed for the entire sample.
ii) 0 < p(X) ≡ P(W = 1|X) < 1 for all x ∈ X .
Equation (4) indicates that conditioning on covariates is enough to parse out any systematic differences that may exist between the treatment and control groups. One advantage of unconfoundedness is that, intuitively, it has a better chance of holding once we control for a rich set of variables in X. 8 Note that unconfoundedness not only includes cases where the treatment is a deterministic function of the covariates, for example stratified (or block) experiments, but also cases where the treatment is a stochastic function of covariates. Part i) requires that we observe these covariates for all individuals. Part ii) is an overlap condition which ensures that for all values of x in X, we observe units in both the treatment and control groups. 9 With respect to the missing outcomes mechanism, I assume selection on observables. Formally,
Assumption 3. (Missing at Random (MAR)) Assume,
S ⊥ Y (g) | X, W for g = 0, 1 (5)
i) In addition to X, W is always observed for the entire sample.
ii) r(X, W ) ≡ P(S = 1|X, W ) > 0 for all x ∈ X and w ∈ {0, 1}.
Equation (5) states that conditional on covariates and the treatment status, the individuals whose outcomes are missing do not differ systematically from those who are observed. This implies that adjusting for X and W renders the outcomes as good as randomly missing. In the statistics literature, this assumption is known as MAR and represents a mechanism wherein missingness only depends on observables and not on the missing values of the variable itself [Little and Rubin (2019)]. Special cases covered under this mechanism are patterns such as missing completely at random (MCAR) and exogenous missingness considered in Wooldridge (2007). Allowing the missingness probability to be a function of the treatment indicator is particularly useful in cases of differential nonresponse. For instance, in NSW, people assigned to the treatment group were less likely to drop out of the program compared to the control group. In such cases, covariates alone may not be sufficient for predicting missingness. To the extent that being observed in the sample is predicted by X and W , assumption 3 can accommodate non-observability due to sampling design, item non-response, and attrition in a two period panel. 10 Part i) of the above assumption ensures that X and W are fully observed and part ii) again imposes an overlap condition. It states that there is a positive probability of observing people in the sample for a given X and W .
Then solving the doubly weighted population problem given below is the same as solving the original M-estimation problem in (2). The following lemma establishes this equality, where
ω 1 = S · W / [r(X, W ) · p(X)] and ω 0 = S · (1 − W ) / [r(X, W ) · (1 − p(X))].
The proof uses two applications of the law of iterated expectations (LIE) with unconfoundedness and MAR to arrive at the above result. It implies that one can now address the identification issue due to nonrandom assignment and missing outcomes by solving the doubly weighted population problem. 11

Asymptotic theory under weak identification

Lemma 1 is important for us as it helps to illustrate the role of double weighting in dealing with the two issues at hand. However, to operationalize this argument, we first need to estimate r(X, W ) and p(X) before introducing the estimator and studying its asymptotic properties.
The following assumptions posit that we have correctly specified models for the two probabilities and that we estimate them using binary response maximum likelihood. Since both W and S are binary responses, estimation of γ 0 and δ 0 using MLE will be asymptotically efficient under correct specification of these functions. Consistency and asymptotic normality of the resulting estimators, γ̂ and δ̂, follow from theorems 2.5 and 3.3 of Newey and McFadden (1994).
Assumption 4. (Correct parametric specification of propensity score) Assume that i) There exists a known parametric function G(X, γ) for p(X) where γ ∈ Γ ⊂ R I and 0 < G(X, γ) < 1 for all X ∈ X , γ ∈ Γ; ii) There exists γ 0 ∈ Γ s.t. p(X) = G(X, γ 0 ); iii) γ̂ is the binary response maximum likelihood estimator that solves
max γ∈Γ N −1 Σ i [W i log G(X i , γ) + (1 − W i ) log(1 − G(X i , γ))]. (7)
Assumption 5. (Correct parametric specification of missing outcomes probability) Assume that i) There exists a known parametric function R(X, W, δ) for r(X, W ) where δ ∈ ∆ ⊂ R K and R(X, W, δ) > 0 for all X ∈ X , δ ∈ ∆; ii) There exists δ 0 ∈ ∆ s.t. r(X, W ) = R(X, W, δ 0 ); iii) δ̂ is the binary response maximum likelihood estimator that solves
max δ∈∆ N −1 Σ i [S i log R(X i , W i , δ) + (1 − S i ) log(1 − R(X i , W i , δ))]. (8)
The influence function representations for γ̂ and δ̂ can then be written as
√N (γ̂ − γ 0 ) = N −1/2 Σ i d i + o p (1) and √N (δ̂ − δ 0 ) = N −1/2 Σ i b i + o p (1),
where d i and b i are the (suitably normalized) scores of the binary response log-likelihood problems in (7) and (8) evaluated at the probability limits γ 0 and δ 0 , respectively. The doubly weighted estimator is then defined as
θ̂ g = argmin θ g ∈Θ g N −1 Σ i ω̂ ig · q(Y i (g), X i , θ g ), g = 0, 1,
where ω̂ i1 = S i W i / [R(X i , W i , δ̂) · G(X i , γ̂)] and ω̂ i0 = S i (1 − W i ) / [R(X i , W i , δ̂) · (1 − G(X i , γ̂))] are the estimated weights for solving the treatment and control group problems, respectively. 12 Given the two-step nature of the estimation problem (the first step uses binary response MLE to estimate the probability weights and the second step solves an objective function using the first-step weights), the asymptotic theory utilizes results for two-step estimators with a non-smooth objective function to establish the large sample properties of θ̂ g . The following theorem fills in the primitive regularity conditions for applying the uniform law of large numbers.

11 Define q(Y, X, θ) = q(Y (1), X, θ 1 ) for W = 1 and q(Y (0), X, θ 0 ) for W = 0.
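The two-step procedure (binary response MLE for the two weights, then a weighted second step) can be sketched in numpy, assuming logit models for both G(X, γ) and R(X, W, δ) and a linear CEF. All coefficient values are hypothetical:

```python
import numpy as np

def logit_mle(Z, y, iters=30):
    """Binary-response MLE with a logit link, fit by Newton's method."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (y - mu)
        hess = (Z * (mu * (1.0 - mu))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Treatment and missingness truly follow logit models, so assumptions 4-5 hold
w = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x))))
Zr = np.column_stack([np.ones(n), x, w])
s = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 0.6 * x - 0.4 * w))))

theta1_true = np.array([1.0, 2.0])
y = np.where(s == 1, X @ theta1_true + rng.normal(size=n), np.nan)  # seen only if S = 1

# Step 1: estimate G(X, gamma) and R(X, W, delta) by binary-response MLE
G = 1.0 / (1.0 + np.exp(-X @ logit_mle(X, w)))
R = 1.0 / (1.0 + np.exp(-Zr @ logit_mle(Zr, s)))

# Step 2: plug in estimated weights omega_1 = S*W/(R*G), solve weighted least squares
omega1 = s * w / (R * G)
y_filled = np.nan_to_num(y)          # omega1 is zero wherever y is missing
Xw = X * omega1[:, None]
theta1_hat = np.linalg.solve(Xw.T @ X, Xw.T @ y_filled)
```

With both first-step models and the CEF correctly specified, theta1_hat recovers the treated-group parameter despite roughly half the outcomes being missing nonrandomly.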
Theorem 1. (Consistency) Suppose assumption 1 holds and that i) {(Y i , X i , W i , S i ); i = 1, 2, . . . , N } are i.i.d draws satisfying assumptions 2 and 3; ii) Θ g is compact for g = 0, 1; iii) G(X, γ) satisfies assumption 4 and is continuous for each γ on the support of X. Similarly, R(X, W, δ) satisfies assumption 5 and is continuous for each δ on the support of (X, W ); iv) q(Y (g), X, θ g ) is continuous in θ g with probability one; v) |q(Y (g), X, θ g )| ≤ b(Y (g), X) for all θ g ∈ Θ g with E[b(Y (g), X)] < ∞. Then θ̂ g →p θ 0 g for g = 0, 1. The proof follows from verifying the conditions in Lemma 2.4 of Newey and McFadden (1994). Under the dominance condition given in v), uniform convergence of sample averages holds quite generally.
For establishing asymptotic normality, I provide primitive conditions for the general case of non-smooth objective functions. Let the score of q(Y (g), X, θ g ) at the true parameter, θ 0 g , be denoted as h(Y (g), X, θ 0 g ) ≡ h g and suppose it exists with probability one. Let the population problem be denoted as Q 0 (θ g ) ≡ E[ω g · q(Y (g), X, θ g )] and the sample analogue be given as
Q̂(θ g ) ≡ (N ρ̂ g ) −1 Σ i ω̂ ig · q(Y i (g), X i , θ g ),
where ρ̂ g = N g /N and N ρ̂ g → ∞ with ρ̂ g →p ρ g . 13 For the sake of asymptotics, we may ignore the division by ρ̂ g . The main condition needed for establishing asymptotic normality is stochastic equicontinuity of the relevant empirical process,

12 When necessary, the estimated weights will also be denoted as ω g (δ̂, γ̂) ≡ ω̂ g .
13 The sampling fractions are random, which implies that N = N 0 + N 1 is also random as opposed to being fixed ahead of time.
which will be sufficient to guarantee uniform convergence of the objective function to its population counterpart.
Theorem 2. (Asymptotic Normality) In addition to the conditions mentioned in Theorem 1, assume standard regularity conditions for two-step estimators with non-smooth objective functions, including v) G(·, γ) and R(·, δ) are both twice continuously differentiable on int(Γ) and int(∆), respectively. Then
√N (θ̂ g − θ 0 g ) →d Normal(0, H g −1 Ω g H g −1 )
for each g = 0, 1, where H g is the Hessian of the population objective function at θ 0 g , Ω g is the variance of the population residual from regressing the weighted score on the binary response scores b i and d i , and l ig ≡ ω ig h ig is the score of the weighted objective function evaluated at θ 0 g .
Sufficient primitive conditions for stochastic equicontinuity may be found in Andrews (1994). The asymptotic variance expression derived above offers some interesting insights. First, the middle term, Ω g , represents the variance of the residual from the population regression of the weighted score, l ig , on the two binary response scores, b i and d i . Note that even though Ω g would involve covariance between the two MLE scores, that term is zero on account of the two scores being conditionally independent.
Second, the expression for Ω g has an efficiency implication for the second step estimate, θ̂ g . When a researcher is only willing to assume identification of θ 0 g in the unconditional sense, it is potentially more efficient to estimate the two weights even when they are known. To show this formally, let us assume that p(X) and r(X, W ) are known and θ̃ g is the doubly weighted estimator that uses the known weights, ω g . Then,
Corollary 1. (Efficiency gain with estimated weights) Under the assumptions of theorem 2, Avar[√N (θ̃ g − θ 0 g )] − Avar[√N (θ̂ g − θ 0 g )] is positive semi-definite.
In other words, we do no worse, asymptotically, by estimating the weights even when we actually know them. This result can be seen as an extension of Wooldridge (2007) to the case when one has two sets of probability weights being estimated in the first stage. 14

14 In the missing data literature, this result has also been called the "efficiency puzzle". Prokhorov and Schmidt (2009) study this puzzle in a GMM framework using an augmented set of moment conditions, where the first set of moments corresponds to the weighted objective function and the second set belongs to the missing outcomes (or selection, in their case) problem.

A conditional feature of interest is correctly specified
The asymptotic results in the previous section were derived under the assumption that some feature of the conditional distribution of outcomes may be misspecified. This was implicit in defining θ 0 g as a solution to the unconditional M-estimation problem. Examples include estimating a misspecified linear conditional mean or quantile function. In contrast, this section highlights the other half of the asymptotic theory which is formalized using a strong version of the identification assumption and allowing the weights to be misspecified.
Assumption 6. (Strong identification of θ 0 g ) The parameter vector θ 0 g ∈ Θ g is the unique solution to the conditional population minimization problem
min θ g ∈Θ g E[q(Y (g), X, θ g )|X] (12)
under unconfoundedness (defined in 2) and MAR (defined in 3), for each X ∈ X ⊂ R dim(X) .
The above can be seen as a strengthening of the identification assumption in section 4 since the LIE implies that θ 0 g is also a solution to the unconditional M-estimation problem. By requiring θ 0 g to solve (12), assumption 6 is intended for situations where a conditional feature of interest is correctly specified. An implication of this strengthened identification is that θ 0 g now solves the conditional score of the objective function, i.e. E[h(Y (g), X, θ 0 g )|X] = 0. For instance, the conditional score will be zero in the case of estimating a correctly specified CEF with either OLS or quasi maximum likelihood estimation (QMLE) in the linear exponential family (LEF). This would also hold for a correctly specified CQF estimated either using quantile regression or QMLE in the tick exponential family [Komunjer (2005)].
Delineating these two identification scenarios is important for determining which causal parameter can be estimated consistently under each setting. As we will see in the next section, it is possible to estimate the ATE under both cases of misspecification. However, the same cannot be said for the QTE parameters. Under assumption 6, the asymptotic results in this half do not rely on correct specification of the weights. In other words, assuming R(·, ·, δ) and G(·, γ) to be correctly specified is rather restrictive and not required for the doubly weighted estimator to be consistent for θ 0 g .
Assumption 7. (Parametric specification of propensity score) Assume that conditions i) and iii) of assumption 4 hold, where condition ii) is defined for some γ * ∈ Γ such that plim(γ̂) = γ * .
Assumption 8. (Parametric specification of missingness probability) Assume that conditions i) and iii) of assumption 5 hold, where condition ii) is defined for some δ * ∈ ∆ such that plim(δ̂) = δ * .
Note that assumptions 7 and 8 do not require the parametric models for the two probabilities to be correctly specified. Nevertheless, we continue to assume that γ̂ and δ̂ solve the same binary response problems as in assumptions 4 and 5, with probability limits given by the pseudo true values γ * and δ * , respectively [White (1982)]. To show that θ 0 g is still a solution to the doubly weighted population problem with misspecified weights, a sketch of the argument is given below. Consider the doubly weighted population problem with misspecified weights,
min θ g ∈Θ g E[ω * g · q(Y (g), X, θ g )], (13)
where ω * g are asymptotic weights which use G(X, γ * ) and R(X, W, δ * ). Using LIE along with unconfoundedness and MAR, I can rewrite the above expectation as
E[ξ g (X) · E(q(Y (g), X, θ g )|X)],
where ξ g (X) is a function of weights for g = 0, 1 (for example, ξ 1 (X) = r(X, 1) · p(X) / [R(X, 1, δ * ) · G(X, γ * )]). The strong identification assumption implies
E[ξ g (X) · E(q(Y (g), X, θ 0 g )|X)] ≤ E[ξ g (X) · E(q(Y (g), X, θ g )|X)],
where the inequality is strict when θ g ≠ θ 0 g . Therefore, solving the doubly weighted problem identifies the parameter even if the weights are wrong. In general, the parameter that solves (13) will be different from the one that solves the same problem with correct weights. 15 But as long as θ 0 g is a unique solution, solving (13) will identify it.
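This robustness can be checked numerically: with a correctly specified linear CEF, weighting by deliberately wrong (but positive and bounded) functions in place of p(X) and r(X, W ) still recovers θ 0 g , as does not weighting at all. A minimal sketch with hypothetical models and coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

theta0 = np.array([0.5, 1.5])
y1 = X @ theta0 + rng.normal(size=n)      # correctly specified linear CEF for Y(1)

p = 1.0 / (1.0 + np.exp(-(0.2 + 0.9 * x)))            # true p(X)
w = rng.binomial(1, p)
r = 1.0 / (1.0 + np.exp(-(0.4 + 0.7 * x - 0.3 * w)))  # true r(X, W)
s = rng.binomial(1, r)

# Deliberately wrong weight models: arbitrary bounded, positive functions of X
G_star = 0.3 + 0.4 / (1.0 + x**2)
R_star = 0.6 + 0.2 * np.cos(x)
omega1_star = s * w / (R_star * G_star)

def wls(Xm, y, wgt):
    """Weighted least squares with weights wgt."""
    Xw = Xm * wgt[:, None]
    return np.linalg.solve(Xw.T @ Xm, Xw.T @ y)

theta_wrongw = wls(X, y1, omega1_star)             # misspecified weights
theta_cc = wls(X, y1, (s * w).astype(float))       # constant (unit) weights
```

Both estimates are close to theta0 because the conditional score is zero at θ 0 g for every X, so any positive reweighting of the covariate distribution leaves the solution unchanged.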
The following two theorems establish consistency and asymptotic normality of the doubly weighted estimator.
Theorem 3. (Consistency under strong identification) Under assumptions 6, 7, and 8 and the regularity conditions of theorem 1 (with assumptions 4 and 5 replaced by 7 and 8), θ̂ g →p θ 0 g for g = 0, 1.
Theorem 4. (Asymptotic Normality under strong identification) Under the assumptions of theorem 3 and the regularity conditions of theorem 2, where the MLE estimators γ̂ and δ̂ have probability limits given by γ * and δ * ,
√N (θ̂ g − θ 0 g ) →d Normal(0, H g −1 Ω g H g −1 ),
where Ω g = E[l ig l ig ′] with H g and l ig defined as in Theorem 2, except with asymptotic weights given by ω * ig . Substantively, there is no real difference in the proof of the above theorem compared to those derived in section 4, except that now γ̂ and δ̂ are converging to probability limits that could be potentially different from those indexing the true treatment and missing outcome probabilities. A consequence of the objective function solving the conditional problem is reflected in the asymptotic variance expression. Compared to the previous section, Ω g now is simply the variance of the weighted score of the objective function, without the first stage adjustment for the estimated probabilities. This is because under assumption 6, E[l ig b i ′] = E[l ig d i ′] = 0. A sketch of the proof for E[l ig b i ′] = 0 is provided below. The argument for E[l ig d i ′] follows analogously.
E[l ig b i ′] = E[ω * ig h ig b i ′] = E[ζ g (X) · E(h(Y (g), X, θ 0 g )|X)] = 0,
where ζ g (X) is a function of weights. The first equality uses the definition of l ig with misspecified weights and the second equality applies LIEs with unconfoundedness and MAR; the final expression is zero because E[h(Y (g), X, θ 0 g )|X] = 0 under assumption 6. In other words, the reason for obtaining a simpler expression for Ω g is because the correlation between the weighted score of the objective function and the two binary response scores is zero when θ 0 g is correctly specified for a conditional feature of interest and we use an appropriate method to estimate it. A simpler expression for Ω g also means that we can no longer exploit these correlations between scores to obtain asymptotic efficiency for estimating θ 0 g . Again, let θ̃ g be the doubly weighted estimator that uses the true weights, ω g . Then,
Corollary 2. (No gain with estimated weights under strong identification) Under the assumptions of theorem 4, Avar[√N (θ̂ g − θ 0 g )] = Avar[√N (θ̃ g − θ 0 g )].
Hence knowledge of the weights does little when, for instance, we have a correctly specified CEF or CQF and we use either OLS or QR to estimate the parameters indexing these conditional models of interest.
A special case of weights misspecification is when ω * g is a constant. This is plausible since R(X, W, δ * ) and G(X, γ * ) are allowed to be any bounded positive functions of X and W . In other words, the unweighted estimator, θ̂ u g , which does not weight to correct for either problem, is also consistent for θ 0 g under the results of theorem 3. In fact, assumptions 7 and 8 suggest that any weighted estimator will suffice for estimating θ 0 g . In this case, one may turn to asymptotic efficiency to guide the choice between weighting or not weighting at all. The following result says that if the objective function satisfies the generalized conditional information matrix equality (GCIME), the unweighted estimator is asymptotically more efficient than any of its weighted counterparts (correctly specified weights or not).
Corollary 3. (Efficiency gain with unweighted estimator under GCIME) Under the assumptions of theorem 4, if we additionally suppose that the objective function satisfies the GCIME in the population, defined as
E[h ig h ig ′|X] = σ 2 0g · A i , where A i ≡ E[∇ θ h ig |X],
then Avar[√N (θ̂ g − θ 0 g )] − Avar[√N (θ̂ u g − θ 0 g )] is positive semi-definite. The proof of this result follows from noting that we can express the difference in the two asymptotic variances as the expected outer product of population residuals from the regression of B i on D i , which are weighted versions of the square root of the matrix A i (see appendix F for details). Hence the difference is positive semi-definite.
GCIME holds in a variety of estimation contexts. In the case of full maximum likelihood, GCIME holds for q(Y (g), X, θ g ) = −log f (Y (g)|X, θ g ), where f is the true conditional density, with σ 2 0g = 1. For estimating conditional mean parameters using QMLE in the linear exponential family (LEF), GCIME holds if Var(Y (g)|X) = σ 2 0g · v(m(X, θ 0 g )), where v(·) is the variance function associated with the chosen LEF density. In other words, GCIME will be satisfied if Var(Y (g)|X) satisfies the generalized linear model assumption, irrespective of whether the higher order moments of the conditional distribution correspond to the chosen QLL or not. For estimation using nonlinear least squares, GCIME will hold for q(Y (g), X, θ g ) = [Y (g) − m(X, θ g )] 2 with the homoskedasticity assumption. Hence in all these cases the unweighted estimator will be more efficient than its weighted counterpart. But when GCIME is not satisfied, the two may not be easy to rank.

Estimation of treatment effects
The asymptotic theory can now be used to discuss estimation of specific causal estimands like the ATE and QTEs, which can be expressed as functions of θ 0 g , the parameter identified by the doubly weighted estimator.

Average treatment effect
As discussed in Słoczyński and Wooldridge (2018), DR estimators remain consistent for the population ATE despite misspecification in either the conditional mean function or the propensity score, but not both. The current doubly weighted framework, along with the results developed in sections 4 and 5, allows us to extend this result to the case with missing outcomes. Let m(X, θ g ) be a parametric model for the conditional mean, which is said to be correctly specified for the CEF if m(X, θ 0 g ) = E[Y (g)|X] for some θ 0 g ∈ Θ g . Then, let us consider the following two scenarios in turn.

Double robustness
First half: Correct conditional mean When the conditional mean function is correct, there is more than one estimation method that can be used to consistently estimate θ 0 g , namely, nonlinear least squares (NLS) and QMLE in the LEF. For both these examples, results from section 5 dictate that weighting is not needed for consistency. The fact that one could weight by misspecified weights and still consistently estimate θ 0 g is what forms the 'first half' of the DR result with double weighting.
Once θ 0 g has been estimated by solving the sample version of the NLS or QMLE problem, the ATE can be estimated as
τ̂ ate = N −1 Σ i [m(X i , θ̂ 1 ) − m(X i , θ̂ 0 )].
If in addition to having a correct conditional mean, I also assume the error variance of the outcomes to be homoskedastic, E[U 2 (g)|X] = Var[U (g)|X] = σ 2 0g , then the NLS estimator that does not weight at all is the preferred alternative from an efficiency perspective. This is due to GCIME being satisfied with NLS under homoskedasticity.
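A sketch of this first half under hypothetical linear CEFs: complete-case OLS within each treatment arm (consistent here, since weighting is not needed when the mean is correct) followed by averaging the fitted difference over the full sample, for which X is always observed:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Hypothetical linear CEFs for the two potential outcomes; true ATE = 2.0 - 0.5 = 1.5
y1 = X @ np.array([2.0, 1.0]) + rng.normal(size=n)
y0 = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
true_ate = 1.5

w = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.8 * x)))
s = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 0.5 * x - 0.3 * w))))
y = np.where(w == 1, y1, y0)

def ols(Xm, yv, keep):
    """OLS on the subsample selected by the boolean mask keep."""
    Xs, ys = Xm[keep], yv[keep]
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# Correct means: complete-case OLS within each arm is consistent, no weighting needed
th1 = ols(X, y, (s == 1) & (w == 1))
th0 = ols(X, y, (s == 1) & (w == 0))

# Average the fitted difference m(X, th1) - m(X, th0) over the FULL sample
ate_hat = np.mean(X @ th1 - X @ th0)
```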
Second half: Correct weights If one acknowledges misspecification in the conditional mean model, there is no general way of consistently estimating the ATE. However, a useful mean fitting property of QMLEs in the LEF, along with double weighting, can be used here to obtain consistent estimates of the unconditional means, E[Y (g)], despite misspecification in the conditional means, E[Y (g)|X]. 16 In the generalized linear model (GLM) literature, the link function, h −1 (·), relates the mean of the distribution to a linear index as
h −1 (E[Y (g)|X]) = Xθ g , or equivalently, E[Y (g)|X] = h(Xθ g ).
The estimation strategy then is to choose m(X, θ g ) to be the function h(Xθ g ), with the QLL corresponding to a choice of LEF density. Then the population first order conditions from solving this QMLE problem give us
E[ ∇ θ h(Xθ * g )′ · (Y (g) − h(Xθ * g )) / v(h(Xθ * g )) ] = 0, (16)
where v[h(·)] is the variance function evaluated at the mean and θ * g denotes the pseudo true parameter indexing the misspecified conditional mean model [White (1982)]. In particular, by choosing h −1 (·) to be the canonical link for the QLL associated with the density, the gradient in the numerator of (16) cancels with the variance term in the denominator. Note that this occurs only when one uses the canonical link function.
Such cancellation of terms ensures that if one includes an intercept in X, the misspecified mean model fits the overall mean of the distribution (see Wooldridge (2010), chapter 13, for more detail), so that E[Y(g)] = E[h(Xθ*_g)]. With nonrandom assignment and missing outcomes, solving the sample GLM FOC in (16) would still not be sufficient for consistently estimating θ*_g. Therefore, one would instead solve the doubly weighted FOC given in (17) below.
The role played by weighting is crucial here for θ̂_g to be consistent for the pseudo-true parameter θ*_g. This forms the 'second half' of the DR result with double weighting.^17 If h(·) is the identity function, the first order conditions above can be recognized as those belonging to OLS, with the line of best fit passing through the mean of Y. This is because OLS is a QMLE with a normal QLL and identity link, typically used for outcomes with unrestricted support. Other combinations of QLLs and canonical link functions can be found in Table 2 of Negi and Wooldridge (2020) and have to be chosen depending on the range and nature of Y.

^16 The property of QMLEs that we are most familiar with is that parameters in a correctly specified conditional mean can be consistently estimated if we choose m(X, θ_g) so that its range corresponds to the chosen LEF density (or QLL function), irrespective of the range and nature of the outcomes. This property is used in the first half of the DR result.
^17 Section F in the online appendix provides a detailed proof of how the population GLM FOCs identify the unconditional means (and hence the ATE).

Summary. DR estimation of ATE with double weighting
Case 1: Correct mean, misspecified weights
1. Consistent estimates of the conditional mean parameters, θ_0g, can be obtained using either NLS or QMLE in the LEF.

2. A consistent estimator of the ATE is obtained as

∆̂_ate = N^{-1} Σ_{i=1}^{N} [m(X_i, θ̂_1) − m(X_i, θ̂_0)].
Case 2: Misspecified mean, correct weights
1. Depending upon the range and nature of the outcome, Y, choose an appropriate QLL associated with an LEF density. Choose the mean function m(X, θ_g) = h(Xθ_g), where h(·) is the inverse canonical link function associated with the chosen density. Using this combination of mean function and QLL, use the moment conditions in (17) to obtain consistent estimates, θ̂_g.

2. Consistent estimates of the ATE can then be obtained as

∆̂_ate = N^{-1} Σ_{i=1}^{N} [h(X_i θ̂_1) − h(X_i θ̂_0)],

where X includes an intercept and θ̂_g solves the GLM first order conditions.
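As an illustrative sketch of Case 2, the following Python snippet simulates data and shows the mean-fitting property at work: the mean model omits a relevant covariate, yet averaging the doubly weighted QMLE's fitted values still recovers the ATE. All names, index parameters, and the logit (canonical link) choice below are my own illustrative assumptions, and the two probability weights are taken as known rather than estimated.

```python
import numpy as np

def weighted_logit(X, y, w, iters=50):
    # Weighted logit QMLE (canonical link): solves sum_i w_i x_i'(y_i - p_i) = 0 via Newton's method
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-np.clip(X @ b, -30, 30)))
        score = X.T @ (w * (y - p))
        H = (X * (w * p * (1 - p))[:, None]).T @ X
        b = b + np.linalg.solve(H, score)
    return b

rng = np.random.default_rng(0)
N = 20_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])

# Binary potential outcomes depending on both covariates
p1 = 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0, -0.5]))))
p0 = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, -0.5]))))
Y1, Y0 = rng.binomial(1, p1), rng.binomial(1, p0)

# Correct (known) assignment and missingness probabilities
G = 1 / (1 + np.exp(-(X @ np.array([0.0, 0.7, 0.3]))))
W = rng.binomial(1, G)
R = 1 / (1 + np.exp(-(X @ np.array([0.5, 0.4, 0.0])) - 0.3 * W))
S = rng.binomial(1, R)
Y = np.where(W == 1, Y1, Y0).astype(float)

# Misspecified mean model: omit the first covariate (keep intercept and X2)
Xm = X[:, [0, 2]]
w1 = S * W / (R * G)                  # double weight: treated group
w0 = S * (1 - W) / (R * (1 - G))      # double weight: control group
obs = S == 1
b1 = weighted_logit(Xm[obs], Y[obs], w1[obs])
b0 = weighted_logit(Xm[obs], Y[obs], w0[obs])

# Averaging fitted means over the full sample recovers E[Y(1)] - E[Y(0)]
m1 = 1 / (1 + np.exp(-(Xm @ b1)))
m0 = 1 / (1 + np.exp(-(Xm @ b0)))
ate_hat = np.mean(m1 - m0)
ate_true = np.mean(p1 - p0)
```

Because the logit is its own canonical link and Xm includes an intercept, the doubly weighted FOCs force the weighted fitted means to match the unconditional means, so ate_hat is close to ate_true despite the omitted covariate.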

Quantile treatment effects
Unlike the case of the ATE, it is generally not possible to obtain the UQTE by averaging the CQTE over the distribution of X. In this section, I use double weighting to illustrate estimation of three different quantile estimands, namely, the UQTE, the CQTE, and a weighted linear approximation (LP) to the true CQTE, each of which may be of interest to the researcher depending on whether features of the conditional or unconditional outcome distributions are of interest. Whether θ_0g indexes the true CQF or an approximation depends on what is being assumed about the conditional quantile model and the estimation method used.
Let's assume that the two potential outcomes are continuous on R. It is typical to define the τ-th quantile of Y(g) as

Q_{τ,g} = inf{y : F_g(y) ≥ τ}.

Then the UQTE for the τ-th quantile is defined as the difference in the marginal quantiles of the outcome distributions,

UQTE_τ = Q_{τ,1} − Q_{τ,0}.

Similarly, one may define the τ-th conditional quantile of Y(g) for X = x as

Q_{τ,g}(x) = inf{y : F_g(y|x) ≥ τ},

where F_g(·|x) denotes the conditional distribution function of Y(g) given X = x. Then the CQTE for the τ-th quantile for a subgroup defined by X = x is

CQTE_τ(x) = Q_{τ,1}(x) − Q_{τ,0}(x).

Let q_τ(X, θ_g(τ)) be a parametric model for the τ-th conditional quantile of Y(g), which is said to be correctly specified if, for some θ_0g(τ), Q_{τ,g}(X) = q_τ(X, θ_0g(τ)).

Estimation of CQTE_τ: Much like the conditional mean case, if CQF_τ is correctly specified, there are two methods that ensure consistent estimation of the CQF parameters, θ_0g(τ). The first is the CQR of Koenker and Bassett (1978); the second is a class of QML estimators that use a special 'tick-exponential' family of distributions to suggest consistent estimators of conditional quantile parameters. This QMLE class has been proposed by Komunjer (2005). The method is analogous to estimating a correctly specified conditional mean function using QMLE in the linear exponential family.
For estimation that uses CQR, θ_0g(τ) will actually solve the stronger conditional problem,

θ_0g(τ) = argmin_{θ_g} E[ρ_τ(Y(g) − q_τ(X, θ_g)) | X],

where ρ_τ(u) = u(τ − 1{u < 0}) is the check function. For estimation via QMLE, as long as the CQF is correct and we choose an appropriate QLL,

θ_0g(τ) = argmax_{θ_g} E[log φ_τ(Y(g); q_τ(X, θ_g)) | X],

where φ_τ(·, ·) is a density belonging to the tick-exponential family.^18 As dictated by the results in section 5, weighting the QR or QML objective functions, irrespective of whether the weights are correctly specified or not, will also deliver a consistent estimator of θ_0g(τ).
Estimation of LP to CQTE τ : The traditional literature on conditional quantile estimation has focused on correct specification. However, Angrist et al. (2006) establish an approximation property of CQR that is analogous to the approximation property of linear regression. The main implication of such a result is that solving CQR with q τ (X, θ g (τ )) = Xθ g (τ ) would still identify a weighted linear approximation to CQF τ . Therefore, the difference in LPs of τ -quantile CQFs is interpretable as identifying an LP to the CQTE τ .
As before, weighting becomes crucial in the presence of nonrandom assignment and missing outcomes for identifying the LP parameters.
In other words, one would need to weight the CQR problem with correct weighting functions to consistently estimate θ*_g(τ), which indexes the true LP to CQF_τ for group g.

Direct estimation of UQTE_τ: As mentioned at the beginning of this section, estimating UQTE_τ from CQTE_τ is generally not possible even if we assume a correct model for the conditional quantiles of Y(g). In other words, one cannot obtain unconditional quantiles by averaging conditional quantiles over the distribution of X. In this case, we can directly estimate Q_{τ,g} by running a quantile regression of the outcome on an intercept (similar to Firpo (2007)).^19 In the present case, minimizing the doubly weighted check-function objective over a constant gives an estimator θ̂_g(τ) such that θ̂_g(τ) →p Q_{τ,g}. Weighting by G(·) and R(·) is crucial here since these functions serve to remove the biases arising from nonrandom assignment and missing outcomes. One can then obtain the unconditional quantile treatment effect as

UQTE_τ = θ̂_1(τ) − θ̂_0(τ).

An alternative method of estimating UQTE_τ is to use the recentered influence functions suggested by Firpo et al. (2009). The next section discusses results from a Monte Carlo study which evaluates the finite sample behavior of doubly weighted ATE and QTE estimators under three different misspecification scenarios.
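To make the direct UQTE estimation concrete, here is a minimal Python sketch on simulated data (all names and parameter values are illustrative, and the weights are taken as known): the doubly weighted quantile-regression-on-an-intercept problem reduces to a weighted sample quantile.

```python
import numpy as np

def weighted_quantile(y, w, tau):
    # Solves argmin_q sum_i w_i * rho_tau(y_i - q): the tau-th weighted sample quantile
    order = np.argsort(y)
    cum = np.cumsum(w[order])
    return y[order][np.searchsorted(cum, tau * cum[-1])]

rng = np.random.default_rng(0)
N = 40_000
X1 = rng.normal(size=N)

# Continuous potential outcomes with known marginal medians (1 and 0)
Y1 = 1.0 + X1 + rng.normal(size=N)
Y0 = X1 + rng.normal(size=N)

# Nonrandom assignment and MAR missingness, both driven by X1
G = 1 / (1 + np.exp(-0.8 * X1))
W = rng.binomial(1, G)
R = 1 / (1 + np.exp(-(0.5 + 0.5 * X1)))
S = rng.binomial(1, R)
Y = np.where(W == 1, Y1, Y0)

# Doubly weighted marginal quantiles for each group
obs = S == 1
w1 = (W / (R * G))[obs]
w0 = ((1 - W) / (R * (1 - G)))[obs]
tau = 0.5
q1 = weighted_quantile(Y[obs], w1, tau)
q0 = weighted_quantile(Y[obs], w0, tau)
uqte_hat = q1 - q0   # true UQTE at the median is 1 by construction
```

An unweighted difference in medians on the observed sample would be biased here, because treated and observed units systematically have larger X1.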

Simulations
This section compares the empirical distributions of the ATE and QTEs using the unweighted, ps-weighted, and d-weighted estimators.^20 The discussion is centered around three common misspecification scenarios that are interesting from an empirical standpoint. These cases are enumerated in tables A.1 and A.2 for estimating the ATE and QTEs, respectively. Two of them describe situations implicit in the first and second halves of the asymptotic theory, whereas the third case considers all three parametric components of the framework to be misspecified. Even though the theory developed in this paper is silent on the third case, the simulation results appear promising.

Average treatment effect: Results
Case (1) in Table A.1 considers a misspecified mean function but correct probability weights. This is the principal case covered in section 4 wherein weighting is crucial. As one can see, the empirical distribution of the doubly weighted estimator is centered on the true ATE whereas that for the unweighted estimator is shifted to the right (see figure A.1, Case 1).
Case (2) looks at what happens when everything, the conditional mean and the two weights, is misspecified. The theory in this paper does not address this particular case. However, it characterizes an interesting possibility, given that misspecification of all components is a valid concern. The simulation results do offer some insight here. The doubly weighted estimator appears to be the only choice that delivers the true ATE on average, whereas the other distributions are shifted away from the truth (see figure A.1, Case 2).
Finally, case (3) depicts the possibility of a correctly specified conditional mean function but misspecified weights. Here weighting does not have any bite in resolving the identification issue, beyond what is already achieved from having a correct mean function. In figure A.1, case 3, the empirical distributions of the estimated ATE for the unweighted, ps-weighted, and d-weighted estimators all coincide and are centered on the true ATE.

Quantile treatment effects: Results
As discussed earlier, there are really three parameters worth discussing when one talks about QTEs: the CQTE, the LP to the CQTE, and the UQTE. Misspecification in the CQF shifts attention to consistently estimating a linear projection to the true CQTE. The first case in Table A.2 considers exactly such a scenario. Using the results in Angrist et al. (2006), I interpret the solution to the doubly weighted problem given in (21) as providing a consistent weighted linear projection to the true CQF, which is then used to estimate an LP to the true CQTE. Case 1 of Figure A.2 plots the bias in the estimated LP relative to the true LP as a function of X_1 for the three estimators. Note that weighting here is crucial for consistently estimating the LP. The relative bias of the doubly weighted estimator is the lowest of all and coincides with the line of no bias. Case 2 considers the situation where, along with a misspecified CQF, the weights are also wrong. We still find the proposed estimator performing best in terms of bias.

Finally, figure A.3 considers a correctly specified CQF, in which case we can estimate the CQTE itself.^21 One can observe in the figure that the estimated function coincides with the true CQTE irrespective of how we weight: all three estimators, unweighted, ps-weighted, and doubly weighted, are consistent for the true CQTE, and misspecification in the weights does not affect this result. I also consider direct estimation of the UQTE, which does not require a parametric specification of the CQF since it is simply a difference of marginal quantiles. The two weights are therefore the only components of the framework that affect consistency of the UQTE. In figure A.4, case 1, when both weights are correct, not weighting and double weighting both bring us close to the true parameter. For the second case, where both probability models are misspecified, double weighting does a little worse than not weighting at all. However, the results at other quantiles reflect more favorably upon double weighting (see section H of the online appendix for results at the 50th and 75th quantiles). Propensity score weighting performs the worst in both cases, suggesting instances where weighting for nonrandom assignment after dropping missing data may not be the preferred alternative.
Returns to job training

In this section, I apply the proposed estimator to the Aid to Families with Dependent Children (AFDC) sample of women from the National Supported Work (NSW) program compiled by Calónico and Smith (2017) (CS, hereafter). NSW was a transitional and subsidized work experience program which was implemented as a randomized experiment in the United States between 1975 and 1979. CS replicate LaLonde (1986)'s within-study analysis for the AFDC women in the program, where the purpose of such an analysis is to evaluate how training estimates obtained using non-experimental identification strategies (for example, CIA) compare to experimental estimates. To compute the non-experimental estimates, CS combine the NSW experimental sample with two non-experimental comparison groups drawn from the PSID, called PSID-1 and PSID-2.^22 In this paper, I utilize the within-study feature of this empirical application to estimate how close the doubly weighted estimates get to the experimental estimate compared with the ps-weighted and unweighted estimates.
To construct these empirical bias measures, I first augment the CS sample to allow for women who had missing earnings information in 1979. This renders 26% of the experimental and 11% of the PSID samples missing. I then combine the experimental treatment group of NSW with three distinct comparison groups present in the CS dataset, namely, the experimental control group and the two PSID samples, to compute the unweighted, ps-weighted, and d-weighted training estimates.^23 The difference between the non-experimental estimate obtained using the doubly weighted estimator and the experimental estimate provides the first measure of estimated bias associated with the proposed strategy. Combining the experimental control group with a non-experimental comparison group gives a second measure of estimated bias [Heckman et al. (1998)]. Much like CS, I report both sets of estimates across a range of regression specifications for the average returns to training.
Given the growing importance of estimating the distributional impacts of job training programs, I also estimate the returns to training at every 10th quantile of the 1979 earnings distribution. The role of double weighting is highlighted for the case of estimating marginal quantiles since the covariates, which primarily serve to remove biases arising from nonrandom assignment and missing outcomes, enter the estimating equation only through the two weights.

^22 The PSID-1 sample constructed by CS keeps all women who were continuously female household heads from 1975 to 1979, were between 20 and 55 years of age in 1975, and were not retired in 1975. The sample labeled PSID-2 further restricts PSID-1 to include only those women who received AFDC welfare in 1975.
^23 For details regarding sample construction and estimation of the weights, see section E of the online appendix.

Results
First, to evaluate whether women with missing earnings in 1979 were significantly different from those who were observed, Table A.2 reports the mean and standard deviation of the women's age, years of schooling, pre-training earnings, and other characteristics across the observed and missing samples. In terms of age, the women who were observed in the experimentally treated group of NSW and the PSID-1 sample were, on average, older than those who were missing. The observed women in PSID-1 were also more likely to be married. For the PSID-2 sample, women who were observed had, on average, more kids and higher pre-training earnings. Apart from these minor differences, the observed women did not appear to be systematically different from those who were missing, as measured through observable characteristics.

The presence of non-experimental control groups implies that nonrandom assignment is also an issue in the sample. This is because the comparison groups were drawn from the PSID after imposing only a partial version of the full NSW eligibility criteria. Table A.1 provides descriptive statistics for the covariates by treatment status. As can be expected, the treatment and control groups of NSW are not observably different, indicating the strong role that randomization plays in producing comparable groups. In contrast, the women in the PSID-1 and PSID-2 groups are statistically different from the treatment group members, implying substantial scope for nonrandom assignment.

Table A.3 reports the d-weighted, ps-weighted, and unweighted average returns to training estimates using three different comparison groups: NSW control, PSID-1, and PSID-2. The unweighted (unadjusted and adjusted) experimental estimates, given in row 1, are the same as the estimates reported by CS in Table 3 of their paper.
Overall, one can see that the doubly weighted experimental estimates are more stable than the single weighted or unweighted estimates across the different regression specifications, with a range between $824-$828.
For computing the ps-weighted and d-weighted non-experimental estimates, I first trim the sample to ensure common support between the treatment and comparison groups.^24 This reduces the sample size from 1,248 to 1,016 observations for the PSID-1 estimates and from 782 to 720 observations for the PSID-2 estimates. A pattern that is consistent across the two sets of non-experimental estimates is that weighting gets us much closer to the benchmark relative to not weighting at all. For instance, the unweighted simple difference-in-means estimate of training, which uses the PSID-1 comparison group, is -$799, whereas the weighted estimates are $827 and $803. For the PSID-2 comparison group, the unweighted estimate which controls for all covariates is $335, whereas the weighted estimates are $905 and $904.

The second panel of Table A.3 reports the bias in training estimates from combining the experimental control group with the PSID comparison groups. A similar pattern is seen here, with the weighted bias estimates being much closer to zero than the unweighted estimates. For instance, the doubly weighted estimate that adjusts for all covariates using the PSID-1 comparison group is -$21, whereas the unweighted estimate is -$568. These results suggest that the argument for weighting is strong when using a non-experimental comparison group where nonrandom assignment and missing outcomes are significant problems.^25

Figure A.5 plots the relative bias in the UQTE estimates at every 10th quantile of the 1979 earnings distribution. Much like the average training estimates, the weighted estimates consistently lie below the unweighted estimates for most quantiles, irrespective of whether we use the PSID-1 or PSID-2 non-experimental group. Note that I do not plot UQTE estimates for quantiles below 0.46, since these are all zero.^26

This empirical application illustrates the role of the proposed estimator in both experimental and observational data contexts. The comparison involving the treatment and control groups of NSW demonstrates its use in an experiment with missing outcomes, whereas the non-experimental sample demonstrates its use in the more realistic observational data setting.

^24 Appendix E describes estimation of the two probability weights along with the sample trimming criteria.

Conclusion
In empirical research, the problems of nonrandom assignment and missing outcomes threaten identification of causal parameters. This paper proposes a new class of consistent and asymptotically normal M-estimators that address these two issues through a double weighting procedure. The method combines propensity score weighting with weighting for missing outcomes in a general M-estimation framework, which can be applied to a range of estimation methods, such as ordinary least squares, quasi-maximum likelihood, and quantile regression. In addition, the proposed class has a robustness property which allows us to estimate meaningful causal quantities of interest despite misspecification in either a conditional model of interest or the two weighting functions. As leading applications, the paper discusses estimation of the ATE and QTEs. A Monte Carlo study indicates that the doubly weighted estimates of average and quantile treatment effects have the lowest bias compared to naive alternatives (unweighted or propensity score weighted estimators) under three realistic cases of misspecification. Finally, the estimator is applied to data on AFDC women from the NSW program compiled by Calónico and Smith (2017). The presence of experimental and non-experimental comparison groups in this application helps to quantify the estimated bias in the doubly weighted returns to training estimates as well as in the other two estimators.
Since the severity and magnitude of bias introduced from ignoring either problem cannot be assessed ex-ante, a safe bet from the practitioner's perspective is to report both doubly weighted and unweighted causal effect estimates. Practically, the doubly weighted estimator for the ATE is easy to implement. Appendix D provides an example code that uses Stata's gmm command for implementing it. Computation of analytically correct standard errors, however, requires additional coding and is still a work in progress. Alternatively, one can use bootstrapped standard errors which will provide asymptotically correct inference.
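As an illustration of the bootstrap route, here is a minimal Python sketch of a nonparametric pairs bootstrap (this is not the paper's Stata `gmm` code; the toy data layout and the Hájek-style doubly weighted estimator below are my own illustrative assumptions).

```python
import numpy as np

def pairs_bootstrap_se(data, estimator, B=200, seed=0):
    # Pairs bootstrap: resample rows with replacement, re-estimate, take the std deviation
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.array([estimator(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    return stats.std(ddof=1)

# Toy estimator: Hajek-style doubly weighted difference in means, with data
# columns [Y, w1, w0]; the weights w1, w0 already fold in S, W, R(.), and G(.)
def dw_ate(d):
    y, w1, w0 = d[:, 0], d[:, 1], d[:, 2]
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

rng = np.random.default_rng(1)
n = 2_000
W = rng.binomial(1, 0.5, size=n)
y = 1.0 * W + rng.normal(size=n)
data = np.column_stack([y, W / 0.5, (1 - W) / 0.5])
se = pairs_bootstrap_se(data, dw_ate, B=200)
```

One design note: for the bootstrap to mimic the full estimation uncertainty, the two probability weights should in practice be re-estimated inside each bootstrap replication rather than held fixed as in this toy example.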
Even though missing outcomes are a common concern in empirical analysis, it is equally common to encounter missing data on the covariates. A particularly important future extension can be to allow for missing data on both. In this case, using a generalized method of moments framework which incorporates information on complete and incomplete cases could provide efficiency gains over just using the observed data. A different possibility would be to relax the identifying restrictions to allow for selection on unobservables and possibly explore estimation of local average treatment effect (LATE).
References

HIRANO, K. AND G. W. IMBENS (2001): "Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization," Health Services and Outcomes Research Methodology, 2, 259-278.

Notes to Figure A.5: The treatment and missing-outcome propensity score models have been estimated as flexible logits, and the samples used for constructing these estimates have been trimmed to ensure common support across the two groups. The treatment propensity score has been estimated using the full experimental sample along with either the PSID-1 or PSID-2 comparison group. The UQTE estimates for τ < 0.46 are omitted from the graph since these are zero.

Notes to Table A.1: Along with the covariate means and standard deviations (in parentheses), the table also reports p-values from tests of equality of two means. Column 4 tests for differences between the NSW treatment and control groups; columns 6 and 8 report the same using the PSID-1 and PSID-2 comparison groups, respectively. Real earnings in 1975 are expressed in 1982 dollars.

Notes to Table A.3: This table reports unadjusted and adjusted post-training earnings differences between the NSW treatment group and three different comparison groups, namely, NSW control, PSID-1, and PSID-2. The first row reports experimental training estimates combining the NSW treatment and control groups, whereas the second and third rows report non-experimental estimates computed using the PSID-1 and PSID-2 groups, respectively. Each non-experimental estimate should be compared to the experimental benchmark. The second panel of the table reports bias estimates computed from combining the NSW control group with the PSID-1 and PSID-2 comparison groups, respectively. These represent a second measure of bias, which should be compared to zero. Bootstrapped standard errors are given in parentheses and have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates have been trimmed to ensure common support in the distribution of weights for the treatment and comparison groups. For more detail, see appendix E.

Abstract
In this online appendix, section A provides details of the simulation study. Section B discusses an extension of the doubly weighted framework to estimating unconditional quantile treatment effects using recentered influence functions. Section C provides a simple extension to the case when the treatment assumes multiple values. Section D provides the asymptotic variance expressions for the average treatment effect under the first and second halves of the asymptotic theory. Section E provides background information on the National Supported Work demonstration, describes how Calónico and Smith (2017)'s sample is augmented to allow for missing information, and gives the trimming rules for the probability weights. Section F contains proofs for the results in the main text. Finally, sections G and H provide supplementary tables and figures, respectively.

A Simulation details
This section outlines the details of the simulation study for evaluating the finite sample behavior of the unweighted, ps-weighted, and d-weighted (doubly weighted) estimators of the ATE and QTE parameters. For each data generating process, the population is generated using a million observations. The empirical distributions of the ATE and QTE estimators are simulated by drawing random vectors {(Y_i, X_i, W_i, S_i); i = 1, 2, . . . , N} of size N a thousand times without replacement from the population. This is done to mimic the setting of "random sampling" from an infinite population.

A.1 Average treatment effect
To allow for possible misspecification of the regression functions E[Y(g)|X], I simulate two binary potential outcomes generated using a probit,

Y(g) = 1{Xθ_0g + E(g) > 0},  g = 0, 1.

Note that X here includes an intercept. The linear index, Xθ_0g, is parameterized so that the covariates are only mildly predictive of the potential outcomes, with R²_0 = 0.19 and R²_1 = 0.14 in the population.^1 The two covariates and the two latent errors are drawn from two independent bivariate normal distributions, as in (A.1). The assignment and missing outcome mechanisms have been simulated to ensure that unconfoundedness and MAR are satisfied, with the errors ν and υ drawn from two independent standard logistic distributions.^2 Misspecification in the true assignment and missing outcome distributions is allowed in both the functional form and the linear index, where for the misspecified cases I estimate a probit with X_1 omitted from the index. For scenarios where the conditional mean is misspecified, I estimate a linear model with a correct index. The parameters γ_0 and δ_0, indexing the assignment and missingness mechanisms, have been chosen so that the average propensity of assignment is 41% and the average propensity of being observed is 38%.^3 The missing data have been simulated to imitate empirical settings where a significant portion of the outcomes is missing. The following table gives an estimation summary for the eight different cases of misspecification.
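A Python sketch of this data generating process is given below. The probit outcome indices use the θ_0g values stated in footnote 1, but the assignment and missingness index parameters (`gamma0`, `delta0`) are illustrative placeholders rather than the paper's chosen values, so the simulated assignment and observation rates will not match the reported 41% and 38%.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000  # population size, as in the text

# Two covariates and two latent probit errors (independent normals)
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
E0, E1 = rng.normal(size=N), rng.normal(size=N)

# Binary potential outcomes via a probit, with the text's index parameters
theta0, theta1 = np.array([0.0, 1.0, 1.0]), np.array([-1.0, 1.0, 1.0])
Y0 = (X @ theta0 + E0 > 0).astype(int)
Y1 = (X @ theta1 + E1 > 0).astype(int)

# Logistic assignment and missingness: unconfoundedness and MAR hold by
# construction (gamma0 and delta0 are illustrative, not the paper's values)
gamma0 = np.array([-0.5, 0.5, 0.3])
W = (X @ gamma0 + rng.logistic(size=N) > 0).astype(int)
delta0 = np.array([-0.6, 0.4, 0.0])
S = (X @ delta0 + 0.2 * W + rng.logistic(size=N) > 0).astype(int)

Y = np.where(W == 1, Y1, Y0)
Y_obs = np.where(S == 1, Y, np.nan)   # outcome is missing when S = 0
ate_true = (Y1 - Y0).mean()
```

With these indices, the ATE is negative since the treated index shifts the intercept from 0 down to -1.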

[Table: Model vs. Estimation summary for each misspecification case]
Notes: C and M correspond to whether the estimated model is correctly specified or misspecified. X and Z both include an intercept. X(1) and Z(1) are the subsets of X and Z left after omitting X_1. G(·) refers to the propensity score model and R(·) refers to the missing outcomes probability model.

^1 Here θ_00 = (0, 1, 1)′ and θ_01 = (−1, 1, 1)′. With cross-sectional data, covariates are typically only mildly predictive of the outcome. For example, in the National Supported Work dataset from Calónico and Smith (2017), baseline factors explain about 26-50 percent of the variation in the non-experimental sample and about 0.04-2 percent in the experimental sample, depending upon the included subset of covariates.

A.2 Quantile treatment effects
To ensure that the marginal quantiles of the potential outcome distributions are unique with no flat spots, I simulate two continuous non-negative outcomes as

Y(g) = exp(Xθ_0g + U(g)),  g = 0, 1,

where θ_01 = (0.1, −0.36, −0.1)′ and θ_00 = (0.2, 0.24, −0.45)′ are parameterized to ensure R²_0 = 0.15 and R²_1 = 0.13 in the population. The two covariates and the two latent errors are drawn from two independent normal distributions following (A.1). The missing outcomes and treatment assignment mechanisms are also generated according to eq. (A.2). Since exp(·) is an increasing continuous function, the equivariance property of quantiles implies that

Q_τ(Y(g)|X) = exp(Xθ_0g + σ_g Φ^{-1}(τ)),

where Φ^{-1}(τ) is the inverse standard normal CDF evaluated at τ and σ_g is the standard deviation of U(g). This equivariance property helps to characterize and estimate the CQTE for cases when the CQF is correct. The three different cases of misspecification are enumerated in Table A.2 below. Case 1 corresponds to the situation for which results are derived in section 4; Case 2 allows for misspecification in both the conditional quantile function and the probability weights. Even though the theory in this paper does not address that specific case, the simulation results show that the proposed estimator has the lowest bias among the three alternatives. Finally, Case 3 relates to situations considered in section 5: a correct CQF but misspecified weights.
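The equivariance property can be checked numerically: for the increasing transform exp(·), the τ-th quantile of exp(Z) equals exp of the τ-th quantile of Z. A short sketch (simulated values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Quantiles are equivariant to increasing transforms: Q_tau(exp(Z)) = exp(Q_tau(Z))
z = rng.normal(loc=0.3, scale=1.2, size=200_000)
tau = 0.75
lhs = np.quantile(np.exp(z), tau)   # quantile of the transformed variable
rhs = np.exp(np.quantile(z, tau))   # transform of the quantile
```

Both quantities also match the closed form exp(0.3 + 1.2 Φ^{-1}(0.75)) up to sampling error, which is exactly the structure used to characterize the CQF above.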

[Table A.2: Model vs. Estimation summary for each misspecification case]
Notes: C and M denote whether the estimated model is correctly specified or misspecified. X and Z both include an intercept. X (1) and Z (1) are the subsets of X and Z left after omitting X 1 . Therefore, the probability models have been misspecified in both the functional form and the linear index dimension. G(·) refers to the propensity score model and R(·) refers to the missing outcomes probability model.
For plotting the estimated and true CQTE functions, I first collect the estimates that solve the unweighted, ps-weighted, and doubly weighted CQR problem (defined in (19)) corresponding to a particular quantile level, τ = 0.25, 0.50, 0.75 across 1,000 Monte Carlo simulation draws. I then draw a linearly spaced vector of values for X 1 and simulate the CQTE using the 1,000 estimated conditional quantile coefficients. Averaging these 1,000 functions at each point on the X 1 vector gives me the estimated average CQTE function. I plot this along with the 1,000 individual functions and the true CQTE, which is calculated using the population conditional quantile parameters, θ 0 g .

B Unconditional quantile treatment effect using recentered influence functions
This section discusses an alternative method of estimating the UQTE using Firpo et al. (2009)'s (FFL, hereafter) recentered influence function (RIF) methodology. Following FFL, let v(F) be a real-valued functional whose domain is a class of distribution functions containing those F with v(F) < +∞. One may define v(·) to be any distributional statistic of interest, such as the mean, variance, quantiles, or inequality indices. We can define various treatment effects as the difference in the functionals of the marginal outcome distributions,

∆_v = v_1 − v_0,

where v_g ≡ v(F_g) is the functional of the distribution function for Y(g).^4 As defined in FFL, the RIF is nothing but the influence function recentered at the statistic v_g. Formally,

RIF(Y(g); v, F_g) = v_g + IF(Y(g); v, F_g),

where IF(Y(g); v, F_g) captures the change in v_g resulting from an infinitesimal change in the distribution of Y(g). FFL introduce the idea of running a standard regression of the RIF on X with the objective of estimating the function E[RIF(Y(g); v, F_g)|X] = Xθ_0g. One can then use the law of iterated expectations to express v_g in terms of the regression function,

v_g = E[RIF(Y(g); v, F_g)] = E[X]θ_0g.   (B.2)

For v_g = Q_{τ,g}, equation (B.2) yields the UQTE for the τ-th quantile as E[X](θ_01 − θ_00). We know that the RIF for Q_{τ,g} is given as

RIF(Y(g); Q_{τ,g}, F_g) = Q_{τ,g} + (τ − 1{Y(g) ≤ Q_{τ,g}})/f_g(Q_{τ,g}),

where f_g(·) is the density of Y(g).^5 Estimation of the doubly weighted UQTE using RIFs then involves estimating Q_{τ,g} as a doubly weighted sample quantile, estimating f_g(Q_{τ,g}) with a doubly weighted nonparametric kernel density estimator with bandwidth h_g, constructing the estimated RIF, and running a doubly weighted regression of the RIF on X,
where double weighting has to be performed at each stage that uses the observed sample. This implies that, to ensure consistency of the UQTE, the weights necessarily have to be correctly specified. One may estimate the weights nonparametrically using sieves to sidestep this misspecification issue. Estimating the UQTE in this manner also has the advantage, initially put forth in FFL, that one can directly estimate the effect of covariates on the UQTE.
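The quantile RIF itself can be sketched in a few lines of Python (the helper name, the Gaussian kernel, and the rule-of-thumb bandwidth below are my own illustrative choices):

```python
import numpy as np

def rif_quantile(y, w, tau, bandwidth=None):
    # RIF for a marginal quantile: RIF = Q_tau + (tau - 1{y <= Q_tau}) / f(Q_tau),
    # with Q_tau and the density f both estimated using the weights w
    order = np.argsort(y)
    cum = np.cumsum(w[order])
    q = y[order][np.searchsorted(cum, tau * cum[-1])]   # weighted tau-quantile
    if bandwidth is None:
        bandwidth = 1.06 * y.std() * len(y) ** (-1 / 5)  # rule-of-thumb bandwidth
    k = np.exp(-0.5 * ((y - q) / bandwidth) ** 2) / np.sqrt(2 * np.pi)
    f_q = np.sum(w * k) / (np.sum(w) * bandwidth)        # weighted kernel density at q
    return q + (tau - (y <= q)) / f_q

# Sanity check on unweighted data: the mean of the RIF recovers the quantile
rng = np.random.default_rng(4)
y = rng.normal(size=50_000)
w = np.ones_like(y)
rif = rif_quantile(y, w, 0.5)
```

Averaging the (weighted) RIF reproduces the quantile, since E[τ − 1{Y ≤ Q_τ}] = 0; in the doubly weighted procedure, the same weights would then enter the regression of the RIF on X.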

C Multivalued Treatments
One can easily extend the binary treatment case considered here to the case when there are multiple treatment values. Let Y(g) denote the potential outcome for treatment level g, where g = 0, 1, . . . , T, and let W_g be a binary indicator for receiving treatment level g, such that Σ_{g=0}^{T} W_g = 1. Let ρ_g(x) ≡ P(W_g = 1|X = x) be the propensity score and r(x, w) ≡ P(S = 1|X = x, W_g = w) be the missing outcomes probability for treatment level g. One may then consider solving the same population problem, Q_0(θ_0), but with the true weights for treatment level g given as S W_g/[r(X, W_g) ρ_g(X)]. To construct the doubly weighted estimator, we would assume unconfoundedness and MAR along with parametric models for the two probability weights, R(X, W_g, δ) and G(X, γ_g).
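A small sketch of how these multivalued-treatment weights could be assembled (function and variable names are illustrative, and the propensities and missingness probabilities are taken as known):

```python
import numpy as np

def double_weights(S, W_onehot, rho, r):
    # Doubly weighted M-estimation weights for a multivalued treatment:
    # for level g, weight_ig = S_i * W_ig / (r_i * rho_ig)
    return (S[:, None] * W_onehot) / (r[:, None] * rho)

# Toy example with T + 1 = 3 treatment levels
rng = np.random.default_rng(3)
N = 9
levels = rng.integers(0, 3, size=N)
W_onehot = np.eye(3)[levels]        # N x 3 indicator matrix, rows sum to 1
rho = np.full((N, 3), 1 / 3)        # assumed known propensities for each level
r = np.full(N, 0.8)                 # assumed known P(S = 1 | X, W_g)
S = rng.binomial(1, r)
w = double_weights(S, W_onehot, rho, r)
```

Each row of `w` has at most one nonzero entry, corresponding to the level actually received, and is zero entirely when the outcome is missing; the column-g weights would then multiply the level-g objective function.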

D Asymptotic Variance of the ATE

Given √N-consistent and asymptotically normal estimators, θ̂_1 and θ̂_0, the estimated average treatment effect Δ̂ is easily shown to also be √N-consistent and asymptotically normal [Wooldridge (2010), chapter 21]. Regularity conditions for such an asymptotic result require that the parametric model m(X, θ_g) be continuously differentiable on the parameter space Θ_g ⊂ R^{P_g} and that θ_g^0 lie in the interior of Θ_g. Then, by the continuous mapping theorem and Slutsky's theorem, the asymptotic variance follows, where H_g is the Hessian for treatment group g and u_ig is the residual from the regression of the weighted score on the scores of the two probability models. When the conditional mean model is correctly specified, the variance expression simplifies. Here V_1 and V_0 are the asymptotic variances of the doubly weighted estimators that solve the treatment and control group problems, respectively. The formula makes it clear that it is better to use more efficient estimators of θ_g^0. But we know from the results in section 5 that when the conditional mean model is correctly specified, using estimated weights is as efficient as using known weights. Another alternative in this case is to use unweighted estimators of θ_g^0, since under GCIME the unweighted estimators are more efficient than the doubly weighted estimators of θ_g^0. When the mean model is misspecified, the asymptotic variance of the ATE is a bit more complicated than in the previous case. Even though it is better to have more efficient estimators of θ_g^0 in this case as well, it is not obvious whether that would yield a smaller variance for the ATE, since the variance expression now contains cross-correlation terms.
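As a concrete illustration of how Δ̂ and a standard error might be computed in practice, the following sketch averages the difference in fitted means and bootstraps the result. This is a simplification: a full implementation would re-estimate θ̂_g and the weights within each bootstrap draw, as is done for the bootstrapped standard errors reported in the application.

```python
import numpy as np

def ate_from_fits(m1, m0):
    """ATE as the average difference of fitted means over the full sample:
    Delta_hat = N^{-1} sum_i [ m(X_i, theta_1_hat) - m(X_i, theta_0_hat) ]."""
    return np.mean(m1 - m0)

def bootstrap_se(m1, m0, B=1000, seed=0):
    """Nonparametric bootstrap SE for Delta_hat. Resampling fitted values
    only is a shortcut; in practice one would re-fit theta_g and both
    probability models on each bootstrap draw."""
    rng = np.random.default_rng(seed)
    n = len(m1)
    draws = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # resample observations with replacement
        draws[b] = np.mean(m1[idx] - m0[idx])
    return draws.std(ddof=1)
```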

D.1 Proofs
Asymptotic variance expression for the ATE: correctly specified mean model. Assuming continuous differentiability of m(X_i, θ_g) on Θ_g, a mean value expansion around θ_g^0 gives the representation above, where θ̄_g lies between θ̂_g and θ_g^0. Since θ̂_g →p θ_g^0, so does θ̄_g. Hence, using the weak law of large numbers, adding and subtracting the relevant terms, and using the asymptotic results from section 5, where we posit that the conditional feature of interest is correctly specified, we may rewrite the expression using the influence function representation. Note that the covariance term between l_i1 and l_i0 is zero since they are the scores of the separate treatment and control group problems. The covariance terms involving m(X_i, θ_1^0) − m(X_i, θ_0^0) − Δ_ate and l_ig are also zero: using the fact that E[h(Y_i(g), X_i, θ_g^0)|X_i] = 0, which holds because θ_g^0 solves the conditional problem, along with the LIE, those covariance terms can be shown to vanish.

Misspecified mean model
In the case of a misspecified mean model, we still have the mean value expansion above. The stated asymptotic variance then follows using the results from section 4.

E.1 Description of National Supported Work Program
The NSW was a transitional and subsidized work experience program mainly intended to target four sub-populations: ex-offenders, former drug addicts, women on AFDC welfare, and high school dropouts. The program became operational in 1975 and continued until 1979 at fifteen locations in the United States. In ten of these sites, the program operated as a randomized experiment in which individuals who qualified for the training program were randomly assigned to either the treatment or the control group. At the time of enrollment in April 1975, individuals were given a retrospective baseline survey, which was followed by four follow-up interviews conducted at nine-month intervals. The survey data were thus collected through baseline and follow-up interviews over a period of four years. The data included measurements of baseline covariates such as age, years of education, number of children in 1975, high school dropout status, marital status, two race indicators for black and Hispanic sub-populations, and other demographic and socio-economic information. The main outcome of interest was real earnings in the post-training year of 1979.

E.2 Augmenting the CS sample to account for missing earnings in 1979
I obtain the data from CS's supplementary data files in the Journal of Labor Economics, where the authors recreate the experimental sample of AFDC women using the raw public use data files maintained by the Inter-University Consortium for Political and Social Research (ICPSR). I then use the PSIDcross file provided by CS, along with other supplementary data files, to add back the individuals whom CS originally dropped from the analysis for not having valid earnings information between 1975 and 1979. For this, I apply the same filters that CS use to match their PSID samples to the ones used by LaLonde (1986). These filters keep all women who were continuously female household heads from 1975 to 1979, were between 20 and 55 years of age in 1975, and were not retired in 1975. This constitutes the first non-experimental sample that CS use in their analysis, which they call the PSID-1 sample. The second PSID sample, which they label PSID-2, further restricts the PSID-1 sample to only those women who received AFDC welfare in 1975. In order to compare my sample with the original sample used by CS, I first apply all of the above-mentioned filters and create a dummy variable that I call "cs". Next, I remove the filter requiring the women to be continuous household heads and instead impose that filter only for 1975 and 1976. The reason this filter is imposed for 1975 and 1976 but not for any other years is that, in the PSID datasets, the income information in a particular year corresponds to the previous calendar year. Merging the cross-file with the separate single-year files for 1975 and 1976 therefore guarantees that only those women are included who have no missing earnings information for the pre-training years of 1974 and 1975. This is important since pre-training earnings are treated like any other baseline covariate in this paper, on which I do not allow any missing information.
After merging the cross-year individual file with the single-year family files, I merge this PSID dataset with the NSW dataset using CS's .do files and generate the various sample dummies in essentially the same manner as they do. I then further restrict the sample to only those women who have valid earnings information in 1975, the pre-training year for AFDC women. I also drop cases where the measured age or education is less than zero. In order to ensure that any observations not used by CS correspond only to those with missing post-program earnings, I also drop observations that do not satisfy the CS criteria but have observed earnings in 1979.

E.3 Treatment and missing outcome probability specifications and sample trimming
In this application, I estimate three sets of treatment assignment and missing outcomes probability models, depending upon which comparison group is used for obtaining the estimates. For the experimental estimates, I use the experimental treatment and control groups to estimate the propensity score model. For the PSID-1 estimates, I take the NSW experimental observations as the treatment group and use PSID-1 as the control group. For the PSID-2 propensity score model, I instead use PSID-2 as the comparison group. For the missing outcome probability models, I include the treatment indicator corresponding to the comparison group as described above. The probability models are estimated as logits and include the following covariates. For the treatment probability, I include real earnings in 1974 and 1975 along with an indicator variable for whether the individual had any zero earnings in 1974 and 1975. Beyond these, I also include age, age squared, education, high school dropout status, the race indicators for black and Hispanic, and the number of children in 1975. CS also add some interaction terms in their propensity score specification, which I do not: allowing for those terms drove the final weights for many women in the sample too close to 0 or 1. For the missing outcomes probability, I include the treatment indicator along with the same covariates. I kept the specifications the same for the three sets of probabilities I estimated. However, my regression specifications include the same covariates as CS to allow for some comparison across the analyses. These comparisons should be made with some caution: except for the estimates that use the NSW control group, all other estimates are obtained using samples that differ from the CS samples.
The final samples used to obtain estimates for the PSID comparison groups are trimmed to ensure common support for the weights in the treatment and comparison groups. For the PSID-1 group, this meant dropping observations with a final weight less than 0.03 or greater than 0.8. For the PSID-2 sample, this meant dropping observations with a final weight less than 0.1 or greater than 0.86. These final weights are the weights specified in the regression commands in Stata and are constructed as follows:

weight = (w/Ghat + (1-w)/(1-Ghat)) * (s/Rhat)

The trimming threshold for the ps-weighted estimates is kept the same as for the doubly weighted estimates, since the overlap problem was relatively more severe when using the composite weights than when using the propensity scores only. The graphs below plot the kernel densities of the probabilities Rhat * Ghat for the treatment group and Rhat * (1-Ghat) for the control group. The common support problem that motivated the trimming can be seen in figure E.1.
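The Stata weight construction and trimming can be mirrored in Python as a short sketch. Here `Ghat` and `Rhat` are taken as given fitted logit probabilities, and the trimming bounds are those reported in the text; the function names are illustrative.

```python
import numpy as np

def composite_weight(w, s, Ghat, Rhat):
    """Composite weight used in the weighted regressions:
        weight = (w / Ghat + (1 - w) / (1 - Ghat)) * (s / Rhat)
    w: treatment indicator, s: outcome-observed indicator,
    Ghat: fitted propensity score, Rhat: fitted P(S = 1 | X, W)."""
    return (w / Ghat + (1 - w) / (1 - Ghat)) * (s / Rhat)

def trim(sample, weight, lo, hi):
    """Drop observations whose weight falls outside [lo, hi] to enforce
    common support (the text uses lo=0.03, hi=0.8 for PSID-1 and
    lo=0.1, hi=0.86 for PSID-2)."""
    keep = (weight >= lo) & (weight <= hi)
    return sample[keep], weight[keep]
```

Observations with missing outcomes (s = 0) automatically receive zero weight, so they drop out of the weighted estimation stage.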
Additionally, figures E.2 and E.3 plot the estimated distributions of the propensity score and the missing outcomes probability, where panels (a)-(c) display these for the three treatment and comparison group combinations. A couple of points emerge from these graphs. In figure E.2, panel (a), the treatment and control distributions appear very similar, confirming the strong role of randomization in producing groups that are balanced in terms of covariates. In panel (b), the experimental observations have a relatively high probability of being treated whereas the control group has low probabilities. Note, however, that the common support condition holds quite strongly for the PSID-1 group. In panel (c), while the estimated distribution for the treated units still has a higher mean, the PSID-2 comparison group distribution is relatively more similar to it than the PSID-1 distribution in panel (b). These findings suggest that nonrandom assignment is predicted well by the covariates in the propensity score models. The same cannot be said for the estimated missing outcomes probabilities, where panels (b) and (c) reveal a strong overlap problem. Moreover, the treated units are less likely to have missing outcomes than the comparison groups.

Notes: The weights here correspond to the product of the estimated assignment and missing outcomes probabilities. Following CS (2017), I exploit the efficiency gain from combining the experimental treatment and control groups when estimating the treatment and missing outcome probability models. For the PSID-1 group, this means using the full experimental group as the treatment group and PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with PSID-2 as the control group.

F Proofs
Proof of Lemma 1. Let us first consider the argument for θ_1^0. By the LIE, followed by another application of the LIE along with unconfoundedness, we obtain the stated equalities, where the third equality follows from MAR and the fourth follows from part ii) of Assumption 3. The proof for θ_0^0 follows analogously.
Proof of Theorem 2. Explicit dependence on the data is suppressed for notational simplicity. Expanding ω̂_ig around ω_ig, where δ̄ lies between δ̂ and δ_0 and γ̄ lies between γ̂ and γ_0, consider the displayed bound: the first inequality holds by the Cauchy-Schwarz inequality, the second by the definition of the supremum, and the third by conditions iv) and vi). In the next display, the first inequality holds trivially and the second holds because of (F.4). An analogous argument shows that E[sup_{θ_g ∈ Θ_g, γ ∈ Γ} ||h(θ_g)d(γ)||] < ∞. Using the fact that ω_g(γ, δ) is continuous and bounded, along with the continuity of l(θ_g) (condition ii)) and of b(δ), d(γ) (condition iii) of Theorem 1), Lemma 4.3 in Newey and McFadden (1994) gives the stated convergence as γ̂ →p γ_0 and δ̂ →p δ_0. We then rewrite (7) using the influence function representations for γ̂ and δ̂ along with (F.5). The next part of the proof uses the theory of empirical processes to obtain asymptotic normality of the doubly weighted estimator. Using the definition in (11), along with the continuity of ω(γ, δ)h(θ_g), condition iv), and the DCT as (γ̂, δ̂) →p (γ_0, δ_0), we perform element-by-element mean value expansions of m*_N(θ̂_g) around θ_g^0, where θ̄_g lies between θ̂_g and θ_g^0. Since the population first order condition is zero at the truth, the second equality follows from dominance condition iv) and an application of Lemma 3.6 in Newey and McFadden (1994). Then, by the continuity of ∇_{θ_g} E[ω_ig h_i(θ_g)] (condition vi)), the continuous mapping theorem and condition viii), the asymptotic equivalence in (F.7), and stochastic equicontinuity (condition ix)), together with (F.6), applying (F.8) along with Slutsky's theorem yields the result.

Proof of Corollary 1. Since each component matrix in the displayed expression is positive semi-definite, the sum of the two matrices is also positive semi-definite.
Proof of Theorem 3. It has already been established that θ_g^0 solves E[ω_g* · q(Y(g), X, θ_g)]. The proof of uniform convergence follows similarly to the proof of Theorem 1, replacing ω_g by ω_g*. Consistency of θ̂_g for θ_g^0 then follows from Theorem 2.1 in Newey and McFadden (1994).
Proof of Theorem 4. The proof follows in the manner of Theorem 2, replacing ω_g by ω_g*. Also, Ω_g now denotes the variance of the score of the objective function, l_ig, without the first stage adjustment for the estimated weights. This is because E(l_ig b_i) = E(l_ig d_i) = 0, which follows since the conditional expectation of the score, E[h(Y(g), X, θ_g^0)|X] = 0, by strong identification of θ_g^0.

Proof of Corollary 2. This proof follows from the proof of Theorem 4 together with the asymptotic variance of the estimator that uses known weights, where Ω_g = E[l_ig l_ig']. The result follows immediately.
Proof of Corollary 3 (Efficiency gain with unweighted estimator under GCIME). Using two applications of the LIE and invoking MAR and unconfoundedness, I can rewrite the Hessian as displayed. Using another application of the LIE, I can rewrite the above further, and similarly I use the LIE to express Ω_1. For the unweighted estimator, the variance simplifies, and this happens precisely because of the GCIME. To see this, consider H_1^u: using the LIE, I can rewrite it, and similarly I can rewrite Ω_1^u, so that the asymptotic variance simplifies accordingly. To show that the difference of the two variances is positive semi-definite, consider the displayed expression, where the quantity inside the brackets is nothing but the variance of the residuals from the population regression of B_i on D_i. Hence, the difference is positive semi-definite. The results for g = 0 can be proven analogously.
F.1 Identification of ATE using pooled and separate slopes mean functions under second half of DR

Pooled slopes. Let us assume that m(X, θ_g) = h(Xθ + ηW) is the chosen mean function for E[Y(g)|X]. Then, in the presence of nonrandom sampling, we have the first order conditions displayed above, together with those for the weight parameters. Ignoring the last set of moment conditions, the population counterparts to the FOCs are as stated, where θ* and η* are the probability limits of the QMLE estimators θ̂ and η̂. Rearranging (F.9) and (F.10) implies that we can replace Y in the above two equations to obtain the LHS of (F.11). Using iterated expectations, we can rewrite the resulting equation and, due to MAR, split the conditional expectation into parts.
Note that W · E(S|X, W) = W · R; similarly, (1 − W) · E(S|X, W) = (1 − W) · R, and due to unconfoundedness we have E[Y(1)|X, W] = E[Y(1)|X] and E[Y(0)|X, W] = E[Y(0)|X]. Therefore, we can simplify the above expression. Another application of iterated expectations, followed by manipulating the RHS of (F.11) using iterated expectations and combining the LHS and RHS, gives the result. Now, consider the LHS of (F.12).
Similarly, using the LIE, the RHS of (F.12) can be rewritten; combining the LHS and RHS then gives the stated equality. Using (F.14) along with (F.13) implies the first result, and similarly we can show the analogous result for the control group. Hence, the pooled regression adjustment estimand can be written as displayed, so a consistent estimator of the QMLE pooled regression adjustment estimator can be obtained by replacing the population expectation with the sample average and weighting by the appropriate probabilities to recover the balance of the random sample.

Separate slopes. Let us assume that m(X, θ_g) = h(Xθ_g) is the chosen mean function for E[Y(g)|X].
Then the population FOCs are as displayed, where θ_g* are the probability limits of the QMLE estimators θ̂_g. Rearranging (F.16) and (F.17), just as in the pooled case, gives the following equalities.
Proceeding with the above two equations in the same way as in the pooled case gives the results. Therefore, Δ_ate^F = E[h(Xθ_1*)] − E[h(Xθ_0*)], and a consistent estimator of the QMLE separate regression adjustment estimator can be obtained as the sample average of h(X_i θ̂_1) − h(X_i θ̂_0).

Notes: This table reports unadjusted and adjusted pre-training (1975) earnings differences, where the first row reports the experimental estimates, which combine the NSW treatment and control groups. The second and third rows report non-experimental earnings estimates computed using the PSID-1 and PSID-2 comparison groups, respectively. The second panel of the table reports bias estimates computed by combining the NSW control group with the PSID-1 and PSID-2 comparison groups, respectively. Both the pre-training estimates and the bias estimates should be compared to zero. Bootstrapped standard errors are given in parentheses and have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates using the PSID-1 and PSID-2 comparison groups have been trimmed to ensure common support in the distribution of weights for the NSW-treatment and comparison groups. For more detail, see appendix E.

Notes: This table reports unweighted, ps-weighted, and d-weighted UQTE estimates for three different comparison groups, namely, NSW control, PSID-1, and PSID-2. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-1 estimates have been constructed using N = 1,185 and N = 1,016 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars. The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.
Notes: This table reports unweighted, ps-weighted, and d-weighted UQTE estimates for three different comparison groups, namely, NSW control, PSID-1, and PSID-2. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-2 estimates have been computed using N = 1,185 and N = 720 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars. The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups. The ps-weighted estimator weights to correct only for nonrandom assignment, while the d-weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with the nonrandom assignment and missing outcome problems.

Notes: This figure plots the empirical distributions of the unweighted, ps-weighted, and d-weighted UQTE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The average treated sample is N_1 = 5,000 × 0.41 × 0.38 = 779 and the average control sample is N_0 = 5,000 × (1 − 0.41) × 0.38 = 1,121. The unweighted estimator does not weight the observed data. The ps-weighted estimator weights to correct only for nonrandom assignment, while the d-weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with the nonrandom assignment and missing outcome problems.