Estimating Average Treatment Effects Utilizing Fractional Imputation when Confounders are Subject to Missingness

The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim 2011, has been implemented to handling missing values in regression context. In this article, we develop fractional imputation methods for estimating the average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via simulation study, we compare fractional imputation's accuracy and precision with that of multiple imputation.


Introduction
It is commonplace in scientific research for investigators to rely on observational data in order to address their questions of interest. While, randomized experiments are the gold standard for drawing causal inferences about the effect of a treatment (also known as exposure, regime or policy), in many cases, randomized experiments are difficult or infeasible to implement for logistical, financial or ethical reasons. For example, it would be unethical to force people smoke in order to study the causal effect of smoking on health outcome. Instead, researchers must utilize observational data and make careful corrections to address various biases.

Missing Data
Despite best intentions of researchers, missing data is near impossible to avoid in live settings. Fortunately, as prevalent as missingness is, so too are methods with which to address said missingness; however, the type of missingness matters when selecting a method.
Missing data comes about by ways of one of three mechanisms in observational data. The first and simplest type of missingness is that of Missing Completely at Random (MCAR) (Rubin, 1976). In this setting, whether or not an observation is missing is independent of both the observed and unobserved data. The second type of missingness comes under the name of Missing at Random (MAR). Here, whether or not an observation is missing is dependent on the data; however, the relationship can be made independent after conditioning on the observed data. Lastly, missing data can be Missing Not at Random (MNAR). In this setting, even after conditioning on all observed data, significant differences still exist between observed and unobserved observations.
Identifying which missingness type is present for a given data set is difficult to impossible to validate in observational data, but content experts with knowledge of the data can make reasonable arguments towards one of these assumptions or another. Most commonly the MAR assumption is used, and it is this assumption we carry forward for the purposes of our discussion.

Approaches to Address Missing Data
After data has been collected and the missingness assumption set, it falls on the researcher to determine how to address the missingness. Two of the most common approaches in practice today are those of complete case estimation (CC) and multiple imputation (MI); the latter of which was recommended by the National Research Council in 2010 as one of its preferred means of addressing missing data in clinical trials (Council, 2010). Under CC, all records with missing data are excluded, and treatment effects would be estimated only on fully observed cases. Under MCAR using CC can at least be justified; although, even if MCAR can be substantiated, by throwing out a portion of the data, the sample size shrinks and variance estimators inflate, leading to inaccurate measures of confidence. On the other hand, utilizing CC under MAR is known to be biased (White and Carlin, 2010) from the outset. MI is traditionally recommended for MAR instead.
With MI, the full joint distribution of the data is estimated (either empirically or modeled based off of distributional assumptions), and from this a series of M new imputed data sets is drawn. It is MI's intention to fill in each missing value with M imputed values by sampling from the posterior predictive distribution of the missing value given the observed values. Then, full sample analyses can be applied straightforwardly to each imputed data sets, and these multiple results are summarized by an easy-to-implement combining rule for inference (Rubin, 1987). Under reasonable conditions, MI can be used to obtain unbiased estimates of many parameters of interest. Where it lacks is in its ability to adequately estimate variances for these estimated quantities unless the imputation is proper. Meng (1994) showed that a sufficient condition for imputation to be proper is for the imputation model to be congenial.
Without question, MI is a useful approach for addressing missing data and has been shown to produce valid frequentist inference in a wide range of applications (C. Clogg, Rubin, Schenker, Schultz, and Weidman, 1991). On the one hand, MI has gained popularity in practice because of its intuitive appeal. On the other, Rubin's variance estimator for MI has been shown to not always be consistent (Fay, 1992, Kott, 1995, Fay, 1996, Binder and Sun, 1992, Wang and Robins, 1998, Robins and Wang, 2000, Feodor Nielsen, 2003, Kim, Michael Brick, Fuller, and Kalton, 2006, Yang and Kim, 2016c. Even when the imputation model is correctly specified, Yang and Kim (2016c) showed that MI is not necessarily congenial for method of moments estimation, so some common statistical procedures may be incompatible with MI. From a causal inference perspective, this poses a problem as the validity of Rubin's variance estimator has not been fully explored for many full sample estimation methods used widely in causal inference. Certain otherwise unbiased and consistent full sample causal inference methods (outcome regression, weighting, matching, etc.) may lose these properties when applied in conjunction with MI and MI-produced datasets. Many of the most common estimates for average treatment effects are based off of method of moments estimators and are thusly susceptible to inaccurate variance estimates when using Rubin's variance estimator for MI. For researchers desiring to make causal claims when utilizing MI it is imperative for the variance properties of their estimators to therefore be either validated or an alternative method must be proposed.

Fractional Imputation as an Alternative
As an alternative to CC and MI, there are likelihood-based methods that can be applied. When using these methods, the key insight is that under full confounders, the full sample estimators are obtained by solving estimating equations. In the presence of partially observed confounders, the corresponding estimators can be obtained by solving conditional estimating equations which integrate out the missing confounders given the observed data. There are two difficulties in this approach. First, it requires consistent estimators in the conditional distribution of the missing confounders given the observed data, such as MLE. In the presence of missing values, an EM algorithm is typically used. Second, numerical integration is needed. Integration approximated by imputation under nonignorable missing was considered by many authors, such as Monte Carlo EM method (Wei and Tanner, 1990). For Monte Carlo EM algorithm, in each E-step, the imputed values are regenerated, and thus the computation can be quite heavy. Also the convergence of Monte Carlo sequence of the estimators is not guaranteed for fixed Monte Carlo sample size (Booth and Hobert, 1999).
In practice, EM algorithms may not be feasible when the conditional expectation in the E-step is not available in a closed form. Instead, Fractional imputation (FI) has been proposed to serve as a computational tool for implementing the expectation step (E-step) in the EM algorithm (Wei and Tanner, 1990, Kim, 2011, Yang and Kim, 2016a, which simplifies computation by drawing on importance sampling to obtain the fractional weights and reducing the iterative computation burden over other simulation methods such as Markov Chain Monte Carlo. See Yang, Kim, and Shin (2013a), Yang, Kim, and Zhu (2013b), Kim and Yang (2014), Yang and Kim (2016b) for applications of FI outside of the causal inference context. The main idea in FI is to produce a complete data set by imputation and each imputed value is associated with a fractional weight, by which the observed likelihood can be approximated by the weighted average of the imputed data likelihood. The resulting estimator approximates the maximum likelihood estimator. After data has been fractionally imputed, we can estimate the treatment effect by applying commonly used estimators with fractional weights. Under common regularity conditions, we show that the FI estimators are consistent and asymptotically normal. The remainder of this paper focuses on expanding FI into the causal literature by developing an FI-based method for estimating causal treatment effects. Once developed we will validate the method relative to existing causal inference methods. Specifically we investigate the comparative performance of FI vs MI and CC for estimating causal treatment effects when confounders are subject to missingness.
The rest of the article is organized as follows. We begin in Section 2 with a description of notation and assumptions. In Section 3 we fully outline the FI process as well as derive the resulting variance estimator for the treatment effect estimator. We implement a simulation study in Section 4 comparing the accuracy and precision of treatment effect estimators when the missingness is addressed by CC, MI, and FI. Section 5 provides a demonstration of FI's utilization with a real-world health dataset. Finally, in Section 6, we end with a discussion of the results and of the implication they have on current and future causal work.

Treatment Effect Estimation Notation
Following Neyman (1923) and Rubin (1974) we use the potential outcomes framework. The treatment is denoted by A i ∈ {0, 1}, where 0 and 1 are labels for control and treatment respectively. For each subject i, define a pair of potential outcomes {Y i (1),Y i (0)} which represent the outcomes if the subject was treated, Y i (1), and if he or she was not, Y i (0). Implicit in this notation, we make the stable unit treatment value assumption (Rubin, 1978). The observed outcome for subject i is then Y i = Y i (A i ). Let X i be the vector of confounders for subject i. We assume that are independent and identically distributed. The conditional treatment effect is τ(X) = E{Y (1) − Y (0) | X}, and the average treatment effect is τ 0 = E{τ(X)}. The average treatment effect cannot be estimated without further assumptions, because for each subject only one potential outcome is observed. The common assumptions for identifying the average treatment effect (Rosenbaum and Rubin, 1983) are as follows: Assumption 1 (Ignorability) Y (a) ⊥ ⊥ A | X for a = 0, 1, where ⊥ ⊥ means "is conditionally independent of". Assumption 2 (Sufficient overlap) With probability 1, 0 < c 1 ≤ e(X) ≤ c 2 < 1, where e(X) = pr(A = 1 | X) is the propensity score.
Under Assumption 1, adjusting for covariates X creates a randomization-like scenario and removes confounding biases brought on by treatment selection. In practice, these covariates are often of high dimension. Alternatively, the propensity score has been proposed as a one-dimensional summary of X (Rosenbaum and Rubin, 1983). The central role of the propensity score lies in the fact that Assumption 1 implies Y (a) ⊥ ⊥ A | e(X) for a = 0, 1. Therefore, adjusting for the propensity score alone can remove confounding biases

Treatment Effect Estimation Under Fully Observed Data
In fact, reliance on the propensity score comes about naturally when we decompose the joint density of X, A, and Y into three particular components. Specifically Based off of this decomposition, a number of propensity score based estimators have been proposed for estimating the treatment effect including propensity score matching, subclassification, or weighting. See Imbens and Rubin (2015) for a textbook discussion. If we limit the class of propensity estimators to only parametric estimates (of the form e(X | θ ) where θ is the vector of parameters used to estimate the propensity score), the two most common estimates of τ from this class are that of Inverse Propensity Weighting (IPW) and Augmented IPW (AIPW). Both are illustrated here as examples.
In both examples, let e(X |θ ) be the estimated propensity score where θ has been estimated by some consistent estimatorθ . In practice, e(X |θ ) is typically a logistic regression model.

Example 1. IPW Estimation
The IPW estimate of τ 0 iŝ We adopt the U-estimation convention where we let U(τ; X, A,Y | η) be the estimating function for τ 0 under a given set of nuisance parameters η . An unbiased estimate for τ 0 can then be derived as the solution to P n U(τ; X, A,Y |η) = 0 whereη is a consistent estimator of η and P n is the empirical measure; namely As examples, estimating functions for IPW and AIPW are shown below.
Example 3. The estimating function ofτ IPW is Hereη = (θ ), the parameter estimates used to calculate the propensity scores.

Remark 1 (Doubly Robust Estimation)
The IPW estimate in Example 3 does not need to model the outcome Y , but it does require a correct model for e(X | θ ).
On the other hand, the AIPW estimate for τ 0 obtained from the mean estimating function in Example 4, incorporates a double robust (DR) feature for estimation. That is, U AIPW is an unbiased estimating function for τ 0 if either e(X | θ ) or µ(a, X | β ) is correctly specified.

Missing Data Notation
Whereas all of the above discussion holds under fully observed responses and confounders, in this article we consider the case where X contains missing values. To that end, let R be a collection of indicator variables R = (R 1 , ..., R p ) corresponding to (X 1 , ..., X p ) where R ji = 1 indicates that X j is observed for subject i and R ji = 0 indicates X j is not observed for subject i. Collecting a set of observed indicator vectors results in a missingness pattern denoted by R. This short hand allows us to decompose our data into X = (X obs , X mis ) If we return to equation (1) and incorporate the parameterizations laid out in examples 1 and 2, the decomposition of the joint distribution can be naturally extended to account for missingness. To do so, we rewrite the joint distribution as: where α is the collection of parameters used to estimate the missing portion of X given the observed portion of X, and ρ is the collection of parameters used in describing your missingness mechanism. Finally, in this article we are only interested in the case where X mis follows a MAR pattern, which leads us to our final assumption which completes the basis for our first conclusion (the proof for which is made available in the appendix).
Assumption 3 (Missing at random) R ⊥ ⊥ X mis | X obs , A,Y Theorem 1 Under Assumptions 1-3, τ 0 is identifiable from the observed data.
A convenient consequence of adopting an MAR framework is that because of Assumption 3, any expectations taken with respect to X mis will result in f (R|X, A, Y ; ρ) falling out of the decomposition; therefore for the remainder of this article, unless otherwise noted, we will be suppressing inclusion of an explicit missingness mechanism term and instead using the decomposition:

Treatment Effect Estimation Under MAR
In the presence of missingness, we can let U(τ; We term U(τ; X obs , A,Y |η) the "mean estimating function" of τ 0 given the observed data. Note that in defining U(τ; X obs , A,Y |η),η must now also expand to include any new parametersα utilized in estimating E(X mis | X obs , A,Y ; α). From Theorem 1, under Assumptions 1-3, U(τ; X obs , A,Y |η) is an unbiased estimating function of τ 0 . Therefore, a consistent estimator of τ 0 can be obtained by solving P n U(τ; X obs , A,Y |η) = 0.
Remark 2 (Doubly Robust Estimation Under MAR) As with IPW estimation under fully observed data, IPW estimation under MAR does not need to model the outcome Y , but it does now require correct models for both e(X | θ ) and f (X mis | X obs , A,Y ; α). On the other hand, the AIPW estimate for τ 0 obtained from its the mean estimating function under MAR, needs only one of either e(X | θ ) or µ(a, X | β ) to be correctly specified, provided f (X mis | X obs , A,Y ; α) is correct, to be unbiased. The DR feature of AIPW estimation for treatment effects has been shown extensively in full data situations and more recently in the case where confounders are MAR (Zhang, Liu, Zhang, Tang, and Zhang, 2016).

Fractional Imputation
Estimation of treatment effects under fully observed data is straight-forward; unfortunately fully observed data is rarely encountered in practice. Imputation methods are often used to facilitate estimation in the presence of missing values by completing the partially observed portions of the data and coaxing a "full" dataset out of a partial one. It is important to note, though, imputation methods only provide a means of addressing the missingness and complete the dataset without heed given to effect estimation. Therefore, to examine the consequences of choosing a specific imputation method, we propose a two-stage procedure. In the first stage, referred to as the design stage, we use an imputation technique to fill in the missing covariate values and estimate the propensity scores. In the second stage, referred to as the analysis stage, classical propensity score techniques are applied to estimate the causal parameters. See Rubin (2007), Rubin (2008), and Stuart (2010) for the mention of decomposing causal inference into two different stages. This framework will be used for both FI and MI methods. The former of these methods we discuss in more detail here. For a more detailed examination of MI see Rubin (1987).

Implementing Fractional Imputation
For illustration, consider the case where X contains only two variables, X = (X 1 , X 2 ), where X 1 is fully observed and X 2 is subject to missingness. Let R 2 be the response indicator of X 2 . From Examples 3 and 4, we can obtain an estimator of τ 0 by solving the mean estimating equation In (6), the conditional expectation E{U(τ; X i , A i ,Y i |η) | X obs,i , A i ,Y i } is often difficult to obtain. The basic idea of FI is to overcome this difficulty by creating a weighted set Remark 3 Only records where R 2i = 0 need imputed, so when utilizing FI, only these observations require a weight ω * i j to be calculated in the weighted set {(ω * i j , .., M}. However, implicit in this representation is the generation of weights ω i = 1 for observations where R 2i = 1. While notation for such implicit generation is suppressed in this article, if desired, the weighted set can be rewritten as Such notation may be beneficial if equational symmetry is desired, though the FI process and resulting estimates of τ 0 are unaffected.
Toward that goal of approximating E{U(τ; , notice that the last conditional expectation in (6) can be written as where f (X 1 , X 2 , A,Y |η) is the joint distribution of (X 1 , X 2 , A,Y ) with nuisance parameters η set toη. Furthermore, the joint distribution can be decomposed similar as in (5) to be where we assume f (X 2 | X 1 ;α) and f (Y | X 1 , X 2 , A;β ) be correctly specified as f (X 2 | X 1 ; α) and f (Y | X 1 , X 2 , A; β ), respectively. Under complete response, the maximum likelihood estimator (MLE) of θ can be obtained as a solution to the score equations, where S(θ ; X, A) is the score function of θ and can be written as S(θ ; (Louis, 1982, Pfeffermann, Skinner, Holmes, Goldstein, andRasbash, 1998), which under missingness is rewritten MLE estimates ofα andβ can be obtained similarly to (8) for their respective mean score equations.
Forβ the mean score equation is is the score function of β and can be written as S(β ; To obtain the solution to (6), (8), (9)and (10), the EM algorithm can be applied. To do so using FI, the following process can be implemented: Step 0. Let the initial values for parameters be set to α (0) , β (0) , and θ (0) which are the MLE of α, β , and θ using only complete cases. For each unit i with R 2i = 0, generate M imputed values of X 2i , denoted by x * ( j) 2i ( j = 1, . . . , M), from a proposal distribution h(x 2 ), e.g. f (x 2 | X 1 ; α (0) ).

Remark 4 Recall under MARα can be obtained under only complete cases.
In such case, α (0) =α, and there is no need to update the parameter estimate α (t) each iteration. Additionally, if h(x 2 ) = f (x 2 | X 1 ;α), the calculation of the weight function simplifies to ω * (t) The simplification is not necessary under MAR, but we mention it here for the event when additional computational resource efficiencies are desired for a particular application. Letη = (α,β ,θ ) be the resulting estimates for the nuisance parameters. Note that at each EM iteration, imputed values of X 2 are not changed; only fractional weights are updated for each iteration. The weights ω * i j , obtained at the end of the EM iteration, assigned to imputed values can be called fractional weights. The fractional weight represents a similarity measure between the imputed value and the missing value.
By incorporating these weights, the conditional estimating equation for τ O can be approximated by andτ can be obtained by solving this imputed estimating equation for τ. Here, U(τ; X, A,Y |η) can be either the IPW or AIPW estimating function.

Asymptotic Results
Becauseτ is obtained through the method of estimating equations, we establish the asymptotic properties ofτ, in a manner similar to Robins and Wang (2000).
Theorem 2 Let η = (α, β , θ ) be a vector of the nuisance parameters, and letη = (α,β ,θ ) be the vector of corresponding MLE estimators converging in probability to η 0 = (α 0 , β 0 , θ 0 ), the true values of the nuisance parameters. Under certain regularity conditions, the solutions to (6),τ, is consistent for τ 0 and satisfies Note thatτ andη satisfy U(τ |η) = 0 as M → ∞. By use of Taylor expansions we can study the asymptotic properties ofτ. First, by a Taylor expansion of U(τ |η) aboutη = η 0 we obtain To expressη − η 0 from (11) further, we note that the EM algorithm leads to the MLE of η 0 and thereforeη satisfies which depends onη in two places, namely S(η; X i , A i ,Y i ) and the conditional expectation taken with respect to f (X mis | X obs , A,Y ;η). Applying another Taylor expansion aboutη = η 0 this time in (13) leads to Therefore we can expresŝ Combining (12) and (15), ignoring the small order terms, (11) can be expressed in a linear form: with the l being used to denote the linearization.
Second, note that we now have U(τ |η) = U l (τ | η 0 ) + o p √ n −1 . We apply another Taylor expansion, this time on U l (τ | η 0 ) aboutτ = τ 0 , and we obtain Because E {S obs (η 0 ; X obs , A,Y )} = 0, the first term simplifies as Lastly, if we combine all the results above, we obtain an asymptotic linearization ofτ − τ 0 aŝ with κ and λ defined as in the above theorem. Therefore, the asymptotic variance of √ n(τ − τ 0 ) is which completes the proof.
As a result of Theorem 2, we can obtain a consistance variance estimator of τ. Defineλ andκ as empirical versions of λ and κ, respectively. For exam- Then we can estimate the variance ofτ using a sandwich formula

Variance Estimation
Because of how data is imputed under FI, we can obtain further simplification when using IPW or AIPW. The λ terms cancels since ∂ ∂ τ U(τ 0 ; X, A,Y ) = −1 under either estimator, and it can be shown that the κS obs (η 0 ) term falls out of Ω for observations where X 2 is observed. However, even after these mild simplifications, it is still apparent that exact variance estimates will be difficult to obtain. Only in rare situations will the derivatives of the score functions be anything other than impractical to calculate. Thankfully, Theorem 2 suggests large sample approximation is appropriate, and a bootstrap or jackknife estimator can serve as a more practical alternative.

Simulation Study
In the current causal inference literature, there has not been a side-by-side comparison of FI and MI with respect to how they perform estimating average treatment effects or of their corresponding variance estimates in the same setting. To examine how FI performs compared to MI, we adapt a simulation set up previously used by Lunceford and Davidian (2004). We modify it for our purposes by removing the unmeasured confounders and adding a MAR framework. Utilizing this adapted setup, we examine the bias and variance properties of FI compared to two different MI implementations as well as CC estimation. Estimation of τ 0 under fully observed data is also conducted as a reference point.

Simulation Setup
Let X=(X 1 , X 2 , X 3 ) be our confounders associated with treatment effect where X 3 ∼Bernoulli(0.2) and (X 1 , X 2 ) are multivariate normal N(µ X 3 , Σ X 3 ) conditional on X 3 where Once X has been created, we simulate propensity scores e(X, θ ) = {1+exp(0.3+ 0.2X 1 − 0.1X 2 − 0.1X 3 )} −1 and draw A i from a Bernoulli(e(X i , θ ) distribution. We've chosen θ in such a way so as to ensure the starting propensity scores are well behaved (ie: between 0.1 and 0.9) and would satisfy the sufficient overlap assumption (Assumption 2).
We finally simulate Y = Xβ + A∆ + A * Xξ =−X 1 + X 2 − X 3 + 2A + 0.5A * X 1 + 0.25A * X 2 . The addition of the A * X interaction terms was implemented to better match the simulation data to what is observed in practice. Note that because X 1 and X 2 are mean 0, these terms fall out in expectation; however to approximate the true value of τ 0 , both Y (1) and Y (0) were evaluated for all records, with τ 0 being calculated as τ 0 = n −1 (∑ n i=1 Y (1) − Y (0)). All results concerning bias and coverage were then calculated for each sample relative to this within-sample τ 0 .
Once the complete data was created, we then had to simulate our missingness. We calculated Φ = P(R = 0|A, X 1 , X 3 ,Y ) = [1+exp(0.25+0.25X 1 −0.6X 3 + 0.5A + 0.4Y )] −1 and generated R i from a Bernoulli(Φ i ) distribution. The coefficient values of Φ were selected to approximate a missingness rate of 0.33 while at the same time ensuring sufficient treatment and control counts in both the missing and nonmissing records. Averaging across all simulation datasets, the resulting distribution of missingness (in X 2 ) to Treatment assignment was: After missingness was assigned, a new X * 2 variable was constructed where This created our working data set Z = [X 1 , X * 2 , X 3 , A,Y, R]. Our next step was to impute the missing values of X * 2 . To do sounder FI, we regressed X 2 on X 1 and X 3 using only the complete cases of Z as X 2 = α 0 + α 1 X 1 + α 2 X 3 . From this regression we obtained initial values of α as well as the standard error of X 2 which we will callα and σ (init) X 2 respectively. Usingα we calculate a mean vector X (µ) 2 =α 0 +α 1 X 1 +α 2 X 3 and then generate M=200 imputed values for any records missing X * 2 as X * 2;i j = X (µ) wheret was drawn from a t(4) distribution. We then calculated our h function as h X 4) is the inverse of the density function of a t(4) distribution evaluated at ( * ).
Once our h function was computed, we then could start our FI loop as described in section 3.1 above. To get estimates of β we ran separate regressions for Y depending on the value of A. The imputation loop continued until either 250 iterations were reached or all parameters converged within 1x10 −6 . It took on average 32 iterations for each simulation's FI loop to converge .
The FI loop produced our end weights which we could then use to calculate our IPW and AIPW estimates for τ 0 under FI. We compared these estimates versus CC and MI estimates. For CC, the process was straight forward. For MI we used the mi package. Since there is still debate within the literature about whether or not including a fully observed response variable when imputing partially observed covariates is better and/or appropriate (Moons, Donders, Stijnen, and Harrell, 2006, Sterne, White, Carlin, Spratt, Royston, Kenward, Wood, and Carpenter, 2009, Nguyen, Carlin, and Lee, 2017, we calculated MI results under both scenarios. Multiple imputation with response Y included is labeled as (MI1), and without is labeled as (MI2). Lastly we calculatedτ IPW andτ AIPW estimates utilizing the full data (as if no missingness had been introduced).
This process was run 2000 times. To estimate variances in each iteration for each method, a leave-10-out jackknife was used.

Simulation Results
Once all simulations had been run for all methods, the results were as follows: Similar results were seen for the AIPW estimates: As expected, both MI and FI do better than CC estimators in all regards. It was surprising to see how striking an impact the inclusion/exclusion of the response variable Y in the imputation step had on MI. Comparing FI to the champion of those two approaches (MI1), FI and MI had comparable bias, but FI won out in terms of coverage. The FI coverage is much more in-line with the nominal 95% coverage used in the confidence interval calculation. Multiple imputation (as viewed as MI1) saw over coverage compared to the full data results. This can be attributed to an inflated standard error.

Demonstration of Method Using Smoking Data
To illustrate the application of our method on real data, we turn to a data set from the U.S. National Health and Nutrition Examination Survey. The data set at hand investigates the effect of cigarette smoking on blood lead levels with age, gender, race, education and income used as confounders. Of the confounders only income was subject to missingness at a rate of 8.5% overall (6.0% smokers, 9.2% non-smokers). In the original dataset, missing income values were imputed using mean imputation. As single imputation methods (such as mean imputation) methods are known to be biased under MAR (Council, 2010), we investigate how the estimates change from the original data under MI and FI. CC was also examined for a consistent point of reference.
As in our simulation we will need to model our propensity scores, as well as regress both our Y (lead level) on all confounders as well as our missing confounder (income) on all present confounders. With respect to our regression for income, we built a model of the form X inc = α 0 + α 1 X age + α 2 X male + α edu X edu + α race X race where α edu and X edu represent the parameter estimates and data for the dummy variables that represent the 6 education levels (similarly for α race and X race for its 5 levels of race). The dummy variables for unknown education and other race were excluded as they were, by construction, linearly dependent on the other columns in their group. This regression also gave us an estimate of σ (init) X inc . As before, this was used to create M=200 imputed data values for each of the 285 missing cases drawn as X * inc;i j = X (µ) inc;i +tσ (init) X inc wheret was drawn from a t(4) distribution. h X inc was calculated similarly as before. A leave-10-out jackknife was still used to estimate variances.
The resulting estimates and the accompanying variance estimates for τ are as follows: While we can not know the true values of income from which we could calculate our bias, an examination of resource utilization does prove useful. It is expected that CC would take minimal time since it does not need to attempt any convergence loop. FI takes significantly longer, but we would argue the increase in time is worth the added asymptotic bias advantages. What is conspicuously absent from the results table is any output related to MI. This is because the MI algorithm never converged. The MI process had to be stopped manually without obtaining any results. We believe that if MI was run long enough, results could be obtained, but after eight hours of no convergence, the resource gains were clear.
As to why this resource gain exists for FI over MI, we postulate that since FI does not have to model and redraw from the full distribution of X every iteration, it is able to arrive at estimates of τ 0 and standard errors much more quickly than MI.

Summary and Future Work
Via simulation study and live application we have demonstrated that FI is an effective method for addressing missingness in covariates when estimating average treatment effects under the condition that covariates are MAR. Moreover, we were able to demonstrate FI's superiority over the existing leading methods of MI and the known faulty CC. FI had better bias, variance, and coverage properties than either MI or CC. What is more, we showed that when deployed in a live setting, FI is less resource intensive than MI based methods, most likely due to the lack of need to estimate the full covariate distribution.
With these results in mind, it is worth noting a comment made by Rubin in his 18 year retrospective of his original work intruding MI (Rubin, 1996). As an initial defense to then contemporary critiques of the method, some of which have been cited here already (see Fay (1992) and Meng (1994)), Rubin offered up the response that in cases where randomization validity (ie: actual confidence coverage=nominal interval coverage) is difficult to achieve, statisticians should alternatively seek confidence validity (actual interval coverage ≥ nominal interval coverage) with decisions between competing methods decided by which method has the shortest interval. Near the end of that defense the reader can find the following comment: Of course, if we have a procedure that is confidence valid but not randomization valid, there is hope that a better confidence-valid procedure exists (i.e., one with shorter intervals), which is also randomization valid, but in general this is not achievable.
It is our belief that the results above demonstrate, at least so far as within the realm of estimating population average treatment effects, FI offers the hoped for better procedure Rubin proposed might exist over MI. We would invite other researchers to expand upon these findings and explore the possible benefits FI has in the field of causal inference, particularly as it relates to drawing conclusions from a potentially randomization valid alternative to MI.
In our own future work, we will look to explore FI's impact on calculating causal treatment effect estimates when the partially observed covariate(s) are non-gaussian. While it is known FI can be expanded into the case where multiple covariates are missing and where complex missingness patterns are observed, it is not known how the treatment effects estimated from such imputed data sets will perform. Furthermore, the algorithm developed in Section 3.1 can be easily extended into the MNAR setting by the inclusion of a model for the missingness and replacing the likelihood there with the full likelihood from equation (4). We intend to investigate the extent to which we can relax the MAR assumption for FI and see if consistent and efficient estimators for τ 0 can still be obtained when confounders are MNAR. Finally, we would like to explore FI's potential uses in other causal inference methods for treatment effect estimation beyond weighting methods, particularly as it applies to matching methods.  Our results demonstrate no loss of accuracy between FI and MI1 regardless of the setting of M. There also is no gain in accuracy among any of the methods with increasing M. This matches with existing literature that if only point estimation is of interest then low M settings are sufficient for MI. Our results further suggest FI can also permit low M settings when variance estimation is not of interest. However, all methods perform poorly with respect to coverage until M has at least surpassed 20. As Graham et al. (2007) and von Hippel (2016) suggest, low M values are not sufficient for accurate variance estimation. Moreover, if focusing on MI1, the coverage gains stop somewhere between M = 20 and M = 50 (again, as expected), but the MI1 coverage is still about 2% higher than the nominal coverage.
As for FI, it takes slightly higher M to surpass the coverage potential of MI1. At M = 100, FI coverage was within 1% of nominal coverage with gains still being seen as M increased to the settings used in our simulations. From these results, we conclude M = 200 was a sufficient setting for our simulations though even better coverage may have been attainable with higher M. Furthermore, if computational resources are limited setting M to only 100 may be a passable settingstill superior to MI but not yet reaching approximate asymptotic properties. It is also important to note that these sensitivity results are only confirmatory for our simulation settings and may differ when FI and MI are applied in more complex settings or when there is a higher level of missingness. As such, we recommend plotting coverage curves like these when deploying either method in future applications to validate M has been set sufficiently high in those situations.