Optimal Balancing of Time-Dependent Confounders for Marginal Structural Models

Marginal structural models (MSMs) estimate the causal effect of a time-varying treatment in the presence of time-dependent confounding via weighted regression. The standard approach of using inverse probability of treatment weighting (IPTW) can lead to high-variance estimates due to extreme weights and be sensitive to model misspecification. Various methods have been proposed to partially address this, including truncation and stabilized-IPTW to temper extreme weights and covariate balancing propensity score (CBPS) to address treatment model misspecification. In this paper, we present Kernel Optimal Weighting (KOW), a convex-optimization-based approach that finds weights for fitting the MSM that optimally balance time-dependent confounders while simultaneously controlling for precision, directly addressing the above limitations. KOW directly minimizes the error in estimation due to time-dependent confounding via a new decomposition as a functional. We further extend KOW to control for informative censoring. We evaluate the performance of KOW in a simulation study, comparing it with IPTW, stabilized-IPTW, and CBPS. We demonstrate the use of KOW in studying the effect of treatment initiation on time-to-death among people living with HIV and the effect of negative advertising on elections in the United States.


Introduction
Marginal structural models (MSMs) o↵er a successful way to estimate the causal e↵ect of a time-varying treatment on an outcome of interest from longitudinal data in observational studies (Robins, 2000;Robins et al., 2000). For example, the HIV-Causal Collaboration (2011) used an MSM to estimate the optimal timing of HIV treatment initiation, Hernán et al. (2008) to evaluate the e↵ect of hormone therapy on cardiovascular outcomes, and Blackwell (2013) to evaluate the impact of negative advertising on election outcomes. The increasing popularity of MSMs among applied researchers derives from their ability to control for time-dependent confounders, which are confounders that are both a↵ected by previous treatment and a↵ect future ones. In particular, as shown by Robins et al. (2000) and Blackwell (2013), standard methods, such as regression or matching, fail to control for time-dependent confounding, introducing post-treatment bias. In contrast, MSMs consistently estimate the causal e↵ect of a time-varying treatment via inverse probability of treatment weighting (IPTW), which controls for time-dependent confounding by weighting each subject under study by the inverse of their probability of being treated given covariates, i.e., the propensity score (Rosenbaum and Rubin, 1983), mimicking a sequential randomized trial. In other words, IPTW creates a hypothetical pseudo-population where time-dependent confounders are balanced over time.
Despite their wide use and theoretical appeal, these methods rely considerably on positivity, requiring that, at each time period, the probability of being assigned to the treatment, conditional on the history of treatment and confounders so far, is not 0 or 1 (Robins, 2000). Violations of positivity or even just small nonzero probabilities (small relative to sample size), lead to extreme and unstable weights, which in turn yield very low precision (Kang and Schafer, 2007;Robins et al., 1995;Scharfstein et al., 1999). In addition, MSMs using IPTW are highly sensitive to misspecification of the treatment assignment model, which can lead to biased estimates (Kang and Schafer, 2007;Lefebvre et al., 2008;Cole and Hernán, 2008).
Various statistical methods have been proposed in an attempt to overcome these challenges. To deal with extreme weights, several authors (Cole and Hernán, 2008;Xiao et al., 2013) have suggested truncation, whereby outlying weights are replaced with less extreme ones through clipping. Santacatterina et al. (2018) proposed to use shrinkage instead of truncation as a more direct way to control the bias-variance trade-o↵ it o↵ers. Robins et al. (2000) recommended the use of stabilized-IPTW (sIPTW) where inverse probability weights are normalized by the marginal probability of treatment. To control for misspecification of the treatment assignment model, Imai and Ratkovic (2015) proposed to use the covariate balance propensity score (CBPS), which instead of plugging in a logistic regression estimate of propensity into IPTW finds the logistic model that balances covariates via the generalized method of moments. The method tries to balance the first moment of each covariate even if a logistic model is misspecified (Imai and Ratkovic, 2014). Although these methods either make the treatment model more robust to misspecification or trade accuracy for precision by truncating, normalizing, or shrinking plug-in weights, none of them fully balance covariates optimally to directly remove error due to confounders while simultaneously controlling precision.
In this paper, we present Kernel Optimal Weighting (KOW), which provides weights for fitting an MSM that optimally balance time-dependent confounders while controlling for precision. Specifically, the proposed method directly minimizes the error due to timedependent confounding while penalizing extreme weights. This error is expressed as the sum of discrepancies between the weighted observed data and the counterfactual of interest over all treatment regimes. By decomposing this error as a functional on certain conditional expectation functions, KOW minimizes its operator norm with respect to a reproducing kernel Hilbert spaces (RKHS) while penalizing extreme weights by solving a quadratic optimization problem over weights. This extends the kernel optimal matching method of Kallus (2016) to the longitudinal setting and to dealing with time-dependent confounders, where, similarly to regression and matching, it cannot be applied without introducing posttreatment bias.
The proposed method has several attractive characteristics. First, it minimizes the error due to time-dependent confounding by optimally balancing covariates and simultaneously controls for precision by penalizing the weights, which leads to better accuracy, precision, and total error. In particular, in the simulation study presented in Section 5, we show that the mean squared error (MSE) of the estimated e↵ect of a time-varying treatment obtained by using KOW is lower than that obtained by using IPTW, sIPTW, and CBPS in all considered simulated scenarios. Second, by optimally balancing covariates, KOW mitigates the e↵ects of possible misspecification of the treatment model. In Section 5, we show that KOW is more robust to model misspecification compared with the other methods.
Third, unlike CBPS (Imai and Ratkovic, 2015), KOW can balance non-additive covariate relationships by using kernels, which generalize the structure of conditional expectation functions, and does not restrict weights to follow a fixed logistic (or other parametric) form.
In Section 5, we show how KOW compares favorably with the aforementioned methods in all nonlinear scenarios, and in Section 7.2 we use KOW to balance non-additive covariate relationships estimating the e↵ect of negative advertising on election outcomes. Fourth, KOW can be easily generalized to other settings, such as informative censoring. We do just that in Section 6 and, in Section 7.1, we use this extension to study the e↵ect of human immunodeficiency virus (HIV) treatment on time to death among people living with HIV.
Fifth, di↵erently from Imai and Ratkovic (2015), where the number of covariate balancing conditions grows exponentially in the number of time periods, KOW only needs to minimize a number of imbalances that grows linearly in the number of time periods. This feature leads to a lower computational time of KOW compared with CBPS when the total number of time periods increases, as shown in our simulation study in Section 5.2.3 and in our study on the e↵ect of negative advertising on election outcomes in Section 7.2.
In the next section, we briefly introduce the literature of MSMs (Section 2). In Section 3 we develop and define KOW. We then discuss some practical guidelines on the use of KOW (Section 4). In Section 5 we report the results of a simulation study aimed at comparing KOW with IPTW, sIPTW, and CBPS. In Section 6, we extend KOW to control for informative censoring. We then present two empirical applications of KOW in medicine and political science (Section 7). We o↵er some concluding remarks in Section 8.

Marginal structural models for longitudinal data
In this section, we briefly review MSMs (Robins, 2000;Robins et al., 2000). Suppose we have a simple random sample with replacement of size n from a population. For each unit i = 1, . . . , n and time period t = 1, . . . , T , we observe the binary time-varying treatment variable A it , with A it = 0 meaning not being treated at time t and A it = 1 being treated at time t, and time-dependent confounders X it . We denote by A it = {A i1 , . . . , A it } the treatment history up to time t and by X it = {X i1 , . . . , X it } the history of confounders up to time t. X i1 represents the time-invariant confounders, i.e., confounders that do not depend on past treatments. For each unit i = 1, . . . , n, we denote by Y i the outcome variable observed at the end of the study. Using the potential outcome framework (Imbens and Rubin, 2015), we denote by Y i (a) the potential outcome we would see if we were to apply the treatment regime a 2 A to the i th unit, where A = {0, 1} T is the space of treatment 5 regimes. We impose the assumptions of consistency, non-interference, positivity and sequential ignorability (Imbens and Rubin, 2015;Hernán and Robins, 2010). Consistency and non-interference (also known as SUTVA; Rubin, 1980) can be encapsulated in that the potential outcomes are well-defined and the observed outcome corresponds to the potential outcome of the treatment regime applied to that unit, i.e., Y i = Y i (A i ). As previously introduced, positivity states that, for each time t = 1, . . . , T , the probability of being treated at time t conditioned on the treatment history up to time t 1 and the confounder history up to time t, is not 0 or 1, i.e., (1) Sequential ignorability states that the potential outcome Y i (a) is independent of treatment assignment at time t, given the treatment history up to time t 1 and the confounder history up to time t. Formally, sequential ignorability is defined as An MSM is a model for the marginal causal e↵ect of a time-varying treatment regime on the mean of Y i , that is, where g(a, ) is some known function class parametrized by . For example, a commonly used MSM is based on additive e↵ects with a common coe cient: g(a, ) = 1 + 2 P T t=1 a t , where the parameter 2 is the causal parameter of interest. The standard way to fit an MSM, as introduced by Robins (2000); Robins et al. (2000), first estimates weights that account for the non-randomness of treatment and then uses these to fit a regression on the treatment regime alone using weighted least squares (WLS). The set of inverse probability 6 weights and stabilized weights are defined as follows, where h i (a t ) is a known function of treatment history. The set of inverse probability weights is obtained by setting h i (a t ) = 1, while the set of stabilized inverse probability weights is obtained by setting h i (a t ) = P(A it = a t | A i,t 1 = a t 1 ). To estimate weights of the form of eq. (4), one first estimates the conditional probability models using either parametric methods such as logistic regression or methods based on statistical learning Gruber et al., 2015;Karim and Platt, 2017) and then these estimates are plugged in directly into eq. (4) to derive weights. Stabilized weights seek to attenuate the variability of inverse probability weights by normalizing them by the marginal probability of treatment. Since the additional factor is a function of treatment regime alone, it does not a↵ect the consistency of the WLS if the MSM is well specified. Both sets of weights, however, rely on plugging in an estimate of a probability into the denominator, meaning that when the true probability is even modestly close to 0, any small error in estimating it can translate to very large errors in estimating the weights and to estimated weights that are extremely variable. Di↵erently than inverse probability weights, Imai and Ratkovic (2015) proposed to estimate weights by generalizing the covariate balancing propensity score (CBPS) methodology. Instead of plugging in probability estimates based on logistic regression, CBPS uses the generalized method of moment to find the logistic regression model that if plugged in would lead to weights that approximately solve a subset of the moment conditions that the true inverse probability weights, eq. (4), satisfy.
In all cases, once some appropriate weights W are determined, an MSM is fit by a weighted regression of outcomes on the treatment histories, usually using WLS. For example, to fit additive e↵ects with a common coe cient, one simply minimizes P n i=1 W i (Y i 1 2 P T t=1 A it ) 2 and reports the resulting 2 . Robins (2000); Hernán et al. (2001) then carry out inference on the MSM by constructing a Wald confidence interval using WLS robust (sandwich) standard errors (Freedman, 2006).

Kernel Optimal Weighting
In this paper, di↵erently from IPTW, sIPTW and CBPS, we propose weights that fully optimize balance of time-invariant and time-dependent confounders over all weights to directly address error due to imbalances while controlling precision. We begin by decomposing the weighted average outcome among the observed a-treated units. This new decomposition generalizes the standard cross-sectional decomposition (Kallus, 2016, Thm. 1) to the longitudinal setting with time-dependent confounders. Based on this decomposition, we define imbalance as the sum of absolute discrepancies of empirical means of certain conditional expectation functions between the original sample and the a-treated units in the weighted sample. We then consider the normalized squared worst case imbalance, or equivalently the operator norm of this imbalance expressed as a functional on functions ranging in an RKHS. We finally minimize this quantity while controlling for precision by using quadratic optimization. To fix ideas, we begin by presenting the proposed methodology with two time periods, T = 2.

Defining imbalance
Given some sample weights, we wish to quantify the discrepancy between the weighted observed data and the counterfactuals of interest. For the sake of clarity, our presentation will focus on population averages in decomposing this discrepancy. Toward that end, consider any set of population weights W that, like eq. (4), depend on A T , X T . Assuming 8 consistency and assumptions (1)-(2), for each a = (a 1 , a 2 ) 2 A, we decompose the weighted average outcome among the a-treated units as, A more detailed version of eq. (5) is provided in the appendix. This shows that we can write the di↵erence between the weighted average outcome among the a-treated units, a 2 (W, g a ) arê (2) Based on this, we define imbalance as the sum of absolute empirical discrepancies over all treatment regimes:

Normalized squared worst case imbalance
One way to control imbalance in eq. (8) is by using inverse probability weights, which (if known) would set the population-level di↵erences in eq. (5) to zero. Di↵erently, we find weights that directly minimize the imbalance over all possible realizations of the unknown functions g (1) a and g (2) a on which this quantity depends. Since the imbalance scales homogeneously with these functions, we want to limit their "size," or equivalently minimize the imbalance relative to "size." Let us define an extended seminorm k·k (1) on the space of func- a (X 1 ) and an extended seminorm k · k (2) on the space of functions g a ) and (2) a ) and note that (t) at (W ) depends only on the treatment at time t and not the whole treatment regimen. Then the normalized squared worst case imbalance is defined as The last equality arises from the self-duality of Euclidean norm. What is important to note is that this shows that the imbalances of interest are essentially the same regardless of the particular treatment regime trajectory a. That is, to control the imbalances for all trajectories, at any time point t, we are only concerned with the imbalances of histories A t 1 , X t for those units treated at time t, A t = 1, and for those not, A t = 0. So, while the number of treatment regimes grows exponentially in the number of periods, we need only to keep track of and minimize a number of imbalances growing linearly in the number of periods T . By eliminating each of these linearly-many imbalances, any time-dependent confounding would necessarily be removed, as shown by eq. (5). In Section 5.2.3, we show how this feature also translates to favorable computational time when dealing with many time periods.
Next, we show how to generalize this idea to more than one time periods, letting T > 1 so the space of treatment regimes is A = {0, 1} T . In addition to the definitions in eqs. (6)- Then an immediate generalization of eq. (5) yields that We therefore define imbalance generally as and, the normalized squared worst case imbalance becomes This again highlights that there are only linearly-many imbalances that we need to account for: two for each time period.

Minimizing imbalance while controlling precision
We can obtain minimal imbalance by minimizing B 2 (W ). However, to control for the possibility of extreme weights and to directly address the total estimation error, we propose to regularize the weights in such a way that the precision of the resulting MSM, which involves the variability of the weights, is controlled. We therefore wish to find weights that minimizes B 2 (W ) plus a penalty. Formally, we want to solve where e is the vector of ones and W = {W : W i 0 8i} is the space of nonnegative weights W . The squared distance of the weights from uniform weights here serves as a convex surrogate for the variance of the resulting MSM (assuming homoskedasticity or bounded residual variances) and in eq. (13) can be interpreted as a penalization parameter that controls the trade o↵ between bias and variance. When is equal to zero, the obtained weights provide minimal imbalance. When ! 1, the weights become uniformly distributed leading to an ordinary least squares estimator for the MSM.

Kernels and quadratic optimization to optimally balance timedependent confounders
In this section, we use an RKHS to specify the norm that specified the worst-case discrep- at (W ), which we derived in the previous two sections, and then show how we can use quadratic optimization to obtain the weights that optimize eq. (13). Kernels are widely used in machine learning to generalize the structure of conditional expectation functions with many applications in statistics (Schölkopf and Smola, 2002;Berlinet and Thomas-Agnan, 2011;Kallus, 2016Kallus, , 2018. Any positive semi-definite kernel K : Z ⇥ Z ! R on a ground space Z defines a Hilbert space given by the completion of the span of all K(z, ·) over z 2 Z endowed with the inner product hK(z, ·), K(z 0 , ·)i = K(z, z 0 ). Commonly used kernels are the polynomial, Gaussian, and Matérn kernels (Schölkopf and Smola, 2002).
If k · k (t) is an RKHS norm given by the kernel K t , then we can express (t) at (W ) as a convex-quadratic function in W . Specifically, let us define the matrix K t 2 R n⇥n as ) and note that it is positive semidefinite by definition.
Then, for t = 1, we have that, at is the diagonal matrix with I[A it = a t ] in its i th diagonal entry, and for t 2, More detail on eqs. (14)- (15) is given in the Appendix.
1 , which is given by setting every entry i, j of K t to 0 whenever A it 6 = A jt , and K = P T t=1 K t . We then get that Finally, to obtain weights that optimally balance covariates to control for time-dependent confounding while controlling precision we solve the quadratic optimization problem, where K = K + 2 I, K = K 1 + 2 I. We call this proposed methodology and the result of eq. (17), Kernel Optimal Weighting (KOW).

13
Solutions to the quadratic optimization problem (17) depend on several factors. First, they depend on the choice of the kernel and its hyperparameters. There are some existing practical guidelines on these choices (Schölkopf and Smola, 2002;Rasmussen and Williams, 2006), on which we rely as explained below. Second, they depend on the penalization parameter . Finally, solutions to eq. (17) depend on the chosen set of lagged covariates to include in each kernel. In this section, we introduce some practical guidelines on how to apply KOW in consideration of these factors.
For each t, the unknown function g (t) a (A t 1 , X t ) has two distinct inputs: the treatment history and the confounder history. To reflect this structure, we suggest to specify the kernel and a confounder history kernel K (2) t . This simplifies the process of specifying the kernels. We further suggest that for the treatment history to use a linear kernel involving`lagged treatments, K a s a 0 s , and for the confounder history to use a polynomial kernel involving the time-invariant confounders and lagged time-dependent confounders, K where ✓ > 0 and d 2 N are hyperparameters. We discuss the choice of the number of lags and the hyperparameters below. In our simulation study in Section 5, we show that the MSE of the MSM-estimated e↵ect using KOW with a product of linear kernel and a quadratic kernel (d = 2) outperforms estimates using weights obtained by IPTW, sIPTW and CBPS in all considered simulated scenarios. We again use this choice of kernels in our empirical applications of KOW to real datasets in Section 7. Many other choices of kernel are also possible and may be more appropriate in a particular application, but we suggest the above combination as a generic and successful recipe.
When using kernels, preprocessing the data is an important step. In particular, normalization is employed to avoid unit dependence and covariates with high variance dominating those with smaller ones. Consequently, we suggest, beforehand, to scale the covariates related to the treatment and confounder histories to have mean 0 and variance 1.
To tune the kernels' hyperparameters and the penalization parameter , we follow Kallus (2016) and use the empirical Bayes approach of marginal likelihood (Rasmussen and Williams, 2006). We postulate a Gaussian process prior g (t) ⇠ GP(c t 1, K t (✓)), where c t 1 is a constant function and K t (✓ t ) is a kernel that depends on some set of hyperparameters ✓ t . For each t, we then maximize the marginal likelihood of seeing the data It would be more correct to consider the marginal likelihood of observing the partial means of outcomes, but we find that this much simpler approach su ces for learning the right representation of the data (✓ t ) and the right penalization parameter ( ) and it enables the use of existing packages such as GPML (Rasmussen and Nickisch, 2010). We demonstrate this in the simulations presented in Section 5, and in particular in Figures 3 and 4 we see that this approach leads to a value of the penalization parameter that is near that which minimizes the resulting MSE of the MSM over possible parameters.
Another practical concern is how many lagged covariates to include in each of the kernels K t . When deriving inverse probability weights, it is common to model the denominator in eq. (4) by fitting a pooled logistic model (D'Agostino et al., 1990) including only the time-invariant confounders, X 1 , the time-dependent confounders at time t, X t , and the one-time lagged treatment history, A t 1 , rather than the entire histories, i.e., logit P( (Hernán et al., 2001(Hernán et al., , 2002. This can be understood as a certain Markovian assumption about the data generating process which simplifies the modeling when T is large. The same can be done in the case of KOW, where we may assume that g (t) a is only a function of the one-time lagged treatment, the time-dependent counfounders at time t, and the time-invariant confounders, a (A t 1 , X 1 , X t ), and correspondingly let the kernel K t only depend on A t 1 , X 1 , and X t . More generally, we can consider including any amount of lagged variables, as represented by the parameter`in the above specification of the linear and polynomial kernels. In Section 7.2, we consider an empirical setting where T is small and specify the kernels using the whole treatment and confounders histories (`= T ). However, in Section 7.1 we consider a setting where T is large and, following previous approaches studying the same dataset using IPTW with a logistic model of only the one-time lags (Hernán et al., 2000(Hernán et al., , 2001(Hernán et al., , 2002, we keep only the baseline and one-time-lagged data in each kernel specification (`= 1).
Certain datasets, such as the one we study in Section 7.1, have repeated observations of outcomes at each time t = 1, . . . , T . Thus, for each subject, we have T observations to be used to fit the MSM. Correspondingly, each observation should be weighted appropriately.
This can be seen as T instances of the weighting problem. For (s)IPTW, this boils down to restricting the products in the numerator and denominator of eq. (4) to be only up to t for each t = 1, . . . , T . Similarly, in the case of KOW, we propose to solve eq. (17) for each value of t = 1, . . . , T , producing n ⇥ T weights, one for each of the outcome observations, to be used in fitting the MSM. This is demonstrated in Section 7.1.
In the case of a single, final observation of outcome, normalizing the weights, whether IPTW or KOW, does not a↵ect the fitted MSM as it amounts to multiplying the leastsquares loss by a constant factor. But in the repeated observation setting described above, normalizing each set of weights for each time period separately can help. Correspondingly, we can add a constraint to eq. (17) that the mean of the weights must equal one for each time period separately, which we demonstrate in Section 7.1.

16
In this section, we show the results of a simulation study aimed at comparing the bias and MSE of estimating the cumulative e↵ect of a time-varying treatment on a continuous outcome by using an MSM with weights obtained by each of KOW, IPTW, sIPTW, and CBPS.

Setup
We considered two di↵erent simulated scenarios with T = 3 time periods, (1) linear, where the treatment was modeled linearly, and (2) nonlinear, where it was modeled quadratically.
In both scenarios, we modeled the outcomes nonlinearly so as not to favor our method unfairly. We tuned the kernel's hyperparameters and the penalization parameter as presented in Section 4 and computed bias and MSE over 1,000 replications for each of varying sample sizes, n = 100, . . . , 1,000. In addition, to study the impact of the penalization parameter on bias and MSE, in both scenarios, we fixed the sample size to n = 500 and considered a grid of 25 values for .
For the nonlinear scenario, we drew the data from following model: k = 1, 2, 3. The intercepts 1.91 and 21.46 are chosen so the marginal mean of Y i is 0.
In each scenario and for each replication, we computed two sets of KOW weights. We obtain the first by using the product of two linear kernels (K 1 ), one for the treatment history and one for the confounder history, and the second by using the product of a linear kernel for the treatment history and a quadratic kernel for the confounder history (K 2 ).
As presented in Section 4, we rescaled the variables before inputting them to the kernel and, for each replication, we tuned and the kernels' hyperparameters by using Gaussianprocess marginal likelihood. We also computed two sets of IPTW and sIPTW weights.
We obtained the first by fitting a logistic regression for treatment using the interaction of treatment and confounder histories (i.e., A i,t 1 ⇤X i,t in R notation, which is well-specified for i,t ; we term this the linear specification) and the second by adding quadratic confounder terms (i.e., A i,t 1 ⇤ X i,t ⇤ X i,t in R notation, which is well-specified for ⇡ (NL) i,t ; we term this the non-linear specification). The numerator of sIPTW in either case was obtained by fitting a logistic regression on the treatment history alone. Finally, we compute two sets of weights using CBPS: one using the covariates as they are (linear CBPS) and one by augmenting the covariates with all quadratic monomials (non-linear CBPS). We used the full (non-approximate) version of CBPS.
We computed the causal parameter of interest by using WLS, regressing the outcome on the cumulative treatment and using weights computed by each of the methods. Specifically, in the linear scenario, we computed weights using (1) K 1 for KOW, the linear specification for IPTW and sIPTW, and linear CBPS, which we refer to as the correct case, and (2) K 2 for KOW, the nonlinear specification for IPTW and sIPTW, and the nonlinear CBPS, which we refer to as the overspecified case. In the nonlinear scenario, we again used each of the above, but refer to the first as the misspecified case and the second as the correct case. We highlight that these terms may only reflect the model specification for IPTW and sIPTW, as CBPS does not require a particular specification and the function g (t) a need not necessarily be in the RKHS that either kernel specify.
We used Gurobi and its R interface (Gurobi Optimization, 2018) to solve eq. (13) and optimize the KOW weights, the MatLab package GPML (Rasmussen and Nickisch, 2010) to perform the marginal likelihood estimation of hyperparameters, the R package R.matlab to call MatLab from within R, the R command glm to fit treatment models for IPTW and sIPTW, the R package CBMSM for CBPS, and the R command lm to fit the MSM.

Results
In this section we discuss the results obtained in the simulation study across sample sizes and across values of the penalization parameter, . In summary, the proposed KOW outperformed IPTW, sIPTW and CBPS with respect to MSE across all sample sizes and simulation scenarios. An important result is that, in the misspecified case, KOW showed a lower bias and MSE than that of IPTW, sIPTW and CBPS across all considered sample sizes.  Figure 2). In summary, the MSE obtained by using KOW was lower than that of IPTW, sIPTW and CBPS across all considered sample sizes.

Across sample sizes
As the next section shows, the larger biases in some of the cases are driven by the choice of penalization parameter . Here we choose with an eye toward minimizing MSE. A smaller , it is shown next, can lead to KOW having both smaller bias and MSE than other methods, but the total benefit in MSE is smaller.

Figures 3 and 4 show the ratios of squared biases (left panels) and of MSEs (right panels)
when comparing KOW with IPTW (solid), sIPTW (dashed) and CBPS (dotted) across di↵erent values of and n = 500 in the linear and nonlinear scenarios, respectively. Values above 1 means that KOW had a lower bias or MSE. For zero or small , KOW significantly outperformed IPTW, sIPTW and CBPS with respect to bias. In many cases, the MSE was also smaller for zero . But, the biggest benefit in MSE was seen for larger . The peaks of the right panels represent the points for which is optimal, i.e., the MSE of KOW is minimized. The solid vertical lines on the right panels show the mean values across replications of the value obtained by the procedure described in Section 4 and 5.1 as done in the previous section. It can be seen that these are very near the points at which the MSE is minimized. The benefit in MSE both at and around this point was significant across all scenarios.   Vertical bars show the mean value of = 0, . . . , 3,000, across simulations, obtained as described in Section 5.2.1. T and linearly in the number of covariates. Imai and Ratkovic (2015) also propose using an approximate low-rank matrix that ignores certain covariance terms to make the matrix inversion faster.

Computational time of KOW
Here we compare KOW, CBPS with full covariance matrix (CBPS-full), and CBPS with its low-rank approximation (CBPS-approx) when increasing the number of time periods and the number of covariates. Specifically, following the linear-correct scenario presented in Section 5.1, we fixed the sample size equal to n = 100 and randomly generated 100 samples considering T = 3, . . . , 10, and p = 3, . . . , 8, where p is the total number of covariates X t for each t. We fixed the number of covariates to be equal to p = 3 when evaluating the mean computational times over time periods, while we fixed the number of time periods to be equal to T = 5 when analyzing over the number of covariates. For each sample, we computed the KOW weights by solving eq. (17) using kernel K 1 . We used Gaussian process marginal likelihood to tune the kernels' hyperparameters and penalization parameter. We computed CBPS weights using the linear CBPS as in Section 5.1. We used the R package rbenchmark to compute the computational time on a PC with an i7-3770 processor, 3.4 GHz, 8GB RAM and a Linux Ubuntu 16.04 operating system. Solid lines of Figure 5 represent mean computational times for KOW, dashed for CBPSfull, and dotted for CBPS-approx. When the number of time periods was relatively small, the mean computational time of KOW was higher compared with both CBPS methods (left panel of Figure 5). However, the mean computation time of KOW over time periods increased linearly while that of both CBPS methods increased exponentially. This is due to the fact that, as presented in Section 3.1, the number of imbalances that we need to minimize grows linearly in the number of time periods. The mean computational time required by KOW when increasing the number of covariates remained constant, while it increased for both CBPS-full and CBPS-approx, with CBPS-full increasing more rapidly.
In summary, KOW was less a↵ected by the total number of time periods and covariates compared with CBPS with full and low-rank approximation matrix.
Computing KOW required three steps: tuning the parameters, constructing the matrices for problem (17), and solving problem (17). On average, for T = 3, the first step required 21% of the total computational time, the second 78.8%, and the last 0.2%. Thus, solving the optimization problem itself is very fast and is not the bottleneck.

KOW with informative censoring
In longitudinal studies, participants may drop out the study before the end of the follow-up time. When these losses are due to reasons related to the study, selection bias is introduced.
This phenomenon is referred to as informative censoring and it is common in the context of survival analysis where the interest is on analyzing time-to-event outcomes. Under the assumptions of consistency, positivity, and sequential ignorability of both treatment and censoring, Robins et al. (1999) showed that a consistent estimate of the causal e↵ect of a time-varying treatment can be obtained by weighting each subject i = 1, . . . , n at each time period by the product of inverse probability of treatment and censoring weights.
Inverse probability of treatment weights are obtained as presented in Section 2, while inverse probability of censoring weights are usually obtained by inverting the probability of being uncensored at time t, given the treatment history up to time t 1 and the confounder history up to time t (Hernán et al., 2001).
In this section we extend KOW to similarly handle informative censoring. We demonstrate that under sequentially ignorable censoring, minimizing the very same discrepancies as before at each time period but restricted to the units for which data is available actually controls for both the error in estimation due to time-dependent confounding as well as due to informative censoring. Thus, KOW naturally extends to the setting with informative censoring.
Let C it 2 {0, 1} for t = 1, . . . , T indicate whether unit i is censored in time period t and let C i0 = 0. Note that C it = 1 implies that C i,t+1 = 1 and that C it = 0 implies that C i,t 1 = 0. All we require is that we (at least) observe outcomes Y i whenever C iT = 0, X it whenever C i,t 1 = 0, and A it whenever C it = 0. Note we might observe more, such as the treatment at time t for a unit with C i,t 1 = 0, or perhaps only some of the data after censoring is corrupted, but that is not required. We summarize the assumption of sequentially ignorable censoring as Under this assumption along with consistency and assumptions (1)-(2), we can now decompose the di↵erence between the weighted average uncensored outcomes among the a-treated units and the average potential outcome of a. Define for t 2 (1) We then show in the appendix that Note that, aside from the function g a , all the discrepancies in eq. (19) involve only observed, uncensored quantities. Thus, we can write the di↵erence between the weighted average outcome among the uncensored a-treated units, which are the only units where we observe the outcome of a, and the average potential outcome of a among everyone, as the sum of discrepancies in uncensored treatment and confounder histories.
We therefore define the imbalance IMB(W ; (g (t) a ) t2{1,2},a2A ) and the normalized squared worst case imbalance B 2 (W ) just as before in eqs. (8) and (9) but using the above new definitions forˆ (t) at . Finally, again using kernels to specify norms, we obtain weights that optimally balance covariates to control for time-dependent confounding and account for informative censoring while controlling precision by solving the quadratic optimization problem, where K = K + 2 I, , and e is the vector of all ones.

29
In this section, we present two empirical applications of KOW. In the first, we estimate the e↵ect of treatment initiation on time to death among people living with HIV (PLWH). In the second, we evaluate the impact of negative advertising on election outcomes.

The e↵ect of HIV treatment on time to death
In this section, we analyze data from the Multicenter AIDS Cohort Study (MACS) to study the e↵ect of the initiation time of treatment on time to death among PLWH. Indeed, due to the longitudinal nature of HIV treatment and the presence of time-dependent confounding, MSMs have been widely used to study causal e↵ects in this domain (Hernán et al., 2000(Hernán et al., , 2001HIV-Causal Collaboration et al., 2010;HIV-Causal Collaboration, 2011;Lodi et al., 2017, among others). As an example of time-dependent confounding, CD4 cell count, a measurement used to monitor immune defenses in PLWH and to make clinical decisions, is a predictor of both treatment initiation and survival, as well as being itself influenced by prior treatments. Recognizing the censoring in the MACS data, Hernán et al. (2000) showed how to estimate the parameters of the MSM by inverse probability of treatment and censoring weighting (IPTCW).
Here, we apply KOW as proposed in Section 6 to handle both time-dependent confounding and informative censoring while controlling precision. We considered the following potential time-dependent confounders associated with the e↵ect of treatment initiation and the risk of death: CD4 cell count, white blood cell count, red blood cell count, and platelets.
We also identified the age at baseline as a potential time-invariant confounding factor. We considered only recently developed HIV treatments, thus, including in the analysis only PLWH that started treatment after 2001. The final sample was comprised of a total of n = 344 people and 760 visits, with a maximum of T = 8 visits per person. We considered two sets of KOW weights, either obtained by using a product of (1) two linear kernels, one for the treatment history and one for the confounder history (K 1 ) or (2) a linear kernel for the treatment history and a polynomial kernel of degree 2 for the confounder history (K 2 ). We scaled the covariates related to the treatment and confounder history, and tuned the kernels' hyperparameters and the penalization parameter by using Gaussian processes marginal likelihood as presented in Section 4. Following previous approaches studying the HIV treatment using IPTCW that modeled treatment and censoring using single time lags (Hernán et al., 2000(Hernán et al., , 2001(Hernán et al., , 2002, we included in each kernel the time-invariant confounders, the previous treatment, A t 1 , and the time-dependent confounders at time t, X t , instead of the entire histories. As described in Section 4, since we have repeated observations of outcomes, we compute a set of KOW weights by solving the optimization problem (21) for each horizon up to T . In addition, as described in Section 4, we constrained the mean of the weights to be equal to one.
We compared the results obtained by KOW with those from IPTCW and stabilized-IPTCW (sIPTCW). The latter sets of weights were obtained by using a logistic regression on the treatment history and the aforementioned time-invariant and time-dependent confounders and using only one time lag for each of the treatment and time-dependent confounders as done in previous approaches studying the HIV treatment using IPTCW (Hernán et al., 2000(Hernán et al., , 2001(Hernán et al., , 2002. The numerator of sIPTCW was computed by modeling h(A t ) in eq. (4) with a logistic regression on the treatment history only using one time lag. We modeled the inverse probability of censoring weights similarly. The final sets of IPTCW and sIPTCW weights were obtained by multiplying inverse probability of treatment and censoring weights. We did not compare the results with those of CBPS because it does not handle informative censoring. In particular, CBPS requires a complete n ⇥ T We estimated the hazard ratio of the risk of death by using a weighted Cox regression model (Hernán et al., 2000) weighted by KOW, IPTCW, or sIPTCW and using robust standard errors (Freedman, 2006). We used Gurobi and its R interface to solve eq. (21) and obtain the KOW weights, the Matlab package GPML to perform the marginal likelihood estimation of hyperparameters, the R package R.matlab to call MatLab from within R, the R package ipw (van der Wal et al., 2011) to fit the treatment models for IPTCW and sIPTCW, and the R command coxph (with robust variance estimation) to fit the outcome model. Table 1 summarizes the result of our analysis. Both KOW (K 1 ) and (K 2 ) showed a significant protective e↵ect of HIV treatment on time to death among PLWH.
IPTCW showed a similar e↵ect but with lower precision, resulting in a non-significant e↵ect. With similar precision obtained by KOW, sIPTCW showed a non-significant e↵ect of HIV treatment on time to death. It took 13.5 seconds to obtain a solution for KOW.

The impact of negative advertising on election outcomes
In this section, we analyze a subset of the dataset from Blackwell (2013) to estimate the impact of negative advertising on election outcomes. Because of the dynamic and longitudinal nature of the problem and presence of time-dependent confounders, MSMs have been used previously used to study the question (Blackwell, 2013;Imai and Ratkovic, 2015). Specifically, poll numbers are time-dependent confounders as they might both be a↵ected by negative advertising and might also a↵ect future poll numbers. We follow Imai and Ratkovic (2015) in constructing the subset of the data from Blackwell (2013) We estimated the parameters of two MSMs, the first having separate coe cients for the treatment in each time period and the second having one coe cient for the cumulative e↵ect of negative advertising. Each MSM was fit using weights given by each of KOW, IPTW, sIPTW, and CBPS (both full and approximate). We used the following time-dependent confounders: Democratic share of the polls, proportion of undecided voters, and campaign length. We also used the following time-invariant confounders: baseline Democratic vote share, proportion of undecided voters, status of incumbency, election year and type of o ce.
We obtained two sets of KOW weights by using a product of (1) two linear kernels, one for the history of negative advertising and one for the confounder history (K 1 ) and (2) a linear kernel for the history of negative advertising and a polynomial kernel of degree 2 for the confounder history (K 2 ). The kernels were over the complete confounder history up to time t, X t , and two time-lags of treatment history, A t 1 , A t 2 . We scaled the covariates and tuned the kernels' hyperparameters and the penalization parameter by using Gaussian 33 processes marginal likelihood. We obtained the final set of KOW weights by solving eq. (17).
We compared the results obtained by KOW with those from IPTW, sIPTW, CBPS-full, and CBPS-approx. To obtain the sets of IPTW, sIPTW, and CBPS weights, we used logistic models conditioned on the confounder history and two time-lags from the treatment history (as done in Imai and Ratkovic, 2015). To compute the numerator of sIPTW weights, we used a logistic regression conditioned only on two time-lags from the treatment history. We used Gurobi and its R interface to solve eq. (21) and obtain the KOW weights, the Matlab package GPML to perform the marginal likelihood estimation of hyperparameters, the R package R.matlab to call MatLab from within R, the R command glm to fit the treatment models for IPTW and sIPTW, the R package CBMSM for CBPS, the R command lm to fit the outcome model, and the R package sandwich to estimate robust standard errors. Table 2 summarizes the results of our analysis, reporting robust standard errors (Freedman, 2006). The first six rows of Table 2 show the e↵ect of the time-specific negative advertising. The last two rows present the e↵ect of the cumulative e↵ect of negative advertising. KOW (K 1 and K 2 ) and IPTW showed similar e↵ects, with increased precision when using KOW except for time 4, in which both methods showed a significant negative e↵ect but with greater precision when using IPTW. sIPTW, CBPS-full and CBPS-approx showed a significant negative e↵ect at time 3 with similar precision. No significant results were obtained when considering the cumulative e↵ect of negative advertising. All except sIPTW, showed a negative cumulative e↵ect. KOW (K 1 ) was the most precise. The computational time to obtain a solution was equal to 12.6 seconds for KOW, while it was equal to 104 seconds for CBPS-full and 3.8 seconds for CBPS-approx.

Conclusions
In this paper we presented KOW, which optimally finds weights for fitting an MSM with the aim of balancing time-dependent confounders while controlling for precision. That KOW uses mathematical optimization to directly and fully balance covariates to minimize error due to imbalances as well as optimize precision explains the better performance of KOW over IPTW, sIPTW and CBPS observed in our simulation study. In addition, as shown in Section 5 and Section 6, the proposed methodology mitigates the possible misspecification of the treatment assignment model, allow balancing non-additive covariate relationships, and can be extended to control for informative censoring and missing data, which is a common feature of longitudinal studies.
Alternative formulations of eq. (13) may be investigated. For example, additional linear constraints can be added to the optimization problem, as shown in the empirical application of Section 7.1, and di↵erent penalties can be considered to control for extreme weights.
For instance, in eq. (13), at the cost of no longer being able to use convex-quadratic optimization, one may directly penalize the covariance matrix of the weighted least-square estimator rather than use a convex-quadratic surrogate as we do.
One may also change the nature of precision control. Here, we suggested penalization in an attempt to target total error. Alternatively, similar to Santacatterina and Bottai (2017), we may reformulate eq. (13) as a constrained optimization problem where the precision of the resulting estimator is constrained by an upper bound ⇠, thus seeking to minimize imbalances subject to having a bounded precision. In our convex formulation, the two are equivalent by Lagrangian duality in that for every precision penalization there is an equivalent precision bound ⇠. However, it may make specifying the parameters easier depending on the application as it may be easier for a practitioner to conceive of 36 desirable a bound on precision. There may also be other ways to choose the penalization parameter. Here we suggested using maximum marginal likelihood but cross validation based on predicting outcomes and their partial means may also be possible.
The flexibility of our approach is that any of these changes amount to simply modifying the optimization problem that is fed to o↵-the-shelf solvers. Indeed, we were able to extend KOW from the standard longitudinal setting to also handle both repeated observations of outcomes and informative censoring. In addition to o↵ering flexibility, the optimization approach we took, which directly and fully minimized our error objective phrased in terms of covariate imbalances, was able to o↵er improvements on the state of the art.