Abstract
Marginal structural models (MSMs) can be used to estimate the causal effect of a potentially time-varying treatment in the presence of time-dependent confounding via weighted regression. The standard approach of using inverse probability of treatment weighting (IPTW) can be sensitive to model misspecification and lead to high-variance estimates due to extreme weights. Various methods have been proposed to partially address this, including covariate balancing propensity score (CBPS) to mitigate treatment model misspecification, and truncation and stabilized-IPTW (sIPTW) to temper extreme weights. In this article, we present kernel optimal weighting (KOW), a convex-optimization-based approach that finds weights for fitting the MSMs that flexibly balance time-dependent confounders while simultaneously penalizing extreme weights, directly addressing the above limitations. We further extend KOW to control for informative censoring. We evaluate the performance of KOW in a simulation study, comparing it with IPTW, sIPTW, and CBPS. We demonstrate the use of KOW in studying the effect of treatment initiation on time-to-death among people living with human immunodeficiency virus and the effect of negative advertising on elections in the United States.
1 Introduction
Marginal structural models (MSMs) offer a successful way to estimate the causal effect of a time-varying treatment on an outcome of interest from longitudinal data in observational studies [1,2]. For example, they have been used to estimate the optimal timing of human immunodeficiency virus (HIV) treatment initiation [3], to evaluate the effect of hormone therapy on cardiovascular outcomes [4], and to evaluate the impact of negative advertising on election outcomes [5]. The increasing popularity of MSMs among applied researchers derives from their ability to control for time-dependent confounders, which are confounders that are affected by previous treatments and affect future ones. In particular, as shown by Robins et al. [2] and Blackwell [5], standard methods, such as regression or matching, fail to control for time-dependent confounding, introducing post-treatment bias. In contrast, MSMs consistently estimate the causal effect of a time-varying treatment via inverse probability of treatment weighting (IPTW), which controls for time-dependent confounding by weighting each subject under study by the inverse of their probability of being treated given covariates, i.e., the propensity score [6], mimicking a sequential randomized trial. In other words, IPTW creates a hypothetical pseudo-population where time-dependent confounders are balanced over time.
Despite their wide range of applications, the use of these methods in observational studies may be jeopardized by their considerable dependence on positivity. This assumption requires that, at each time period, the probability of being assigned to the treatment, conditional on the history of treatment and confounders, is not 0 or 1 [1]. Even if positivity holds theoretically in the population, it can easily be violated in practice when propensities are close to 0 or 1, i.e., when some combinations of confounders and treatment are rare in the observed data [7]. Practical positivity violations lead to extreme and unstable weights, which in turn yield very low precision and misleading inferences [8,9,10]. This is a particular concern with longitudinal data, since IPTW requires conditioning on past treatment sequences in addition to time-dependent confounders, and thus runs a greater risk of near-zero denominators. In addition, MSMs using IPTW are highly sensitive to misspecification of the treatment assignment model, which can lead to biased estimates [8,11,12].
Various statistical methods have been proposed in an attempt to overcome these challenges. To deal with extreme weights, several authors [12,13] have suggested truncation, whereby outlying weights are replaced with less extreme ones. Santacatterina et al. [14] proposed to use shrinkage instead of truncation as a more direct way to control the bias-variance trade-off. Robins et al. [2] recommended the use of stabilized-IPTW (sIPTW), where inverse probability weights are normalized by the marginal probability of treatment. To mitigate misspecification of the treatment assignment model, Imai and Ratkovic [15] proposed to use the covariate balancing propensity score (CBPS), which, instead of plugging a logistic regression estimate of the propensity into IPTW, finds the logistic model that balances covariates via the generalized method of moments. The method tries to balance the first, and possibly higher, moments of each covariate even if the logistic model is misspecified [16]. The authors also provided balancing conditions for longitudinal studies. Yiu and Su [17] proposed a joint calibration approach to covariate balancing weight estimation for MSMs. The method aims to compute weights that jointly eliminate covariate associations with both treatment assignment and censoring processes. Other methods have been proposed in the literature to manage the trade-off between balance and precision [18,19,20, among others], and others make use of reproducing kernel Hilbert spaces (RKHSs) [21,22,23, among others]. These methods, however, do not directly deal with time-dependent confounders and informative censoring.
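To make the weighting schemes reviewed above concrete, the following minimal Python sketch computes an unstabilized and a stabilized inverse probability weight for one subject and truncates a vector of weights at chosen percentiles. The function names, the nearest-rank percentile rule, and the assumption that fitted probabilities of the observed treatment at each time period are already available are illustrative choices, not taken from the cited works.

```python
import math


def iptw(denom_probs):
    """Unstabilized weight for one subject.

    denom_probs: estimated probabilities of the treatment actually received
    at each time period, conditional on treatment and confounder history.
    """
    return 1.0 / math.prod(denom_probs)


def siptw(numer_probs, denom_probs):
    """Stabilized weight for one subject.

    numer_probs: marginal probabilities of the observed treatment at each
    time period (assumed to be supplied by the caller).
    """
    return math.prod(numer_probs) / math.prod(denom_probs)


def truncate(weights, lower_pct=1.0, upper_pct=99.0):
    """Replace weights outside the given percentiles with the percentile values."""
    ordered = sorted(weights)

    def percentile(p):
        # Nearest-rank percentile; one of several common conventions.
        rank = math.ceil(p * len(ordered) / 100.0)
        return ordered[max(0, min(len(ordered) - 1, rank - 1))]

    lo, hi = percentile(lower_pct), percentile(upper_pct)
    return [min(max(w, lo), hi) for w in weights]
```

A propensity close to 0 in any single period inflates the product in the denominator of `iptw`, which is the mechanism behind the extreme weights discussed above; `truncate` caps such weights at the cost of some bias.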
In this article, we present and apply kernel optimal weighting (KOW), which provides weights for fitting an MSM that balance time-dependent confounders while controlling for precision. Specifically, by solving a quadratic optimization problem over weights, the proposed method directly minimizes imbalance, defined as the sum of discrepancies between the weighted observed data and the counterfactual of interest over all treatment regimes, while penalizing extreme weights.
This extends the kernel optimal matching method of Kallus [24] and Kallus et al. [25] to the longitudinal setting with time-dependent confounders, where the original method, like regression and matching, cannot be applied without introducing post-treatment bias.
The proposed method has several attractive characteristics. First, KOW can balance non-additive covariate relationships by using kernels, which generalize the structure of conditional expectation functions, and does not restrict weights to follow a fixed logistic (or other parametric) form. By doing so, KOW mitigates the effects of possible misspecification of the treatment model. In the simulation study presented in Section 5, we show that KOW is more robust to model misspecification compared with the other methods. In Section 5, we also show how KOW compares favorably with the aforementioned methods in all nonlinear scenarios, and in Section 7.2 we use KOW to balance non-additive covariate relationships estimating the effect of negative advertising on election outcomes. Second, by balancing time-dependent confounders using kernels while penalizing extreme weights, KOW leads to better accuracy, precision, and total error. In Section 5, we show that the mean squared error (MSE) of the estimated effect of a time-varying treatment obtained by using KOW is lower than that obtained by using IPTW, sIPTW, and CBPS in all considered simulated scenarios. Third, differently from ref. [15], where the number of covariate-balancing conditions grows exponentially in the number of time periods, KOW only needs to minimize a number of discrepancies that grows linearly in the number of time periods. This feature leads to a lower computational time of KOW compared with CBPS when the total number of time periods increases, as shown in our simulation study in Section 5.5 and in our study on the effect of negative advertising on election outcomes in Section 7.2. Fourth, KOW can be easily generalized to other settings, such as informative censoring. We do just that in Section 6, and in Section 7.1, we use this extension to study the effect of HIV treatment on time to death among people living with HIV (PLWH). 
Finally, KOW can be solved by using off-the-shelf solvers for quadratic optimization.
In the next section, we briefly review the literature on MSMs (Section 2). In Section 3, we develop and define KOW. We then discuss some practical guidelines on the use of KOW (Section 4). In Section 5, we report the results of a simulation study aimed at comparing KOW with IPTW, sIPTW, and CBPS. In Section 6, we extend KOW to control for informative censoring. We then present two empirical applications of KOW in medicine and political science (Section 7). We offer some concluding remarks in Section 8.
2 MSMs for longitudinal data
In this section, we briefly review MSMs [1,2]. Suppose we have a simple random sample with replacement of size
We impose the assumptions of consistency, non-interference, positivity, and sequential ignorability [26,27]. Consistency and non-interference [also known as SUTVA; 28] can be encapsulated in that the potential outcomes are well-defined and the observed outcome corresponds to the potential outcome of the treatment regime applied to that unit, i.e.,
Sequential ignorability states that the potential outcome
An MSM is a model for the marginal causal effect of a time-varying treatment regime on the mean of
where
where
To overcome this issue, Imai and Ratkovic [15] proposed to estimate weights of the form of equation (4) that improve balance of confounders by generalizing the covariate balancing propensity score (CBPS) methodology. Instead of plugging in probability estimates based on logistic regression, CBPS uses the generalized method of moments to find the logistic regression model that if plugged in would lead to weights,
In contrast to IPTW, sIPTW, and CBPS, in the next section we characterize imbalance as the discrepancies in observed average outcomes due to confounding, consider their worst case values, and use quadratic optimization to obtain weights that directly optimize the balance of time-invariant and time-dependent confounders over all possible weights while controlling precision.
3 KOW
In this section, we present a convex-optimization-based approach that obtains weights that minimize the imbalance due to time-dependent confounding (i.e., maximize balance thereof) while controlling precision. Toward that end, in Section 3.1, we provide a definition of imbalance. Specifically, we define imbalance as the sum of discrepancies between the weighted observed data and the unobserved counterfactual of interest over all treatment regimes. Since this imbalance depends on unknown functions, in Section 3.2 we consider the worst case imbalance, which guards against all possible realizations of the unknown functions. We also show that the worst case imbalance has the attractive characteristic that the number of discrepancies considered grows linearly in the number of time periods and not exponentially like the number of treatment regimes. We finally show how to minimize this quantity while controlling precision using kernels, RKHS and off-the-shelf solvers for quadratic optimization (Sections 3.3 and 3.4).
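As a rough illustration of the kind of problem this section leads to, the sketch below minimizes a generic penalized convex quadratic objective over nonnegative weights whose mean is one. Everything specific here is an assumption made for illustration: the matrix `A` and vector `b` are placeholders for the kernel-based quantities developed in Sections 3.2–3.4 (they are not the matrices of the actual optimization problem), the constraint set is one plausible choice, and the hand-written projected-gradient loop merely stands in for a dedicated off-the-shelf quadratic optimization solver.

```python
import numpy as np


def project_to_scaled_simplex(v, total):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = total}."""
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - total
    ranks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ranks > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def objective(A, b, lam, w):
    """Placeholder imbalance term w'Aw - 2b'w plus penalty lam * w'w."""
    return float(w @ A @ w - 2.0 * b @ w + lam * w @ w)


def solve_weights(A, b, lam, n_iter=5000):
    """Projected gradient descent for the penalized quadratic problem.

    A is assumed symmetric positive semidefinite and lam > 0, so the
    objective is strongly convex and the fixed step size 1/L is valid.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = b.size
    lipschitz = 2.0 * (np.linalg.eigvalsh(A)[-1] + lam)
    w = np.ones(n)                            # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * (A @ w + lam * w - b)
        w = project_to_scaled_simplex(w - grad / lipschitz, float(n))
    return w
```

Larger values of `lam` pull the solution toward uniform weights, which mirrors the role of the penalty on extreme weights described above; in practice a dedicated solver would replace the loop.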
3.1 Defining imbalance
Consider any population weights
To build intuition we start by explaining this decomposition in the case of two time periods
where the first equality follows from iterated expectation, the second from sequential ignorability, the fourth from iterated expectation and sequential ignorability, and the third and fifth from the following definitions, which exactly capture the difference between the two sides of the third and fifth equalities,
Note our use of
This gives a definition of discrepancy,
We can extend this decomposition to general horizons
The following result gives the general decomposition of the difference between weighted average of observed outcomes and true average of counterfactuals as the sum of
Theorem 1
Under Assumptions (1)–(2), for each
Based on the results of Theorem 1, it is clear that if we want the difference between average counterfactual outcomes and average weighted factual outcomes to be small for all treatment regimes
small for all
The empirical counterparts to
Thus, we will seek sample weights
The particular imbalance of interest is given when we consider
Differently, we will seek to find weights that directly minimize imbalance. There are two main challenges in this task. The first challenge is that the imbalance of interest depends on some unknown functions
3.2 Worst case imbalance
To overcome the fact that we do not actually know the functions
where
Given these, we can define the worst case discrepancies,
Note that
Then the worst case imbalance is given by
What is important to note is that this shows that the discrepancies of interest are essentially the same regardless of the particular treatment regime trajectory
3.3 Minimizing imbalance while controlling precision
We can obtain minimal imbalance by minimizing
where
In the next section, we discuss a specific choice of the norm that specifies the worst case discrepancies
3.4 RKHS and quadratic optimization to balance time-dependent confounders
An RKHS is a Hilbert space of functions which is associated with a kernel (the reproducing kernel). Specifically, any positive semi-definite kernel
The following theorem shows that if
Theorem 2
Define the matrix
and note that it is positive semidefinite by definition. Then, if the norm
where
As an aside, we note that, when
Based on Theorem 2, we can now express the worst case imbalance,
Finally, to obtain weights that balance covariates to control for time-dependent confounding while controlling precision we solve the quadratic optimization problem,
where
4 Practical guidelines
Solutions to the quadratic optimization problem (12) depend on several factors. First, they depend on the choice of the kernel and its hyperparameters. There are some existing practical guidelines on these choices [34,37], on which we rely as explained below. Second, they depend on the penalization parameter
For each
When using kernels, preprocessing the data is an important step. In particular, normalization is employed to avoid unit dependence and covariates with high variance dominating those with smaller ones. Consequently, we suggest, beforehand, to scale the covariates related to the treatment and confounder histories to have mean 0 and variance 1.
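The preprocessing step above can be sketched as follows, together with the Gram matrix of a product of two linear kernels, one on the treatment history and one on the confounder history, as used in Sections 5 and 7.2. This is a minimal sketch under stated assumptions: the linear kernel is taken to be the plain inner product without an offset or scale hyperparameter, and the function names are hypothetical.

```python
import numpy as np


def standardize(X):
    """Scale each column to mean 0 and variance 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # constant columns become all zeros instead of NaN
    return (X - mean) / std


def linear_gram(X):
    """Gram matrix of the linear kernel k(x, x') = <x, x'>."""
    return X @ X.T


def product_kernel_gram(treatment_history, confounder_history):
    """Gram matrix of the product of two linear kernels.

    The Gram matrix of a product kernel is the elementwise product of the
    two Gram matrices, which remains positive semidefinite.
    """
    A = standardize(treatment_history)
    X = standardize(confounder_history)
    return linear_gram(A) * linear_gram(X)
```

Standardizing before forming the Gram matrices keeps a covariate measured on a large scale from dominating the inner products.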
To tune the kernels’ hyperparameters and the penalization parameter
Another practical concern is how many lagged covariates to include in each of the kernels
Certain datasets, such as the one we study in Section 7.1, have repeated observations of outcomes at each time
In the case of a single, final observation of outcome, normalizing the weights, whether IPTW or KOW, does not affect the fitted MSM as it amounts to multiplying the least-squares loss by a constant factor. But in the repeated observation setting described above, normalizing each set of weights for each time period separately can help. Correspondingly, we can add a constraint to equation (12) that the mean of the weights must equal one for each time period separately, which we demonstrate in Section 7.1.
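For weights that have already been computed, the per-period normalization described above can be sketched in a few lines of Python. The function name and the long-format input (one weight and one period label per observation) are assumptions for illustration; within KOW itself the same effect is obtained through a constraint in the optimization problem rather than by post hoc rescaling.

```python
from collections import defaultdict


def normalize_by_period(weights, periods):
    """Rescale weights so that their mean equals one within each time period."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for w, t in zip(weights, periods):
        totals[t] += w
        counts[t] += 1
    # Dividing by the period mean (total / count) gives mean-one weights.
    return [w * counts[t] / totals[t] for w, t in zip(weights, periods)]
```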
Similar to refs [1,29,30], we suggest using Wald confidence intervals constructed using robust (sandwich) standard errors.
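A minimal sketch of this inference step for a linear MSM fit by weighted least squares is given below. It assumes one row per subject and uses the basic heteroskedasticity-robust (HC0-type) sandwich form with a normal critical value of 1.96; with repeated observations per subject a cluster-robust version would be needed instead. The function name and interface are hypothetical.

```python
import numpy as np


def wls_sandwich(X, y, w, z=1.96):
    """Weighted least squares with sandwich standard errors and Wald intervals.

    X: (n, p) design matrix, y: (n,) outcomes, w: (n,) nonnegative weights.
    Returns the coefficients, their robust standard errors, and a (p, 2)
    array of lower and upper confidence limits.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w                              # column i of X.T scaled by w_i
    bread = np.linalg.inv(XtW @ X)             # (X'WX)^{-1}
    beta = bread @ (XtW @ y)
    resid = y - X @ beta
    scores = XtW * resid                       # w_i * e_i * x_i, one column per unit
    meat = scores @ scores.T                   # sum_i w_i^2 e_i^2 x_i x_i'
    cov = bread @ meat @ bread
    se = np.sqrt(np.diag(cov))
    ci = np.column_stack([beta - z * se, beta + z * se])
    return beta, se, ci
```

The weights enter both the point estimate and the variance estimate, so extreme weights widen the resulting intervals.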
5 Simulations
In this section, we show the results of a simulation study aimed at comparing the bias and MSE of estimating the cumulative effect of a time-varying treatment on a continuous outcome by using an MSM with weights obtained by each of KOW, IPTW, sIPTW, and CBPS.
5.1 Setup
We considered two different simulated scenarios with
For the linear scenario, we drew the data from the following model:
where
For the nonlinear scenario, we drew the data from the following model:
where
The intercepts
In each scenario and for each replication, we computed two sets of KOW weights. We obtain the first by using the product of two linear kernels (
We computed the causal parameter of interest by using WLS, regressing the outcome on the cumulative treatment and using weights computed by each of the methods. Specifically, in the linear scenario, we computed weights using (1)
We used
5.2 Results
In this section, we discuss the results obtained in the simulation study across sample sizes and across values of the penalization parameter,
5.3 Across sample sizes
Figure 1 shows bias and MSE of the estimated time-varying treatment effect using KOW (solid), IPTW (dashed), sIPTW (dotted), and CBPS (dashed-dotted) when increasing the sample size from

Bias and MSE of the estimated time-varying treatment effect using KOW (solid), IPTW (dashed), sIPTW (dotted), and CBPS (dashed-dotted) when increasing the sample size from

Bias and MSE of the estimated time-varying treatment effect using KOW (solid), IPTW (dashed), sIPTW (dotted), and CBPS (dashed-dotted) when increasing the sample size from

Coverage of the 95% Wald confidence interval using KOW across sample sizes and across simulation scenarios: linear-correct (solid), linear-overspecified (dot-dashed), nonlinear-correct (dotted) and nonlinear-misspecified (dashed).
5.4 Across values of the penalization parameter, λ
Figures 4 and 5 show the ratios of squared biases (left panels) and of MSEs (right panels) when comparing KOW (denominator) with IPTW (solid), sIPTW (dashed), and CBPS (dotted) (numerators) across different values of

Ratios of squared biases and MSEs comparing KOW with IPTW (solid), sIPTW (dashed), and CBPS (dotted) across values of

Ratios of squared biases and MSEs comparing KOW with IPTW (solid), sIPTW (dashed), and CBPS (dotted) across values of
5.5 Computational time of KOW
In this section, we present the results of a simulation study aimed at comparing the mean computational time of KOW and CBPS. Compared to sIPTW based on pooled logistic regression, which is generally very fast, both KOW and CBPS have a nontrivial computational time that can grow with both the total number of time periods
Here we compare KOW, CBPS with full covariance matrix (CBPS-full), and CBPS with its low-rank approximation (CBPS-approx) when increasing the number of time periods and the number of covariates. Specifically, following the linear-correct scenario presented in Section 5.1, we fixed the sample size equal to
Solid lines of Figure 6 represent mean computational times for KOW, dashed for CBPS-full, and dotted for CBPS-approx. When the number of time periods was relatively small, the mean computational time of KOW was higher compared with both CBPS methods (left panel of Figure 6). However, the mean computation time of KOW over time periods increased linearly while that of both CBPS methods increased exponentially. This is due to the fact that, as presented in Section 3.1, the number of imbalances that we need to minimize grows linearly in the number of time periods. The mean computational time required by KOW when increasing the number of covariates remained constant, while it increased for both CBPS-full and CBPS-approx, with CBPS-full increasing more rapidly. In summary, KOW was less affected by the total number of time periods and covariates compared with CBPS with full and low-rank approximation matrix.
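The contrast between exponential and linear growth can be illustrated with a small Python sketch that enumerates treatment regimes. It assumes a binary treatment at each time period and, purely for illustration, takes the linearly growing count to be exactly one per time period; the function names are hypothetical.

```python
from itertools import product


def treatment_regimes(n_periods, levels=(0, 1)):
    """All treatment regimes over n_periods for the given treatment levels."""
    return list(product(levels, repeat=n_periods))


def growth_table(max_periods):
    """Rows of (T, number of regimes, a count that is linear in T)."""
    return [(T, len(treatment_regimes(T)), T) for T in range(1, max_periods + 1)]
```

For a binary treatment the number of regimes doubles with each additional period, whereas the linear count increases by one.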

Mean computational time in seconds of KOW (solid), CBPS with full covariate matrix (dashed), and CBPS with the low-rank approximation of the full matrix (dotted) over time periods when
Computing KOW required three steps: tuning the parameters, constructing the matrices for problem (12), and solving problem (12). On average, for
6 KOW with informative censoring
In longitudinal studies, participants may drop out of the study before the end of the follow-up time, and their outcomes are then missing. When this missingness is due to reasons related to the study (i.e., related to the potential outcomes), selection bias is introduced. This phenomenon is referred to as informative censoring, and it is common in the context of survival analysis, where the interest is in analyzing time-to-event outcomes. Under the assumptions of consistency, positivity, and sequential ignorability of both treatment and censoring, Robins et al. [43] showed that a consistent estimate of the causal effect of a time-varying treatment can be obtained by weighting each subject
In this section, we extend KOW to similarly handle informative censoring. We demonstrate that under sequentially ignorable censoring, minimizing the very same discrepancies as before at each time period, restricted to the units for which data are available, actually controls for both time-dependent confounding and informative censoring. Thus, KOW naturally extends to the setting with informative censoring.
Let
Let us redefine
Similar to Theorem 1, the following theorem shows that we can write the difference between the weighted average outcome among the uncensored
Theorem 3
Under Assumptions (1)–(2) and (13),
We then define the empirical counterparts to
where
7 Applications
In this section, we present two empirical applications of KOW. In the first, we estimate the effect of treatment initiation on time to death among PLWH. In the second, we evaluate the impact of negative advertising on election outcomes.
7.1 Effect of HIV treatment on time to death
In this section, we analyze data from the Multicenter AIDS Cohort Study (MACS) to study the effect of the initiation time of treatment on time to death among PLWH. Indeed, due to the longitudinal nature of HIV treatment and the presence of time-dependent confounding, MSMs have been widely used to study causal effects in this domain [3,30,41,44,45, among others]. As an example of time-dependent confounding, CD4 cell count, a measurement used to monitor immune defenses in PLWH and to make clinical decisions, is a predictor of both treatment initiation and survival, as well as being itself influenced by prior treatments. Recognizing the censoring in the MACS data, Hernán et al. [41] showed how to estimate the parameters of the MSM by inverse probability of treatment and censoring weighting (IPTCW).
Here, we apply KOW as proposed in Section 6 to handle both time-dependent confounding and informative censoring while controlling precision. We considered the following potential time-dependent confounders associated with the effect of treatment initiation and the risk of death: CD4 cell count, white blood cell count, red blood cell count, and platelets. We also identified the age at baseline as a potential time-invariant confounding factor. We considered only recently developed HIV treatments, thus including in the analysis only PLWH who started treatment after 2001. The final sample was composed of a total of
We compared the results obtained by KOW with those from IPTCW and sIPTCW. The latter sets of weights were obtained by using a logistic regression on the treatment history and the aforementioned time-invariant and time-dependent confounders and using only one time lag for each of the treatment and time-dependent confounders as done in previous approaches studying the HIV treatment using IPTCW [30,40,41]. The numerator of sIPTCW was computed by modeling
We estimated the hazard ratio of the risk of death by using a weighted Cox regression model [41] weighted by KOW, IPTCW, or sIPTCW and using robust standard errors [29]. We used
Effect of HIV treatment on time to death
| | KOW | | Logistic | |
|---|---|---|---|---|
| | | | IPTCW | sIPTCW |
| | 0.40∗ | 0.48∗ | 0.14 | 1.25 |
| SE | (0.30) | (0.28) | (1.15) | (0.30) |
Note:
7.2 Impact of negative advertising on election outcomes
In this section, we analyze a subset of the dataset from the study by Blackwell [5] to estimate the impact of negative advertising on election outcomes. Because of the dynamic and longitudinal nature of the problem and the presence of time-dependent confounders, MSMs have been previously used to study the question [5]. Specifically, poll numbers are time-dependent confounders as they might both be affected by negative advertising and might also affect future poll numbers. We constructed the subset of the data from the study by Blackwell [5] by considering the 5 weeks leading up to each of 114 elections held in 2000–2006 (58 US Senate, 56 US gubernatorial). Differently from Section 7.1, in which the outcome was observed at each time period, in this analysis the binary election outcome was observed only at the end of each 5-week trajectory. In addition, all units were uncensored.
We estimated the parameters of two MSMs, the first having separate coefficients for negative advertising in each time period and the second having one coefficient for the cumulative effect of negative advertising. Each MSM was fit using weights given by each of KOW, IPTW, sIPTW, and CBPS (both full and approximate). We used the following time-dependent confounders: Democratic share of the polls, proportion of undecided voters, and campaign length. We also used the following time-invariant confounders: baseline Democratic vote share, proportion of undecided voters, status of incumbency, and election year and type of office. We obtained two sets of KOW weights by using a product of (1) two linear kernels, one for the history of negative advertising and one for the confounder history (
Table 2 summarizes the results of our analysis, reporting robust standard errors [29]. The first six rows of Table 2 show the effect of the time-specific negative advertising. The last two rows present the cumulative effect of negative advertising. KOW (
Impact of negative advertising on election outcomes
| | KOW | | Logistic | | CBPS | |
|---|---|---|---|---|---|---|
| SE | | | IPTW | sIPTW | Full | Approx |
| Intercept | | | | | | |
| | (2.15) | (2.38) | (2.88) | (2.98) | (2.70) | (2.39) |
| Negative | 2.43 | 3.27 | 4.41 | | | |
| | (1.86) | (1.86) | (2.56) | (3.26) | (2.49) | (2.22) |
| Negative | 3.73 | 3.24 | | 3.17 | 3.55 | 2.65 |
| | (2.18) | (2.22) | (2.38) | (3.19) | (2.73) | (2.42) |
| Negative | | | | | | |
| | (2.34) | (2.45) | (2.54) | (3.84) | (3.20) | (3.24) |
| Negative | | | | 2.34 | | |
| | (2.57) | (2.75) | (1.54) | (3.11) | (3.71) | (3.59) |
| Negative | | | | | | |
| | (1.42) | (1.59) | (2.19) | (2.59) | (1.62) | (1.54) |
| Intercept | | | | | | |
| | (2.45) | (2.63) | (4.29) | (4.15) | (2.68) | (2.49) |
| Cumulative | | | | 1.91 | | |
| | (0.58) | (0.64) | (1.57) | (1.15) | (0.65) | (0.77) |
Note:
8 Conclusion
In this article, we presented KOW, which finds weights for fitting an MSM with the aim of balancing time-dependent confounders while controlling for precision. That KOW uses mathematical optimization to directly and fully balance covariates as well as optimize precision explains the better performance of KOW over IPTW, sIPTW, and CBPS observed in our simulation study. In addition, as shown in Sections 3.2, 5, and 6, the proposed methodology only needs to minimize a number of discrepancies that grows linearly in the number of time periods, mitigates the possible misspecification of the treatment assignment model, allows balancing non-additive covariate relationships, and can be extended to control for informative censoring, which is a common feature of longitudinal studies.
Alternative formulations of our imbalance-precision optimization problem, equation (10), may be investigated. For example, additional linear constraints can be added to the optimization problem, as shown in the empirical application of Section 7.1, and different penalties can be considered to control for extreme weights. For instance, in equation (10), at the cost of no longer being able to use convex-quadratic optimization, one may directly penalize the covariance matrix of the WLS estimator rather than use a convex-quadratic surrogate as we do.
One may also change the nature of precision control. Here, we suggested penalization in an attempt to target total error. Alternatively, similar to ref. [49], we may reformulate equation (10) as a constrained optimization problem where the precision of the resulting estimator is constrained by an upper bound
Our approach is flexible in that any of these changes amounts to simply modifying the optimization problem that is fed to an off-the-shelf solver. Indeed, we were able to extend KOW from the standard longitudinal setting to also handle both repeated observations of outcomes and informative censoring. In addition to offering flexibility, the optimization approach we took, which directly and fully minimized our error objective phrased in terms of covariate imbalances, was able to offer improvements on the state of the art.
Acknowledgements
This article is based upon work supported by the National Science Foundation under Grants Nos. 1656996 and 1740822.
- Funding information: This article is based upon work supported by the National Science Foundation under Grants Nos. 1656996 and 1740822.
- Conflict of interest: Authors state no conflict of interest.
Appendix
Proof of Theorem 1
For clarity, we prove this for