Efficient and flexible mediation analysis with time-varying mediators, treatments, and confounders

Abstract: Understanding the mechanisms of action of interventions is a major goal of scientific inquiry. The collection of statistical methods that use data to achieve this goal is referred to as mediation analysis. Natural direct and indirect effects provide a definition of mediation that matches scientific intuition, but they are not identified in the presence of time-varying confounding. Interventional effects have been proposed as a solution to this problem, but existing estimation methods are limited to assuming simple (e.g., linear) and unrealistic relations between the mediators, treatments, and confounders. We present an identification result for interventional effects in a general longitudinal data structure that allows flexibility in the specification of treatment-outcome, treatment-mediator, and mediator-outcome relationships. Identification is achieved under the standard no-unmeasured-confounders and positivity assumptions. We study semi-parametric efficiency theory for the functional identifying the mediation parameter, including the non-parametric efficiency bound, and use it to propose non-parametrically efficient estimators. Implementation of our estimators relies only on the availability of regression algorithms, and the estimators are developed in a general framework that allows the analyst to use arbitrary regression machinery. The estimators are doubly robust, √n-consistent, and asymptotically Gaussian under slow convergence rates for the regression algorithms used. This allows the use of flexible machine learning for regression while permitting uncertainty quantification through confidence intervals and p-values. A free and open-source R package implementing the methods is available on GitHub.
We apply the proposed estimator to a motivating example from a trial of two medications for opioid-use disorder, where we estimate the extent to which differences between the two treatments on risk of opioid use are mediated by craving symptoms.


Introduction
Mediation analyses are paramount to scientific inquiry, as they provide a way of understanding the mechanisms through which effects operate [1]. For example, recent mediation analyses have helped to uncover the types of immune response that coronavirus disease 2019 vaccines trigger in order to prevent disease [2].
Multiple methods have been developed for mediation analysis in the setting of a mediator measured at a single time point, using a counterfactual or potential outcome framework. For example, the natural direct effect is defined as the effect of exposure/treatment that would have been seen in a hypothetical world where the effect operating through the mediator is disabled. Conditions for identifiability for the natural direct effect and its indirect counterpart have been given in [3] and [4].
Though the definition of the natural (in)direct effects is scientifically interesting, their identification requires assumptions about the real world that are guaranteed not to hold in many practical situations, such as under time-varying confounding (cf. the recanting witness problem, [5]). This restriction limits the applicability of natural direct and indirect effects in practice, as intermediate confounders are expected to be present in many applications, for example, when the effect of an intervention operates through adherence [6]. Several methods have been proposed to do away with the cross-world independence assumption and/or the assumption of no intermediate confounders, for example, partial identification methods [7][8][9], so-called separable effects [7,10], and effects defined in terms of stochastic interventions on the exposure [11,12]. In this article, we tackle this problem by focusing on mediation effects defined in terms of contrasts between counterfactuals in hypothetical worlds in which the treatment is set to some value deterministically, whereas the mediator is drawn stochastically from its counterfactual distribution under interventions on treatment [13][14][15]. Efficient non-parametric estimators that leverage machine learning to alleviate model misspecification bias have recently been proposed for these parameters [16,17], but they are limited to single time-point exposures. These effects have been called interventional effects [18].
Mediation analysis methodologies for time-varying exposures and mediators are limited. Non-parametric identification formulas for interventional effects in the case of mediators and treatments measured longitudinally, assuming that the time-dependent covariates are measured either before or after the mediator, but not both, are given in refs [18,19]. The authors propose estimation methods that rely on the unlikely ability to correctly specify parametric models for the distribution of the unobservable counterfactual outcomes. Similar interventional effects where the mediator is drawn from its counterfactual distribution conditional on all the past are proposed in ref. [20], together with non-parametric efficient estimators that rely on data-adaptive regression to alleviate model misspecification bias. However, the "direct effect" defined in ref. [20] does not capture the pathway from treatment through intermediate confounder to outcome and is thus not a direct effect in the sense that we are interested in this article. Similar methods for survival analysis and treatment at a single time point have also been proposed [21]. Marginal structural models for longitudinally measured mediators under treatment at a single time point and no loss-to-follow-up are proposed in ref. [22]. More recent work includes the development of a 4-way decomposition of immediate and delayed effects using an infinite time-horizon in a reinforcement learning framework [23], and other methods considering time-varying mediators but a single exposure at baseline [21,[24][25][26][27].
In this article, we develop a general longitudinal causal mediation approach that fills several gaps in the aforementioned literature. The method we propose satisfies the following: (i) the direct effect is defined in terms of the effects operating through all the pathways that do not include the mediator at any time point, (ii) the methods allow for longitudinally measured mediators and treatments, with confounders possibly measured before and after treatment, (iii) the methods allow the use of data-adaptive regression to alleviate concerns of model misspecification bias, and (iv) the methods are endowed with formulas to construct efficient estimators and computation of valid standard errors and confidence intervals, even under the use of data-adaptive regression. One limitation of prior work remains: our proposed methods can only handle categorical mediators, and the computational complexity increases with the number of categories.
The remainder of this article is organized as follows. In Section 2, we introduce the parameters of interest as well as the identification result; in Section 3, we discuss simple estimators based on inverse probability weighting and sequential regression; in Section 4, we discuss efficiency theory for the interventional mediation functional, presenting estimating equations and the efficiency bound in the non-parametric model; and in Section 5, we discuss the proposed efficient estimator as well as its asymptotic properties. In Section 6, we present the results of numerical studies, and finally, in Section 7, we present the results of an illustrative study on estimating the longitudinal effect of initiating treatment for opioid-use disorder (OUD) with extended-release naltrexone (XR-NTX) vs buprenorphine-naloxone (BUP-NX) on risk of illicit opioid use during the fourth week of treatment that operates through the mediator of craving symptoms, using symptoms of depression and withdrawal as time-varying covariates. We include weekly measures for each of the first 4 weeks of treatment.
2 Notation and definition of interventional (in)direct effects

Let X = (L_1, A_1, Z_1, M_1, ..., L_τ, A_τ, Z_τ, M_τ, L_{τ+1}) denote the observed data, where A_t denotes a vector of intervention variables such as treatment and/or loss-to-follow-up, Z_t denotes intermediate confounders, M_t denotes a mediator of interest, and L_t denotes the time-varying covariates.
The outcome of interest is the variable Y = L_{τ+1} measured at the end of the study. We let Pf = ∫ f(x) dP(x) for a given function f(x). We use P_n to denote the empirical distribution of X_1, ..., X_n and assume P is an element of the non-parametric statistical model defined as all continuous densities on X with respect to a dominating measure ν. We let E denote the expectation with respect to P. We use W̄_t = (W_1, ..., W_t) to denote the past history of a variable, (W_t, ..., W_τ) to denote the future of a variable, and H_{A,t} to denote the history of all variables up until just before A_t. The random variables H_{Z,t}, H_{M,t}, and H_{L,t} are defined similarly, and we also assume that M_t takes values on a finite set. The variables L_t and Z_t are allowed to take values on any set; i.e., they can be multivariate, continuous, etc.
We formalize the definition of the causal effects using a non-parametric structural equation model [28]. Specifically, for each time point t, we assume the existence of deterministic functions f_{L,t}, f_{A,t}, f_{Z,t}, and f_{M,t} such that L_t = f_{L,t}(H_{L,t}, U_{L,t}), A_t = f_{A,t}(H_{A,t}, U_{A,t}), Z_t = f_{Z,t}(H_{Z,t}, U_{Z,t}), and M_t = f_{M,t}(H_{M,t}, U_{M,t}). Here, U = (U_{L,t}, U_{A,t}, U_{Z,t}, U_{M,t} : t = 1, ..., τ) is a vector of exogenous variables with unrestricted joint distribution. This model can be expressed in the form of a directed acyclic graph (DAG) as in Figure 1. In this article, we are concerned with the definition and estimation of the causal effect of an intervention on Ā on Y, as well as its decomposition in terms of the effect through all paths involving the mediator M̄ versus effects through all other mechanisms that do not involve any component of M̄. Mediation effects will be defined in terms of hypothetical interventions where the equations f_{A,t} and f_{M,t} are removed from the structural model, and the treatment and mediator nodes are externally assigned as follows. Let

G(ā) denote a random draw from the counterfactual distribution of the mediator vector M̄ under an intervention setting Ā = ā. For fixed regimes ā′ and ā⋆, the total effect of Ā on Y then decomposes into interventional indirect and direct effects as

E[Y(ā′, G(ā′))] − E[Y(ā⋆, G(ā⋆))] = {E[Y(ā′, G(ā′))] − E[Y(ā′, G(ā⋆))]} + {E[Y(ā′, G(ā⋆))] − E[Y(ā⋆, G(ā⋆))]},   (1)

where the first contrast in braces is the indirect effect (through the mediator) and the second is the direct effect.
This is the definition of interventional effect in a longitudinal setting given by ref. [18]. In what follows, we focus on the identification and estimation of the parameters E[Y(ā′, G(ā⋆))] for fixed ā′ and ā⋆, from which we can obtain the aforementioned effects. In a slight abuse of language, we will also refer to parameters of the type E[Y(ā′, G(ā⋆))] as effects. The aforementioned setup allows for the definition of causal effects for time-to-event outcomes subject to loss-to-follow-up or censoring as follows. Let A_t = (A_{1,t}, A_{2,t}), where A_{1,t} denotes the exposure at time t, A_{2,t} is equal to one if the unit remains uncensored at time t + 1 and zero otherwise, and Y = L_{τ+1} denotes the event-free status at the end of study follow-up. Assume monotone loss-to-follow-up, so that A_{2,t} = 0 implies A_{2,k} = 0 for all k > t, in which case all the data for k > t become degenerate. In this case, we could define the effects as above with regimes that set A_{2,t} to "not censored" for everyone. This definition of causal effect for time-to-event outcomes in terms of interventional effects bypasses some (but not all) problems that occur with natural effects due to the fact that a counterfactual longitudinal mediator may be truncated by death, which may render the counterfactual survival time undefined [29]. Furthermore, recent work has uncovered limitations of these effects as a tool to unveil mechanisms [30]. These and other issues related to the interpretation of these direct and indirect effects are discussed at length elsewhere; interested readers are encouraged to consult the cited articles.
In what follows, we will use the notation H′_{A,t}, H′_{Z,t}, H′_{M,t}, and H′_{L,t} to refer to the intervened histories of the variables under an intervention setting Ā = ā′; the histories H⋆_{A,t}, H⋆_{Z,t}, H⋆_{M,t}, and H⋆_{L,t} under Ā = ā⋆ are defined analogously.

Assumption 2. (Positivity of treatment and mediator assignment) Assume: (i) P(A_t = a′_t | H_{A,t}) > 0 and P(A_t = a⋆_t | H_{A,t}) > 0 for all t; (ii) P(M_t = m_t | H′_{M,t}) > 0 for all t and m_t; and (iii) P(M_t = m_t | H⋆_{M,t}) > 0 for all t and all m_t.
Assumptions 1(i), 1(ii), and 1(iii), together with the assumed DAG in Figure 1, state that H_{A,t} contains all the common causes of A_t and Y, H_{M,t} contains all the common causes of M_t and Y, and H_{A,t} contains all the common causes of A_t and M_t, respectively. These are the standard assumptions of no unmeasured confounders for the treatment-outcome, mediator-outcome, and treatment-mediator relations. Assumptions 2(i) and 2(ii) rule out deterministic treatment and mediator assignment, allowing the identification of E[Y(ā′, G(ā⋆))]. Then, θ is identified by the mediational g-computation formula stated in Theorem 1. This identification result generalizes several identification formulas in the literature. When τ = 1, Z = ∅, and V = L_1, this identification formula yields (by computing the corresponding contrasts) the identification formula for the natural direct and indirect effects as derived by ref. [4] under an additional cross-world counterfactual assumption. When τ = 1 and V = L_1, and the confounder Z is present, this identification formula yields the identification formula for the interventional effect described by ref. [15]. If τ > 1, V = L_1, and Z_t = ∅ for all t, θ in Theorem 1 reduces to Formula (1) in ref. [18]. If τ > 1 and L_t = ∅ for t ≤ τ, θ in Theorem 1 reduces to the identification results for Figure 5 given on page 926 of ref. [18]. Our result is a particular case of the identification results of ref. [31].
The choice of the variable V in defining the effect should be guided by the total effect one wishes to decompose. For instance, if V = L_1, the total effect in (1) will be closer to the average treatment effect. In what follows, we assume for simplicity of presentation that V = ∅, but the estimators presented can be easily adapted to other cases.

Inverse probability weighting and sequential regression
The identification formula in Theorem 1 is a complex functional of the distribution P of the observed data, and it is therefore hard to estimate. Consider, for example, a plug-in estimation strategy where the relevant components of the likelihood are estimated and then plugged into the identifying functional. This would involve finding estimates of the densities of L_t and Z_t conditional on the past data H_{L,t} and H_{Z,t}. If L_t and Z_t are continuous or multivariate, or if H_{L,t} and H_{Z,t} are high-dimensional, these densities may be hard to estimate, complicating the construction of a plug-in estimator. Furthermore, the consistency of such an estimator would depend on our ability to consistently estimate the densities involved. In general, this would require the use of flexible estimators that allow automatic selection of non-linearities, interactions, etc. However, there is no statistical theory that characterizes the sampling distribution (asymptotic or finite-sample) of such an estimator. This results in a lack of a theoretical foundation for the construction of confidence intervals, standard errors, p-values, and other quantities that are often of interest to quantify the uncertainty of statistical estimates.
To address this problem, we propose an estimator based on the theory of doubly robust unbiased transformations [32], targeted minimum loss-based estimation [33,34], and double machine learning [35]. Concepts from these frameworks will be used to construct estimators that allow us to use machine learning to address model misspecification bias, while also achieving asymptotic normality, which yields formulas for the computation of standard errors, confidence intervals, and p-values. The asymptotically normal estimators will be constructed in Section 5. This section focuses on describing the two main building blocks: an inverse probability weighted estimator and a sequential regression estimator.

Inverse probability weighted estimation
We start the description of our proposed estimators by observing that an alternative representation of the identifying functional allows inverse probability weighted estimation without reliance on estimating the conditional densities of L_t and Z_t. Specifically, define cumulative weight variables K′_t, H_t, and K⋆_t as products over time of the inverse probabilities of the observed treatment and mediator values under the interventions ā′ and ā⋆. Note that we omit the dependence on ā′, ā⋆, and m̄_t, and we will do so whenever there is no ambiguity. Then, the identifying functionals in Theorem 1 can also be written as weighted expectations. These formulas reveal a simple way to construct an estimator of θ using re-weighting. First, models for the probability distributions of A_t and M_t can be fit, e.g., using flexible regression or classification methods. Then, these models are used to compute estimates of the weights K′_{τ+1}, H_{τ+1}, and K⋆_{τ+1}. These weights are used to estimate φ and λ, where the outer expectations are estimated using empirical means of the weighted outcomes Y and 1{M̄ = m̄}. The estimators of φ and λ can then be plugged into the formula for θ. While this estimator is attractive for its simplicity, it has two major drawbacks. First, it relies on the ability to correctly estimate the treatment and mediator conditional distributions. If flexible estimators from the regression or classification literature (e.g., generalized linear models with ℓ1 penalties, or random forests) are used to increase the chances of correct specification, there are no general theoretical results that allow the computation of standard errors or confidence intervals with correctness guarantees. Second, this estimator is generally inefficient in the non-parametric model.
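As a concrete illustration, the following sketch implements the re-weighting logic for a single time point with no intermediate confounder (τ = 1, Z = ∅), where the weight reduces to a treatment propensity factor times a mediator density ratio. All variable names and the data-generating mechanism are hypothetical, and the minimal Newton-Raphson logistic fits stand in for the arbitrary regression machinery discussed above; this is a sketch, not the package implementation.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Minimal Newton-Raphson logistic regression (illustrative only)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X1 @ b))
        W = p * (1 - p)
        b += np.linalg.solve(X1.T @ (W[:, None] * X1), X1.T @ (y - p))
    return b

def predict_p(X, b):
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1 / (1 + np.exp(-X1 @ b))

rng = np.random.default_rng(0)
n = 5000

# Hypothetical single-time-point data: covariate L, treatment A,
# binary mediator M, outcome Y.
L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L)))
M = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * L + 0.8 * A))))
Y = 1 + A + 0.5 * M + 0.2 * L + rng.normal(size=n)

# Step 1: fit models for the treatment and mediator distributions.
b_A = fit_logistic(L[:, None], A)
b_M = fit_logistic(np.column_stack([L, A]), M)

# Step 2: weights for E[Y(a' = 1, G(a* = 0))]: indicator of the target
# treatment over its propensity, times the ratio of the mediator density
# under a* to the density under a'.
pA1 = predict_p(L[:, None], b_A)
pM1_a1 = predict_p(np.column_stack([L, np.ones(n)]), b_M)
pM1_a0 = predict_p(np.column_stack([L, np.zeros(n)]), b_M)
dens_obs_a1 = np.where(M == 1, pM1_a1, 1 - pM1_a1)
dens_obs_a0 = np.where(M == 1, pM1_a0, 1 - pM1_a0)
w = (A / pA1) * (dens_obs_a0 / dens_obs_a1)

# Step 3: weighted (Hajek-normalized) mean of the outcome.
theta_ipw = np.sum(w * Y) / np.sum(w)
```

Under this toy mechanism, the target value is 2 + 0.5·E[expit(0.3L)] = 2.25, and the re-weighted mean recovers it up to sampling error; in the longitudinal case the weights become products over time points.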
Due to these issues, the use of the inverse probability weighted estimator is generally discouraged in practice. However, the weights K and H will be fundamental to the construction of the efficient and asymptotically normal estimators in Section 5 and of the efficiency bound in Section 4.

Sequential regression representation of the mediational g-computation formula
The construction of efficient and asymptotically normal estimators in Section 5 will rely on sequential regression procedures to estimate φ and λ. Those estimators are motivated by an alternative representation of θ in Theorem 1 in the form of sequential regressions, which we present in this section. This approach was first proposed by Bang and Robins [36] and has become standard in the estimation of total causal effects in longitudinal studies [34,37]. The sequential regression procedure computes the integrals in Theorem 1 through an alternative representation that iterates backward in time, starting at the last time point of the study, t = τ + 1, and ending at t = 0. To start the iteration, set Q_{L,τ+1} = Y. For fixed values ā′ and m̄, and for t = τ, ..., 0, recursively define the random variables Q_{L,t}, Q_{Z,t}, and Q_{M,t} as the iterated conditional expectations in (2) and (3), where we note that Q_{L,t} is a random variable through its dependence on H_{M,t}, but that m_t is a non-random user-given value. Likewise, Q_{Z,t} is a random variable only through its dependence on H_{A,t}. To simplify notation, we will sometimes omit the dependence of these functions on H_{A,t}, H_{M,t}, and m̄. In the proof of Proposition 1 (available in the Supplementary Materials), we show that this recursion recovers φ(m̄). The counterfactual distribution λ(m̄) may be identified through an analogous recursion that starts at an indicator of the mediator values. This leads to an alternative expression for the mediational g-computation formula in terms of sequential regressions. The algorithm requires the estimation of sequential regressions as in (2) and (3). There are at least two alternatives to estimate these sequential regressions. The first is to perform the sequential regressions separately for each value m̄ in the range of M̄. The second is to construct a pooled dataset containing all values m̄ in the range of M̄, obtaining the estimates for all m̄ from a single sequential regression procedure. In this article, we pursue the second approach; the values m̄ then play the role of another variable to be included in the regression algorithms.
Specifically, under this data pooling approach, estimating Q_{L,0} in a study with τ = 5 and a mediator taking on three different values involves five sequential regressions, each in datasets of size 3n, 3²n, 3³n, 3⁴n, and 3⁵n, respectively, where the predictor set decreases in size as the dataset increases in size. In this setting, smoothing over the mediator values m̄ may be carried out through any regression procedure, such as a parametric model or a more flexible regression approach from the machine or statistical learning literature. A detailed algorithm with pseudo-code for the computation of a sequential regression estimator is given in Algorithm S1 in the supplementary materials.
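To make the pooling mechanics concrete, the following sketch performs one step of the recursion for a hypothetical dataset with a three-level mediator: a fitted regression is evaluated at each mediator value m, and the predictions are stacked into a long-format array in which m appears as an ordinary regressor for the next sequential regression. The names and the toy data-generating process are ours, not the package's.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Toy single-time-point data with a 3-level mediator (names hypothetical).
L = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
M = rng.integers(0, 3, size=n)
Y = 1 + A + 0.5 * M + 0.2 * L + rng.normal(size=n)

# Outer regression: E[Y | L, A, M] via least squares on main terms.
X = np.column_stack([np.ones(n), L, A, M])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]

# Evaluate the fit at each mediator value m, then stack the predictions
# into a pooled long-format array (3n rows) whose "m" column becomes a
# regressor for the next step of the recursion.
levels = np.array([0, 1, 2])
pooled_rows = []
for m in levels:
    Xm = np.column_stack([np.ones(n), L, A, np.full(n, m)])
    Q_m = Xm @ beta  # predicted pseudo-outcome at M = m
    pooled_rows.append(np.column_stack([L, A, np.full(n, m), Q_m]))
pooled = np.vstack(pooled_rows)  # shape (3n, 4): L, A, m, pseudo-outcome
```

For τ = 2, repeating this step on the pooled array (now with m as a covariate) produces the 3²n-row dataset described above.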
If the pooled sequential regressions are performed within a priori correctly specified parametric models, then estimators of θ based on Proposition 1 may be shown to be consistent and asymptotically normal (CAN), and the delta method or the non-parametric bootstrap may be used to construct confidence intervals. However, positing correct parametric models for the sequential regressions involved is generally unattainable a priori. This is especially difficult in the longitudinal setting. Even if it were known that the observed data followed a given parametric model, correctly specifying models for the sequential regressions such that they are congenial with that model requires specialized constructions and might be impossible [38]. Furthermore, in most cases, especially with a large number of variables, the data may not be accurately modeled using parametric models; data-adaptive regression (e.g., machine learning) tools that offer flexibility in modeling non-linearities and interactions are then necessary to attain consistency of the sequential regressions and therefore consistency of the estimator of θ.
Under model selection or data-adaptive regression, the sampling distribution of the aforementioned sequential regression estimator is generally unknown, which hinders computation of confidence intervals and other uncertainty measures. In the next section, we discuss efficiency theory for estimation of θ, which will allow the use of data-adaptive regression techniques while also allowing the computation of valid (under assumptions) standard errors and confidence intervals.

Efficiency theory
The foundations of our proposed efficient estimation approach are in semi-parametric estimation theory (e.g., [39-42]) and in the theory for doubly robust estimation of causal effects using sequential regression (e.g., [33,34,36,37,43-47]). Central to this theory is the study of the non-parametric efficient influence function (EIF), or canonical gradient, which characterizes the efficiency bound of the longitudinal mediation functional given in Theorem 1 and allows the development of estimators under slow convergence rates for the nuisance parameters involved [41]. Specifically, our estimators will involve finding von-Mises-type approximations for the parameters Q_{L,t}(m̄), Q_{Z,t}(m̄), and Q_{M,t}(m̄), which can be intuitively understood as first-order expansions with second-order error remainder terms. Because the errors in the expansion are second-order, the resulting estimator of θ will be consistent and asymptotically normal at √n-rate as long as the second-order error terms converge to zero at n^{-1/2}-rate. This convergence rate would be satisfied, for example, if all the regression functions used for estimation converge at n^{-1/4}-rate. We will elaborate on this discussion in Section 5 when we present the asymptotic normality theorems for the proposed estimators. Let η denote the collection of nuisance parameters, namely the treatment and mediator probabilities and the sequential regression functions for t = 0, ..., τ. Whenever necessary, we make explicit the dependence of these functions on η using notation such as Q_{L,t}(η).

Lemma 1. (von-Mises-type approximation for Q_{Z,t}, Q_{L,t}, and Q_{M,t}) Let η̃ denote an arbitrary value of η. For second-order terms R, we have first-order expansions in which each parameter evaluated at the true η equals its value at η̃ plus a conditional expectation of a correction term D(η̃) plus a remainder R(η̃, η); the remainders are second-order error terms involving expectations of products of errors between true and estimated nuisances, and their explicit form is given in the supplementary materials. This lemma, and specifically the second-order form of these remainder terms, has important implications for estimation. It says that, for any value η̃, which could represent an inconsistent estimator, regressing a pseudo-outcome constructed from D(η̃) recovers the true parameter up to the second-order error R. Since R is a sum of products of errors, it may be reasonable to assume that this term is small for data-adaptive regression estimators of η. Specifically, in Section 5, consistency and asymptotic normality of the estimators will require that R_{L,t}(η̃, η) converges to zero in probability at n^{-1/2}-rate or faster. The second-order structure of this remainder term means that this fast convergence rate can be achieved under slower convergence rates for each of the components of η. For example, it is achievable if all the components of η converge in probability to their correct limits at n^{-1/4}-rate. This kind of rate is achievable by several data-adaptive regression methods, such as ℓ1 regularization [48], regression trees [49], neural networks [50], or the highly adaptive lasso [51]. Analogous considerations apply to the estimation of Q_{Z,t} and Q_{M,t}.
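For intuition about why products of errors yield the rate gains above, consider the much simpler single-time-point functional ψ(P) = E[E[Y | A = 1, W]]; this is our own illustrative analogue, not the remainder of Lemma 1 itself. Writing μ(W) = E[Y | A = 1, W] and g(W) = P(A = 1 | W), the one-step von-Mises expansion at a guess (μ̃, g̃) has remainder

```latex
R(\tilde\eta, \eta)
  = E\!\left[ \frac{g(W) - \tilde g(W)}{\tilde g(W)}
      \,\{\mu(W) - \tilde\mu(W)\} \right],
```

which, by the Cauchy-Schwarz inequality, is o_P(n^{-1/2}) whenever the product of the L2 errors of g̃ and μ̃ is o_P(n^{-1/2}), e.g., when each nuisance converges at n^{-1/4}-rate; the remainders in Lemma 1 have the same product structure.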
Remark 1. (Generalization to V ≠ ∅) When V = ∅ (as we have assumed throughout), Lemma 1 can be used to prove that the function D_{Z,1} is the uncentered EIF of φ(m̄). If V is discrete, the EIF of φ(m̄, v) will have an additional term that accounts for conditioning on V. When V is continuous or high-dimensional, φ(m̄, v) is not pathwise differentiable, and the EIF does not exist. Nonetheless, the lemma shows that D_{Z,1} is a doubly robust estimating function for φ(m̄, v) (meaning that the corresponding estimators will have second-order errors, see [32]), regardless of the dimension of V, which implies that estimators of θ that use this lemma for estimating φ(m̄, v) will satisfy an appropriately modified version of the asymptotic properties outlined below.
An application of the delta method along with Lemma 1 yields the EIF for the estimation of θ in the non-parametric model, which we denote S(X; η) (see the supplementary materials for a proof). An estimator whose influence function equals the EIF is asymptotically efficient and Gaussian, licensing the construction of Wald-type confidence intervals and hypothesis tests based on the central limit theorem. In the following section, we describe an algorithm to construct such an estimator.

Efficient estimation using sequential doubly robust regression
Motivated by Lemma 1, we construct a sequentially doubly robust estimator based on regressions of pseudo-outcomes as follows. Sequential doubly robust estimators decouple the estimation of conditional expectations at time t from the consistent estimation of the sequences of estimators at times t + 1, ..., τ, thereby achieving extra robustness to model misspecification. To introduce sequential double robustness, define data-dependent versions of the sequential regression parameters in which the expectation is taken with respect to the distribution of X only, i.e., the estimator η̂ is considered fixed. Then, we have the following result.
Here, we emphasize that the error terms only depend on the consistency of the regression procedure used at time point t and not on the consistency of estimators at any time point s > t. To illustrate why this is important, consider an alternative estimation strategy in which the parameters Q_{L,t}, Q_{Z,t}, and Q_{M,t} are estimated directly (i.e., based on (2) and (3)). Note that Q_{L,t}, Q_{Z,t}, and Q_{M,t} are functions of the sequences of regressions at all later time points. It therefore appears that consistent estimation of these parameters at time t requires consistent estimation of these future sequences. In fact, an analysis of such a one-step estimator would yield error terms involving the future regressions. Unlike our error terms (which are stated in terms of convergence to Q_{Z,t}), convergence of these error terms would require consistency of Q_{L,s} and Q_{Z,s+1} for all s ≥ t. The sequential regression construction using pseudo-outcomes avoids this problem. This lemma is the result of an application of Lemma 4 in ref. [52]; more discussion on sequential double robustness may be found in refs [34,37,53].
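The pseudo-outcome idea can be illustrated in a stripped-down setting (our own toy example, not the paper's algorithm): to estimate Q(l) = E[Y(1) | L = l], form the doubly robust transformation A/g(L)·(Y − μ̃(L)) + μ̃(L) and regress it on L. Even with a deliberately inconsistent outcome guess μ̃ ≡ 0, regressing the pseudo-outcome recovers Q because the propensity factor is correct, mirroring how the error terms at time t do not depend on consistency at later times.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Toy setup: the target is Q(l) = E[Y(1) | L = l] = 1 + l
# (all names hypothetical).
L = rng.normal(size=n)
g = 1 / (1 + np.exp(-L))        # true propensity P(A=1|L), assumed known here
A = rng.binomial(1, g)
Y = 1 + L + rng.normal(size=n)  # outcome does not depend on A in this toy

# Deliberately inconsistent outcome-regression guess (mu_tilde = 0):
mu_tilde = np.zeros(n)

# Doubly robust pseudo-outcome: inverse-weighted residual plus the guess.
pseudo = A / g * (Y - mu_tilde) + mu_tilde

# Regressing the pseudo-outcome on L still recovers Q(l) = 1 + l,
# because the propensity factor is correct (double robustness).
X = np.column_stack([np.ones(n), L])
coef = np.linalg.lstsq(X, pseudo, rcond=None)[0]
```

Both fitted coefficients are close to 1, i.e., the regression recovers 1 + l despite the wrong outcome guess.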

Cross-fitting and asymptotic normality
Although the estimator of Algorithm 1 is sequentially doubly robust, thereby offering protection against model misspecification, it is not in general asymptotically normal or efficient. Proving asymptotic normality and efficiency of this estimator would require the use of regression estimators that output functions in Donsker classes, which can be roughly understood as classes of functions with bounded complexity [54]. This would importantly limit the ability to use flexible regression estimators that can further mitigate model misspecification bias.
To solve this problem, we propose a cross-fitted sequentially doubly robust estimator, which avoids assumptions on the complexity of the regression estimators used [35,55-57]. Cross-fitting allows us to fully exploit the EIF given in Theorem 2 to construct estimators that are efficient and CAN under the use of flexible regression algorithms from the statistical and machine learning literature. These regression estimators are often better than parametric models at capturing the true functional form of the regression functions, therefore making the error terms R_{Z,0}(η̂, η) and R_{M,0}(η̂, η) small. The cross-fitting algorithm is described in Algorithm S2, and the cross-fitted sequential regression algorithm using pooled datasets is described in Algorithm S3 in the supplementary materials. An R package implementing the algorithm is available on GitHub, and simulations testing the correctness of our implementation are given in the supplementary materials. Our proposed cross-fitted estimation algorithm satisfies sequential double robustness in the sense of Lemma 2. Furthermore, it satisfies the following weak convergence result.
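The sample-splitting logic can be sketched as follows for a simple one-step estimator of a mean counterfactual outcome (a hypothetical, stripped-down analogue of Algorithms S2-S3, with all names ours): nuisance models are fit on the training folds only, the uncentered EIF is evaluated on the held-out fold, and fold-specific contributions are averaged.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 4000, 5

# Toy data for a cross-fitted one-step estimator of E[Y(1)] = 2.
L = rng.normal(size=n)
g_true = 1 / (1 + np.exp(-L))
A = rng.binomial(1, g_true)
Y = 1 + L + A + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), K)
eif = np.empty(n)

for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    # Nuisance fits on the training folds only: a least-squares outcome
    # model among the treated, and a crude linear-probability propensity
    # (kept deliberately simple; double robustness absorbs its error).
    tr1 = train[A[train] == 1]
    Xtr = np.column_stack([np.ones(len(tr1)), L[tr1]])
    b_mu = np.linalg.lstsq(Xtr, Y[tr1], rcond=None)[0]
    Xg = np.column_stack([np.ones(len(train)), L[train]])
    b_g = np.linalg.lstsq(Xg, A[train], rcond=None)[0]

    # Predictions on the held-out fold only.
    Xf = np.column_stack([np.ones(len(fold)), L[fold]])
    mu_hat = Xf @ b_mu
    g_hat = np.clip(Xf @ b_g, 0.05, 0.95)

    # Uncentered EIF of E[Y(1)] evaluated on held-out data.
    eif[fold] = A[fold] / g_hat * (Y[fold] - mu_hat) + mu_hat

theta_hat = eif.mean()
se = eif.std(ddof=1) / np.sqrt(n)
```

Because no unit's data enter the training of the models used to predict for that unit, no Donsker condition on the regression estimators is needed.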
Then, we have n^{1/2}(θ̂ − θ) converging in distribution to N(0, σ²), where σ² = Var[S(X; η)] is the non-parametric efficiency bound.
Note that the weak convergence of θ̂ requires consistent estimation of the sequential regression functions at the rates stated in the theorem. Data-adaptive regression methods avoid reliance on parametric assumptions to achieve consistent estimation. Theorem 3 allows the construction of confidence intervals of the form θ̂ ± z_{1−α/2} σ̂/√n, where σ̂² is the empirical variance of S(X; η̂) and z_{1−α/2} is the corresponding quantile of the standard normal distribution.
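Given estimated EIF values, the interval above is a one-liner; here S is a stand-in array for S(X_i; η̂) (in practice produced by the cross-fitted algorithm), so only the variance and quantile mechanics are shown.

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in for the estimated (uncentered) EIF values S(X_i; eta_hat);
# in practice these come out of the cross-fitting algorithm.
S = rng.normal(loc=0.8, scale=2.0, size=1000)

n = len(S)
theta_hat = S.mean()          # EIF-based point estimate
sigma2 = S.var(ddof=1)        # empirical variance of the EIF
z = 1.96                      # approximate 97.5% standard normal quantile
half_width = z * np.sqrt(sigma2 / n)
ci = (theta_hat - half_width, theta_hat + half_width)
```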
The main difference between the construction in Algorithm 1 and Algorithm S3 is the introduction of sample splitting and of formulas to compute standard errors. Sample splitting requires careful handling of the different training sets and prediction sets to ensure that the data for unit i are not used in the training of the regression functions used to make predictions for that unit. This allows the analysis of the estimators by conditioning on the training sample and then "averaging" across training samples to obtain the final result.

Numerical studies
In this section, we perform a numerical simulation study with three main, narrow but important, objectives: (1) to test the implementation of the algorithms in the R package lcm and demonstrate that the package produces correct answers, in the sense that the estimator performance matches what is expected from theory; (2) to illustrate the performance of the estimator in the practical use case where machine learning is used for the estimation of the nuisance quantities; and (3) to illustrate the sequential double robustness of the estimator, which allows the algorithm to recover from inconsistent estimation of the outcome regression.
We achieve these aims by constructing a Monte Carlo simulation as follows. First, we generate datasets with τ = 2 time points from a distribution with binary A_t and Y, and where L_t, Z_t, and M_t are categorical variables that can take on three levels each. All conditional probability distributions are multinomial logistic functions with K categories of the history vector h_{W,t,2}, where W is one of A_t, Z_t, M_t, L_t, and Y, and h_{W,t,2} is a vector with all main terms and two-way interactions of h_{W,t}. For A_t and Y, we set K = 2, and for L_t, Z_t, and M_t, we set K = 3. The coefficients β_k used in the simulation may be found on GitHub along with all other simulation code.
In principle, estimation could be carried out using a non-parametric maximum-likelihood estimator (NPMLE) because the data are categorical. However, the data-generating mechanism comprises 2³ × 3⁶ = 5,832 categories, which would imply prohibitive sample sizes to obtain an NPMLE with good properties. We therefore use our proposed estimators to compute estimates of the direct and indirect effects with ā′ = 1 and ā⋆ = 0, using flexible regression methods. We generated 1,000 datasets of sizes n ∈ {500, 1,000, 5,000} and applied the proposed estimators. We then approximated quantities such as bias and coverage probabilities as empirical quantities across the simulated datasets.
To achieve the first objective of this numerical study, all of the estimators of the nuisance parameters in η consist of saturated logistic regression with an ℓ1 regularization penalty, where the regularization parameter is chosen with cross-validation. This estimator is a particular case of the highly adaptive lasso estimator (HAL [51]), which has been proven to converge at the rates required by Theorem 3. This implies that, under this estimation strategy, a correct implementation of our estimator should yield bias that converges to zero at the n^{−1/2} rate, as well as confidence intervals with nominal coverage. The results of this simulation are presented in Table 1.
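For intuition, an ℓ1-penalized logistic regression can be fit by proximal gradient descent with soft-thresholding. The from-scratch sketch below is only a stand-in for the saturated-basis HAL fit used in the simulation (in R one would use packages such as hal9001 or glmnet); the step size and iteration count are illustrative.

```python
import math

def soft_threshold(x, t):
    """Proximal operator of t * |.| (soft-thresholding)."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def l1_logistic(X, y, lam, steps=2000, lr=0.1):
    """ISTA for: minimize mean logistic log-loss + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            pred = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for j in range(p):
                grad[j] += (pred - yi) * xi[j] / n
        beta = [soft_threshold(b - lr * g, lr * lam) for b, g in zip(beta, grad)]
    return beta
```

The penalty λ would then be chosen by cross-validation over a grid, as the text describes.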
The results in Table 1 illustrate that the bias indeed converges to zero at the n^{−1/2} rate, as expected. Nonetheless, the coverage of the confidence intervals is larger than the nominal value. This is due to the use of a non-targeted estimator of the variance of the estimator; this phenomenon has been observed before for this type of estimator (e.g., [58,59]). Alternative variance estimators that alleviate this problem have been proposed in other contexts [60]; extending those estimators to our problem will be the subject of future work.
To achieve the second objective of illustrating the performance of the methods in a typical setting, under the use of data-adaptive regression methods common in the statistics and machine learning literature, we fit the regressions in η with an ensemble method known as the super-learner [61], which builds a combination of the algorithms in a user-supplied list. The weights in the combination are chosen using cross-validation, and the super-learner has been shown to have certain desirable oracle optimality properties [62]. As candidate learners, we chose two estimators: (i) a logistic or linear regression model (depending on whether the outcome takes values in {0, 1}) with only main terms, and (ii) a tree ensemble built using gradient boosting machines with hyper-parameters tuned using cross-validation. The results are presented in Table 2 and show good performance of the estimators in terms of both bias and coverage of the confidence intervals.
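The weighting step of such an ensemble can be caricatured as follows: given cross-validated predictions from each candidate, choose the convex combination that minimizes cross-validated loss. This grid-search sketch over two candidates is our own simplification; implementations such as the R packages SuperLearner and sl3 solve the weight optimization properly and over arbitrary candidate libraries.

```python
def stack_weights(cv_preds_a, cv_preds_b, y, grid=101):
    """Weight w on candidate A (and 1 - w on B) minimizing the
    cross-validated squared error over a coarse grid on [0, 1]."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(grid):
        w = i / (grid - 1)
        loss = sum((w * a + (1.0 - w) * b - yi) ** 2
                   for a, b, yi in zip(cv_preds_a, cv_preds_b, y))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w
```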
Note that neither of the candidates in the super-learner used in this simulation contains the true data-generating mechanism. However, the candidates in the library are the type of flexible estimators that we encourage users of our approach to use in practice. Therefore, this table serves as an illustration of estimator consistency in practical situations where the sequential outcome regressions are unknown but are estimated using flexible regression. We caution, however, that the performance of confidence intervals can vary wildly under model misspecification, that our theoretical results do not guarantee correct coverage under general forms of model misspecification, and that our simulation results for coverage and √n-bias should not be taken as indicative of general performance.

To achieve the third objective of illustrating the sequential double robustness of the estimator, we fit the regressions for the outcome Y using a logistic regression with main terms (inconsistent), the models for A_2 and M_2 using HAL (consistent), the models for L_2 and Z_2 using HAL (consistent), and the models for A_1 and M_1 using logistic regression with main terms (inconsistent). The results in Table 3 illustrate that this scenario leads to estimators with low large-sample bias (possibly consistent) but higher √n-bias (possibly not √n-consistent), as predicted by theory. As with the results of Table 2, the approximately correct coverage observed in this simulation should not be expected in general misspecification situations.

Illustrative application
We applied our proposed estimators to a longitudinal mediation question from a comparative effectiveness trial of XR-NTX vs BUP-NX for the treatment of OUD [63]. Specifically, we were interested in estimating the extent to which differences in use of illicit opioids during the first month of treatment between the two medications were due to mediation by self-reported craving of opioids, among those completing the detoxification requirement and initiating treatment. This involved estimating the interventional direct effect of being treated with XR-NTX vs BUP-NX on risk of using illicit opioids during the first 4 weeks of treatment, not operating through differences in craving of opioids, and the interventional indirect effect of being treated with XR-NTX vs BUP-NX on risk of using illicit opioids during the first 4 weeks of treatment that did operate through differences in craving. Patients report less opioid use when on XR-NTX vs BUP-NX (as well as methadone) [64,65], but the reasons underlying this difference are not well understood. In this study, patients were randomized to receive XR-NTX or BUP-NX. At the time of randomization, a large number (over 30) of baseline covariates, denoted L_1 (listed in the supplemental materials), were measured. Although patient assignment to XR-NTX vs BUP-NX was randomized, we are estimating effects only among those who initiated treatment. Treatment initiation is not randomized and likely depends on patient characteristics. Initiation is also more difficult for those assigned to XR-NTX, because it requires complete detoxification [63]. Our adjustment for an extensive set of possibly confounding variables (L_1) helps address the lack of randomization in the exposure groups. We use A_1 = 1 to denote initiation with XR-NTX and A_1 = 0 to denote initiation with BUP-NX. The outcome of this study is opioid use as detected by weekly urine drug screen or timeline follow-back interview [66].
We use L_t to denote opioid use measured at week t + 1 for t ∈ {1, 2, 3}, which detects use in the several days prior (via urine drug screen) to the week prior (via interview), and where t represents the number of weeks since randomization. We hypothesized that if XR-NTX reduces craving more than BUP-NX, this could provide a partial explanation of the lower opioid use while on XR-NTX. We use M_t to denote craving during week t + 1 since randomization, for t ∈ {1, 2, 3}. There may also be differences in depressive symptoms [67] and withdrawal symptoms [68] between the two medications, so we incorporated measures of each (Hamilton depression scale, subjective opioid withdrawal scale; see [69,70]) as time-varying confounders. We use Z_t to denote these confounders measured during week t + 1 for t ∈ {1, 2, 3}. In addition, patients could drop out of treatment or otherwise be lost to follow-up starting in week 3 after randomization. Importantly, the outcomes L_t are always observed, as it is assumed that a patient who has missing opioid-use data (most likely due to a missed visit) would have been positive for opioid use [71,72]. For t ∈ {2, 3}, we use A_t = 1 to denote that a patient's Z_t and M_t have been measured, and we let A_t = 0 otherwise. Thus, our intervention variable, Ā = (A_1, A_2, A_3), is given by a combination of treatment and censoring. In summary, we can write our observed data for this example in the general longitudinal structure X = (L_1, A_1, Z_1, M_1, L_2, A_2, Z_2, M_2, L_3, A_3, Z_3, M_3, Y). Note that the assumption of a negative (or positive) outcome for patients lost to follow-up may not be sensible in all applications. In such cases, we encourage the use of a loss-to-follow-up indicator variable and an intervention on loss-to-follow-up as described in Section 2.
We estimated the interventional direct effect of initiating treatment with XR-NTX vs BUP-NX on illicit opioid use during week 4 of treatment, not operating through craving, under a hypothetical intervention to prevent dropout. That is, for ā′ = (1, 1, 1) and ā⋆ = (0, 1, 1), we estimated E[Y(ā′, G(ā⋆)) − Y(ā⋆, G(ā⋆))]. We also estimated the interventional indirect effect of initiating treatment with XR-NTX vs BUP-NX on illicit opioid use during week 4 of treatment operating through craving, had those who dropped out not dropped out. That is, we estimated E[Y(ā′, G(ā′)) − Y(ā′, G(ā⋆))]. We used an ensemble of machine learning algorithms to fit the nuisance parameters. The ensemble included an intercept-only model, the lasso, multiple additive regression splines, and extreme gradient boosting. The weights in the ensemble were chosen using super-learning [61]. We used cross-fitting with five folds.
For the interventional direct effect, we estimated that initiating treatment with XR-NTX vs BUP-NX, not operating through craving, would reduce risk of using illicit opioids during week 4 of treatment by 8.8 percentage points (risk difference: −0.088, 95% CI: −0.129, −0.048). For the interventional indirect effect, we estimated that initiating treatment with XR-NTX vs BUP-NX, operating through craving, would not meaningfully decrease risk of using illicit opioids during week 4 of treatment (risk difference: −0.004, 95% CI: −0.019, 0.011). Thus, we conclude that reductions in risk of illicit opioid use during treatment with XR-NTX vs BUP-NX are not due to differences that operate through the treatments' effects on craving.

Discussion
Our approach generalizes interventional causal effects to allow for high-dimensional time-dependent variables measured post- and pre-treatment. We present an estimation algorithm that leverages machine learning to alleviate misspecification bias while retaining statistical properties such as √n-consistency, non-parametric efficiency, and asymptotic normality.
While this approach allows great flexibility in the data structure and estimation method, some limitations remain. We assume that the mediator is categorical, and computational tractability of our proposed estimator requires that it take values in a small set. This limitation seems fundamental and hard to overcome within an interventional effect framework, as all estimators will require the estimation of the density of the counterfactual variable M(ā). We know of no method that can do this non-parametrically in the case of a continuous or high-dimensional variable M, although recent approaches to the estimation of counterfactual densities in single time-point settings are promising [73].
In addition, interventional effects have some limitations. First, they do not decompose the average treatment effect E[Y(ā′) − Y(ā⋆)]. Second, the interventional indirect effect can be non-zero even when there are no paths of the type A → M → Y in the DAG in Figure 1 [30]. Solving this limitation would require a different framework for mediation analysis. We proposed such a framework for single time-point exposures in ref. [74], and future work will include generalizing that framework to the case of time-varying exposures and mediators.