Skip to content
Publicly Available Published by De Gruyter January 4, 2022

Doubly robust adaptive LASSO for effect modifier discovery

  • Asma Bahamyirou , Mireille E. Schnitzer ORCID logo EMAIL logo , Edward H. Kennedy , Lucie Blais and Yi Yang

Abstract

Effect modification occurs when the effect of a treatment on an outcome differsaccording to the level of some pre-treatment variable (the effect modifier). Assessing an effect modifier is not a straight-forward task even for a subject matter expert. In this paper, we propose a two-stageprocedure to automatically selecteffect modifying variables in a Marginal Structural Model (MSM) with a single time point exposure based on the two nuisance quantities (the conditionaloutcome expectation and propensity score). We highlight the performance of our proposal in a simulation study. Finally, to illustrate tractability of our proposed methods, we apply them to analyze a set of pregnancy data. We estimate the conditional expected difference in the counterfactual birth weight if all women were exposed to inhaled corticosteroids during pregnancy versus the counterfactual birthweight if all women were not, using data from asthma medications during pregnancy.

1 Introduction

Effect modification occurs when the effect of a treatment on an outcome differs according to the level of some pre-treatment variables (the effect modifier, EM). Detecting variables that are EMs is not a straight-forward task even for a subject matter expert. A natural way to assess effect modification in experimental and observational studies is to perform subgroup analysis, in which observations are stratified based on the potential EMs after which stratum-specific estimates are calculated, though this becomes infeasible with a greater number of potential effect modifiers. One can also include the interaction terms between the treatment and the potential EMs in an outcome regression analysis. With observational data however, this approach does not target a parameter of a marginal structural model (MSM) unless a correct model for the outcome conditional on confounders, treatments, and EMs is specified. In contrast, MSMs can provide a summary of how effect modification occurs in the absence of confounding. Different methods for the estimation of effect modification have been proposed recently. For example, Green and Kern [1] used Bayesian Additive Regression Trees (BART) [2] to model the conditional average treatment effects (CATE). Imai and Ratkovic [3] studied EM selection by adapting the support vector machine classifier. Nie and Wager [4] developed a two-step algorithm for heterogeneous treatment effect estimation using the marginal effects and treatment propensities. Lue et al. [5], used dimension reduction techniques to learn heterogeneity by estimating a lower dimensional linear combination of the covariates that is sufficient to model the regression causal effects. Wager and Athey [6] proposed a nonparametric approach for estimating heterogeneous treatment effects using a random forest algorithm [7]. Powers et al. [8], developed an algorithm for heterogeneous treatment effect estimation by adapting the multivariate adaptive regression splines [9]. Zhao et al. [10] introduced an algorithm based on a semiparametric model that selects the EMs by using Robinson’s transformation [11] and Least Absolute Shrinkage and Selection Operator (LASSO). Doubly robust semiparametric methods such as Targeted Minimum Loss-Based Estimation (TMLE) [12, 13], which is closely related to previously existing methods [14, 15] have been proposed. The term doubly robust comes from the fact that the method requires both the estimation of the treatment model and the outcome expectation conditional on treatment and covariates, where only one of which needs to be correctly modeled to allow for consistent estimation of the parameter of interest. However, in a situation where one nuisance parameter is inconsistently estimated, the asymptotic linearity is affected [16]. Lee et al. [17] developed a doubly robust estimator of the CATE along with a uniform confidence band. Rosenblum and van der Laan [13] developed TMLE for MSMs, which can be used to model effect modification, in non-longitudinal settings. Zheng et al. [18] developed TMLE for MSMs with counterfactual covariates in longitudinal settings. Most recently, Kennedy [19] analyzed a version of the pseudo-outcome regression method for CATE estimation and derives model-free error bounds.

In this paper as in [10], we focus on the selection of pre-treatment EMs in a linear MSM for the CATE with a single treatment time-point. Thus, we consider modifiers of the additive effect of a treatment on the mean outcome. We use a component of the efficient influence function of the ATE along with the Adaptive LASSO (Zou, 2006) to select EMs. To the best of our knowledge, our paper is one of the first along with [19, 20] to investigate and apply a doubly robust two-stage regularization for a CATE model. Our estimation approach can be carried out with standard software implementations, is doubly robust (unlike [10]), can accommodate adaptive methods to estimate the nuisance quantities, and produces estimates of the parameters of an easily interpretable model. A two-stage procedure is thus proposed. First, we estimate two nuisance quantities (the conditional outcome expectation and treatment model) and plug these quantities into a specific function to create a pseudo outcome as developed in [2123]. Second, we take the pseudo outcome and apply the adaptive LASSO [24] to select the EMs and estimate the MSM coefficients. We then apply post-selection inference in order to produce interpretable confidence intervals after the EM selection by adaptive LASSO. We perform simulation studies in order to verify the performance (selection, estimation, double robustness, and post-selection inference) of the proposed method.

The remainder of this article is organized as follows. In Section 2, we use the potential outcomes framework to define the target causal parameter of interest and describe our proposed estimation approach. In Section 3, we conduct a simulation study to verify the performance (selection, MSM coefficient estimation, and double robustness) of the proposed method in both low and high dimensional settings. We present an analysis of the safety of asthma medications during pregnancy in Section 4. A discussion is provided in Section 5.

2 Methods

In this section, we present our development of the methodology for the selection of the EMs.

2.1 The framework

The observed data, { ( W i , A i , Y i ) } i = 1 n , are comprised of independent and identically distributed samples of O = ( W , A, Y) ∼ P 0, where W is the baseline covariates of a patient, A is the binary treatment which equals 1 if the patient received treatment and 0 otherwise, and Y is the observed outcome (binary or continuous). Let V represent the subset of the variables in W that represents the potential EMs of interest. We use O i = ( W i , A i , Y i ) to represent the ith observation of the data. In order to define the target parameter, we use the counterfactual framework of Rubin [25]. Let Y a denote the potential (or counterfactual) outcome that would have occurred under the treatment value A = a. In this paper, we focus on marginal models for the CATE. If we assume that we observe Y = Y a when A = a (consistency [26], no interference, positivity and no unmeasured confounders [27]), the CATE can be defined and identified nonparametrically as:

(1) ψ 0 ( V ) = E 0 { Y 1 Y 0 | V } = E W | V { E 0 ( Y | A = 1 , W ) Q ̄ 0 ( 1 , W ) E 0 ( Y | A = 0 , W ) Q ̄ 0 ( 0 , W ) | V } = E W | V { Q ̄ 0 ( 1 , W ) Q ̄ 0 ( 0 , W ) | V }

where E 0 is the expectation with respect to the outcome and E W|V is the expectation conditional on the baseline covariates. In this work, we choose to model the CATE using a linear regression model defined as ψ ̃ 0 ( V ) = β 0 + V T β V where the relevant subset of V will be selected using adaptive LASSO [24]. Our goal here is to identify the true EMs among the set V , and estimate their associated coefficients. One could use non-linear models or machine learning methods to estimate ψ ̃ 0 ( V ) , which is important when the goal is prediction [37] (e.g. for personalized medicine). However, if interpretation of the coefficient associated with each V (s) is important, it may be beneficial to use a linear model rather than a black box approach [28].

2.2 Adaptive LASSO

The adaptive LASSO [24] is an extension of the traditional LASSO of Tibshirani [29] that uses coefficient specific weights. Zou [24] showed that the adaptive LASSO estimator has the oracle property which roughly means that the algorithm identifies the right subset of variables (consistency of variable selection) and that the coefficient estimators of the selected variables are asymptotically normal. In a prediction (non-causal) setting, let Y be an observed outcome and V a set of covariates. Under the linear model, we can select predictors of Y by solving the equation below:

(2) arg min α , β i = 1 n Y α V i T β 2 + λ j = 1 p w ̂ j | β j |

where β = β 1 , , β p , w ̂ j = 1 / | β ̃ j | γ , for some γ > 0 and β ̃ j is a n -consistent estimator of β j . The selected variables are the positions of the non-zero entries of the solution of (2). When the sample size grows, the weights associated with the zero-coefficient predictors tend to infinity, while the weights corresponding to true predictors converge to a constant. Thus, true-zero coefficients are less likely to be selected by the adaptive LASSO than by the standard LASSO, which does not have the oracle property [24].

2.3 Highly adaptive LASSO (HAL)

Assume E(Y|V) a regression function where Y is the observed outcome and V is the set of covariates. Consider a map of V onto a set of binary indicator basis functions. For example, if V is scalar, we generate for an observation v, ϕ * ( v ) = ( ϕ 1 * ( v ) , , ϕ n * ( v ) ) T , where ϕ i * ( v ) = I ( v V i ) , for i = 1, …, n. With two dimensions, V = ( V ( 1 ) , V ( 2 ) ) T , we need to include the second order basis functions ϕ i * ( v ) = I ( v 1 V i ( 1 ) , v 2 V i ( 2 ) ) , for i = 1, …, n. The HAL estimator [30] is obtained by fitting a L 1-penalized regression of the outcome Y on these basis functions, with the optimal L 1-norm chosen via cross-validation. The HAL estimator of the regression function E(Y| V ) converges to the true regression function in L 2-norm no slower than n −1/4 regardless of the dimension of V , under the assumption that the regression function has bounded variation norm.

2.4 Selective inference

Let β ̂ be the solution of (2) and β ̂ M ̂ the non-zero subvector of β ̂ where M ̂ { 1 , , p } corresponds to the positions of the non-zero entries. Suppose that we are interested in making inference for β ̂ M ̂ in the prediction model of Section 2.2. A naive way to obtain inference after selecting the covariates in the model is the standard hypothesis tests for linear regression that treat M, representing the non-zero entries of β ′ and thus the true model, as known. It is easy to see that β ̂ depends on the selected model M ̂ . Therefore, Lee et al. [31] studied the conditional distribution β ̂ M | { M ̂ = M } and showed that this conditional distribution is a truncated normal Gaussian. They constructed a pivotal statistic for β ̂ M ̂ which can be used for hypothesis testing and therefore by test inversion, to construct a confidence interval. Let F(yμ, σ 2, l, u) be the CDF of a normal N(μ, σ 2) truncated to the interval [l, u], e j the unit vector for the jth coordinate so that β ̂ M j = η M T Y , η M = V M T V M 1 V M T T e j and σ * 2 = σ 2 η M T η M . In the linear regression setting where Y N μ , σ 2 I n , Lee et al. [31] showed that F ( β ̂ M j ; β M j , σ * 2 , ν , ν + ) | { M ̂ = M } Unif ( 0,1 ) , where [ν , ν +] is defined in [31] as a function of Y and the model M. By inverting the hypothesis testing, we can find a (1 − α) confidence interval for β ̂ M j , conditional on M ̂ = M , by finding [L*, U*] such that

F ( β ̂ M ̂ j ; L * , σ ̂ * 2 , ν , ν + ) | { M ̂ = M } = 1 α / 2

and

F ( β ̂ M ̂ j ; U * , σ ̂ * 2 , ν , ν + ) | { M ̂ = M } = α / 2

In this next section, we will explain how this result is applied in our setting.

2.5 The model

2.5.1 Model definition

Let ψ 0( V ) = E 0{Y 1Y 0| V } be the CATE. Denote Q ̄ 0 ( a , W ) = E 0 ( Y | A = a , W ) , the outcome expectation, and g 0(a| W ) = P(A = a| W ) as the propensity score. We suggest to use the doubly robust and efficient loss-function proposed by van der Laan [21], inspired by Rubin and van der Laan [32], L Q 0 , g 0 ( ψ ) ( O ) = ( D ( Q ̄ 0 , g 0 ) ( O ) ψ 0 ( V ) ) 2 where

(3) D ( Q ̄ 0 , g 0 ) ( O ) = 2 A 1 g 0 ( A | W ) ( Y Q ̄ 0 ( A , W ) ) + Q ̄ 0 ( 1 , W ) Q ̄ 0 ( 0 , W )

is indexed by the nuisance parameters ( Q ̄ 0 ; g 0 ) . A similar pseudo-outcome is also used in Zhao et al. [22] for estimating optimal individualized treatment rules and Kennedy et al. [23] for the estimation of continuous treatment effects.

The next lemma shows that if one of the two nuisance quantities are consistent, the CATE can be obtained by the conditional expectation of the estimated pseudo-outcome.

Lemma 1

Let f 2 , P 0 2 = f ( z ) 2 d P 0 ( z ) denote the L 2(P) norm. Suppose either Q ̄ n converges to Q ̄ 0 or g n converges to g 0 in the sense that E Q ̄ n Q ̄ 0 2 = o ( 1 ) or Eg n g 02 = o(1) (not necessarily both). Then E ( D ( Q ̄ n , g n ) ( O ) | V ) ψ 0 ( V ) as n → ∞.

The preceding lemma shows that the pseudo-outcome we propose for the CATE is doubly-robust in the sense that if at least one nuisance estimator ( Q ̄ n or g n ) converges to the correct function, but not necessarily both, then a regression of the pseudo-outcome onto the effect modifiers will be consistent for the CATE. Adding and subtracting the true CATE is the key idea to prove Lemma 1. Then, the regression function of the pseudo-outcome on V can be split into two terms: the true CATE and a second term that is a function of both Q ̄ n Q ̄ 0 and g n g 0. See the Appendix for the Proof of Lemma 1.

Suppose that an investigator would like to identify the true EMs amongst multiple suspected effect modifying variables V = (V (1), …, V (p)). As described above, to accomplish this we use a linear model for the CATE with corresponding MSM defined as ψ ̃ 0 ( V ) = β 0 + V T β V under a least squared error loss function. We then use the adaptive LASSO estimator [24] to select amongst the V (j)s. More specifically, as suggested by Rubin and van der Laan [33], we penalize the aforementioned loss function L Q ̄ 0 , g 0 by the adaptive LASSO penalty. Let D n = D ( Q ̄ n , g n ) ( O ) be the estimated pseudo outcome. The parameters of the MSM β = (β 0, β 1, …, β p ) are estimated by minimizing the risk function below:

(4) β ̂ = arg min β i = 1 n ( D i , n ψ ̃ 0 ( V i ) ) 2 + λ j = 1 p w ̂ j | β j |

where ω ̂ j = 1 / | β ̃ j | γ , for some γ > 0 and β ̃ j is a n -consistent estimator of β j .

An optimal method would possess the oracle property, able to select the appropriate variables and unbiasedly estimate the selected parameters. Let A be the set of true variables in the model and A n * be the set selected using adaptive LASSO.

Lemma 2

Let D n = D ( Q ̄ n , g n ) be the estimated pseudo-outcome conditional on the estimated nuisance functions. Assume E(D n | V ) = β 0 + V T β V and |A| = p 0 < p. Suppose that λ / n 0 and λn (γ−1)/2 → ∞. Also, assume D n is obtained by cross-fiting and is consistent in the sense that it belongs to a shrinking neighborhood of D 0 as given in Assumption 3.5 in Semenova and Chernozhukov (2020) [20]. The proposed estimator β ̂ inherits the adaptive LASSO oracle properties, i.e.

  1. Consistency in variable selection (i.e. identifies the right subset model):

lim n P A n * = A = 1 .
  1. Asymptotic normality (i.e. has the optimal estimation rate): n ( β ̂ A β A ) d N ( 0 , Σ * ) , where Σ* is the covariance matrix knowing the true subset model and β ̂ A is the coefficient estimates resulting from the Adaptive LASSO regression of D n on V.

As a consequence, our proposed estimator is able to select the correct subset of EMs and produce an unbiased estimate of the MSM coefficients in large samples. This relies on convergence of D n to D 0 which can result from correct specification of the models for g n and/or Q ̄ n [20]. See the Appendix for the Proof of Lemma 2.

2.5.2 Estimation

In this paragraph, we describe how our proposal can be easily implemented in a two-stage procedure. In the first stage, we construct the pseudo-outcome function by producing estimates Q ̄ n ( a , W ) and g n (a| W ) of the two nuisance quantities and plugging them into D. Machine Learning (ML) methods are often recommended [13] for estimating Q ̄ n and g n . In the second stage, we run the adaptive LASSO regression of the estimated pseudo-outcome D ( Q ̄ n , g n ) ( O ) on the set V . The selected EMs correspond to the non-zero coefficients of the adaptive LASSO regression.

The proposed algorithm for estimating the parameters in the CATE model with a given value of λ is as follows:

Algorithm 1:

Effect modifiers adaptive LASSO algorithm.

1: Estimate the outcome expectation Q ̄ n ( a , W ) = E ̂ ( Y | A = a , W ) for each subject.
2: Obtain the estimated propensity score g n ( a | W ) = P ̂ ( A = a | W ) for each subject.
3: Construct an estimate of the doubly robust function D n by plugging in the estimated Q ̄ n and g n .
4: Select the effect modifiers by following steps (a)–(d) below:
 (a) Run a linear regression of D n on V as the set of covariates. Obtain β ̃ j , the estimated coefficient of V (j), j = 1, …, p.
 (b) Define the weights ω ̂ j = 1 | β ̃ j | γ , j = 1, …, p for some γ > 0.
 (c) Run a LASSO regression of D n on V with ω ̂ j as the penalty factor associated with V (j) with a given λ.
 (d) The non-zero coefficients of the solution of the adaptive LASSO regression { β ̂ j } j = 1 p are the selected effect modifiers.
5: The final estimate of the CATE is ψ n ( V ) = β ̂ 0 + j = 1 p V ( j ) β ̂ j .

For the adaptive LASSO tuning parameters, we choose γ = 1 (Nonnegative Garotte Problem [34]) and λ is selected using cross-validation as suggested by Zou [24]. The traditional cross-validation minimizes the prediction error knowing the true outcome. In our setting, the Adaptive LASSO is run with the estimated pseudo-outcome as the “true” outcome. We conjecture that if the two nuisance parameters are consistently estimated at fast enough rates, we should be able to use the estimated pseudo-outcome to find an optimal tuning parameter. This conjecture agrees with recent results from Kennedy (2020) [19]. Naive inference by ignoring the EM selection would result in incorrect confidence intervals. Zhao et al. [10] showed that when the outcome is observed with error, the selective pivotal statistic proposed by Lee et al. [31] is still asymptotically valid. Thus we apply their methodology which is expected to produce valid asymptotic results as long as Q ̄ n is consistent and both Q ̄ n and g n converge faster than at a n 1/4 rate in the l 2 norm [35]. In order to construct a selective 95%-confidence intervals for the selected submodel, we use the R package selectiveInference [36] for post-selection inference. The estimated σ ̂ 2 used in the package is the variance of the residual from fitting the full model in 4(a).

3 Simulation study

3.1 Data generation and parameter estimation

To evaluate the performance of the proposed method in finite samples, we conducted a simulation study under four scenarios. We simulated data O = ( W , A, Y) representing baseline covariates W , a binary exposure A, and a continuous outcome Y. The baseline covariates W include three confounders (X, V (1), V (2)), one instrument Z (pure cause of treatment), and two pure causes of the outcome (V (3), V (4)). All covariates were generated independently with the Bernoulli distribution with success probability p: XB(p = 0.4), V (1)B(p = 0.5), V (2)B(p = 0.6), V (3)B(p = 0.5), V (4)B(p = 0.7) and ZB(p = 0.45).

We varied the strength of the relationship between covariates, outcome and treatment across three low-dimensional scenarios. In the first, we used an outcome model where the covariates were strongly predictive, and a treatment model where the covariates were weakly predictive. The treatment mechanism g 0 was set as a Bernoulli with the probability generated linearly in the three confounder variables and single instrument,

P 0 ( A = 1 | X ) = e x p i t { 0.5 Z 0.2 X + 0.3 V ( 1 ) 1 + 0.4 V ( 2 ) }

where expit(x) = 1/{1 + exp(−x)}. The observed continuous outcome Y was linearly generated as:

Y = 1 + A 0.5 X + 2 V ( 1 ) + V ( 2 ) + V ( 3 ) 0.2 V ( 4 ) + 4 V ( 1 ) V ( 2 ) V ( 3 ) + A ( 0.5 V ( 1 ) + V ( 3 ) ) + N ( 0,1 )

The effect modification arises due to interaction between treatment and covariates.

The second scenario has the same data generation except that the coefficient of the interaction term V (1) V (2) V (3) is 0 instead of 4. In the third scenario, we use an outcome model where the covariates are weakly predictive, and a treatment model where the covariates are strongly predictive. We focus here on the first scenario and describe all other simulations settings and results in the Appendix.

We thus have two EMs (V (1), V (3)), where the first is a confounder and the second is a pure cause of the outcome. In practice, we are not aware of the true data generating mechanism. So we have a potential set of EMs: V = (V (1), V (2), V (3), V (4)). Let ψ 0 ( V ) = E P 0 ( Y 1 Y 0 | V ) be the true (nonparametric) CATE, which we model as an MSM: ψ ̃ 0 ( V ) = β 0 + β 1 V ( 1 ) + β 2 V ( 2 ) + β 3 V ( 3 ) + β 4 V ( 4 ) . Our goal here is to identify among the set V , the true EMs and estimate their associated coefficients. Given the data generated, the true values of the coefficients are β v = (0.5, 0, 1, 0). We set n = 1000 and then 10 000. We also add a smaller sample size n = 100 with results in the appendix.

To evaluate the performance of our method in high-dimensional settings, we also extend the first scenario by adding 50 pure binary noise covariates (unrelated to treatment or outcome) to our set of covariates, which are included as potential confounders and EMs. The true values of the coefficients in the MSM are thus β v = (0.5, 0, 1, 0, …, 0).

Under each low-dimensional scenario, we tested our proposed method under four different implementations:

  1. Qcgc: Both of the models for Q ̄ and g are correctly specified using generalized linear models (GLMs).

  2. Qc: Only the GLM for Q ̄ is correctly specified. g is misspecified using a logistic regression of treatment A on variable X.

  3. gc: Only the GLM for g is correctly specified. Q ̄ is misspecified using a GLM of treatment Y on variables A and V (3).

  4. HAL: Both Q ̄ and g are estimated using the Highly Adaptive LASSO (HAL) [30, 37]. We use the package default setting.

For comparison, we also tested two implementations of a linear regression model for the outcome to directly assess effect modification:

  1. NLin: Linear regression with main terms (treatment and all covariates) and interactions between treatment and covariates. Only first-order interactions were included.

  2. CLin: Linear regression with a correctly specified outcome model.

Standard confidence intervals are presented for the linear model case and, in our summary, a p-value of less than 0.05 is used as a criterion for a variable to be selected.

In the higher dimensional scenario, only HAL was used to estimate Q ̄ and g.

3.2 Simulation results

For each scenario, we produced boxplots of the MSM coefficient estimates. We also present the percent selection, the coverage proportion of the confidence intervals and the false coverage rate in order to summarize the average performance of each estimator and implementation. The percent selection for our LASSO method was obtained as the percentage of estimated coefficients that are non-zero throughout the 1000 generated datasets, and for the linear regression, the percentage of p-values < 0.05 . The coverage for each true effect modifier was obtained as the number of times the true model was selected and the corresponding confidence intervals contained the true coefficients, divided by the number of times the true model was selected. For the linear regression, the percent coverage was instead calculated for each coefficient and defined as the proportion of the confidence intervals that contained the true coefficient throughout the 1000 generated datasets. The false coverage rate (FCR) for our LASSO model was obtained as the number of non-covering confidence intervals among the selected coefficients, divided by the number of the selected coefficients throughout the 1000 generated datasets [31].

For the first low-dimensional scenario, Figures 1 and 2 contain the boxplots of the MSM coefficient estimates for the true EMs V ( 1 ) , V 3 and non-EMs (V (2), V (4)), respectively. Table 1 (in the Appendix) contains the numerical results. As shown in the first two boxplots in Figures 1 and 2, the implementations (1) Qcgc and (2) Qc performed very well. We obtained unbiased estimates and confidence interval coverage that tended to be around 95% as sample size increased as shown in Figure 5. The FCR was close to the optimal 0.05. In the third boxplot, corresponding to implementation (3) gc, where only the propensity score was correctly specified, the estimator was more biased for both sample sizes but had higher coverage rates and lower FCR. In the fourth boxplot where the estimator was implemented with HAL, the estimator performed well across all measures. Overall, as shown in Figure 5, the percent coverage of the true EMs V ( 1 ) , V 3 was around the nominal 95% as sample size increased or when at least the outcome model was correclty specified or machine learning methods were used to estimate both nuisance parameters. In all implementations the true effect-modifiers (V (1), V (3)) were selected around 100 percent of the time except when only the propensity score was correctly specified for the smaller sample size (gc). The percent selection of variables that are not effect-modifiers (V (2), V (4)) was around 20% for n = 1000. In implementations (1), (2), and (4), the percentage was almost halved for n = 10,000. The FCR was controlled around the nominal 0.05 level in all situations even when only one nuisance model was correctly specified. This supports the double robustness of the proposed estimator and the appropriateness of the post-selection confidence intervals. In implementation (5) NLin, the naive linear model with a misspecified term performed poorly, even when increasing the sample size. On the other hand, when the linear model was correctly specified in implementation (6) CLin, the coefficient estimates were unbiased on average and the coverage was near-optimal. For the two other data generating scenarios described at more length in the Appendix, the results (Tables 2 and 3) look similar to those in the first scenario.

Figure 1: 
Simulation results illustrations (data generating scenario 1). Box plots of 1000 MSM coefficients estimates for the true EMs 




V



(

1

)



,


V


3




${V}^{\left(1\right)},{V}^{3}$



. The true values of the coefficients are (0.5, 1). Notation: Qcgc: Models for 





Q

̄




$\bar{Q}$



 and g are correctly specified, Qc: 





Q

̄




$\bar{Q}$



 is correctly specified, gc: g is correctly specified, HAL: 





Q

̄




$\bar{Q}$



 and g are estimated with HAL, NLin: Naive linear model, CLin: Correct linear model. %sel: percent selection of a covariate × 100, %cov: coverage rate of the confidence interval of a coefficient estimate × 100, FCR: False coverage rate of the model × 100.
Figure 1:

Simulation results illustrations (data generating scenario 1). Box plots of 1000 MSM coefficients estimates for the true EMs V ( 1 ) , V 3 . The true values of the coefficients are (0.5, 1). Notation: Qcgc: Models for Q ̄ and g are correctly specified, Qc: Q ̄ is correctly specified, gc: g is correctly specified, HAL: Q ̄ and g are estimated with HAL, NLin: Naive linear model, CLin: Correct linear model. %sel: percent selection of a covariate × 100, %cov: coverage rate of the confidence interval of a coefficient estimate × 100, FCR: False coverage rate of the model × 100.

Figure 2: 
Simulation results illustrations (data generating scenario 1). Box plots of 1000 MSM coefficients estimates for the non-EMs 




V



(

2

)



,


V


4




${V}^{\left(2\right)},{V}^{4}$



. The true values of the coefficients are (0, 0). Notation: Qcgc: Models for 





Q

̄




$\bar{Q}$



 and g are correctly specified, Qc: 





Q

̄




$\bar{Q}$



 is correctly specified, gc: g is correctly specified, HAL: 





Q

̄




$\bar{Q}$



 and g are estimated with HAL, NLin: Naive linear model, CLin: Correct linear model. %sel: percent selection of a covariate × 100, %cov: coverage rate of the confidence interval of a coefficient estimate × 100, FCR: False coverage rate of the model × 100.
Figure 2:

Simulation results illustrations (data generating scenario 1). Box plots of 1000 MSM coefficients estimates for the non-EMs V ( 2 ) , V 4 . The true values of the coefficients are (0, 0). Notation: Qcgc: Models for Q ̄ and g are correctly specified, Qc: Q ̄ is correctly specified, gc: g is correctly specified, HAL: Q ̄ and g are estimated with HAL, NLin: Naive linear model, CLin: Correct linear model. %sel: percent selection of a covariate × 100, %cov: coverage rate of the confidence interval of a coefficient estimate × 100, FCR: False coverage rate of the model × 100.

Table 4 in the Appendix contains the results with the small sample size n = 100. The performance of the proposed methods decreased across all measures except for V (3) where there was a higher coverage rate when Q ̄ and g were correctly specified or estimated with HAL. The results of the high-dimensional setting are presented in Figures 3 and 4. Q ̄ and g were estimated with HAL. The estimates were taken over 100 generated datasets and look similar to Figures 1 and 2 for the covariates V = (V (1), V (2), V (3), V (4)) in common. For the noise covariate coefficients, the estimates, given in the density plot of Figure 4, were unbiased for 0. The noise covariates had a low percent selection (see Table 5). Using median statistics, the noise covariates were selected around 14% of the time and that proportion decreased to 13% as we increased the sample size. The FCR exceeded the nominal 5% level and was around 15%.

Figure 3: 
Simulation results for high-dimensional setting (data generating scenario 1). Box plots of MSM coefficients estimates over 100 simulations for the potentials EMs 
V
 = (V
(1), V
(2), V
(3), V
(4)). The true values of the coefficients are (0.5, 0, 1, 0). Notations: HAL: 





Q

̄




$\bar{Q}$



 and g are estimated with HAL, %sel: percent selection × 100, %cov: coverage rate × 100, FCR: False coverage rate × 100.
Figure 3:

Simulation results for high-dimensional setting (data generating scenario 1). Box plots of MSM coefficients estimates over 100 simulations for the potentials EMs V = (V (1), V (2), V (3), V (4)). The true values of the coefficients are (0.5, 0, 1, 0). Notations: HAL: Q ̄ and g are estimated with HAL, %sel: percent selection × 100, %cov: coverage rate × 100, FCR: False coverage rate × 100.

Figure 4: 
Illustrations for high-dimensional setting. Box plots of the MSM coefficients estimates over 100 simulations for the 50 noise covariates (for both n = 1000 and n = 10,000). The true values of the coefficients are (0, …, 0).
Figure 4:

Illustrations for high-dimensional setting. Box plots of the MSM coefficients estimates over 100 simulations for the 50 noise covariates (for both n = 1000 and n = 10,000). The true values of the coefficients are (0, …, 0).

In summary, Table 1 demonstrates that in low-dimensional settings, the proposed algorithm is able to produce unbiased estimates and control the FCR around the nominal level. In contrast, Table 5 demonstrates that in the context of high-dimensional covariates with many candidate EMs, the FCR is generally much larger than the nominal level. Similar results were obtained by Zhao et al. ([10], Figure 2). In addition, at least some non-EMs were always selected by the algorithm at the sample sizes investigated.

4 Data analysis: asthma medication during pregnancy

4.1 Data

Our data were obtained from a cohort (Firoozi et al. [38]) of deliveries of pregnant women with asthma in order to study the effect of using inhaled corticosteroids (ICS) during pregnancy on birth weight. The population of interest is pregnant women with mild asthma and a singleton delivery in Québec, Canada between 1998 and 2008, aged 45 years. For simplicity, We considered only the first delivery for each woman in this period. Asthma severity was defined according to an index that is based on the Canadian Asthma Consensus Guidelines (Cossette et al. [39]). A total of 4707 pregnancies in our database fell into this category. ICS exposure was classified in two categories: “use”(a woman who filled at least one prescription of ICS during pregnancy) and “no use”(a woman who did not fill any prescription of ICS during pregnancy). The outcome of interest is birth weight (continuous in kilograms). We identified a variety of maternal baseline variables. These potential confounders measured in the year before pregnancy include demographic characteristics (e.g. income security provider and place of residence), chronic diseases (e.g. hypertension and diabetes) and variables related to asthma (e.g. at least one hospitalization for asthma, at least one emergency department visit for asthma, and oral corticosteroids). We also included the cumulative daily dose of ICS in the year before pregnancy and sex of the newborn as potential confounders. A full list of measured potential confounders can be found in Table 6 in the Appendix. As we do not know which variables are effect modifiers, we included a wide range of variables in the set V , 22 variables in all. Specifically, these variables were: In the year before pregnancy: at least one dose of inhaled short-acting β 2-agonists (SABA) taken per week, medication for epilepsy, use of warfarin, use of beta blockers, asthma exacerbation, oral SABA use, oral corticosteroids, leukoteriene-receptor antagonists, intranasal corticosteroids, at least one hospitalization for asthma, at least one emergency department visit for asthma, and welfare recipient; At the start of the pregnancy: chronic obstructive disease, cyanotic heart disease, obesity, uterine disorder, antiphospholipid syndrome, sex of the newborn, rural/non-rural residence indicator, hypertension, diabetes, and chromosomal anomalies.

For our pregnancy cohort, the average treatment effect is the expected difference in the mean counterfactual birth weight if all women were exposed to ICS during pregnancy versus the counterfactual birth weight if all women were not [40]. The target parameters are the coefficients β j , j = 1, …, 22 of the MSM defined as: ψ ̃ 0 ( V ) = β 0 + j = 1 22 V ( j ) β j , with V = (V (1), …, V (22)) the set of potential EMs. Taking the sex of the newborn as an EM for example (V (j) = sex), β j is the difference in the CATE for women having male versus female children.

4.2 Results

Baseline characteristics of the pregnancy cohort are presented in Table 6. We first implemented a standard linear regression with main terms for all potential confounders and interaction terms between the treatment and the set V . The estimates of the coefficients of the interaction terms are given in Table 7. A variable was considered to be selected as an EM in the standard linear regression if the coefficient of the interaction term between that variable and the treatment had a p-value < 0.05 . This model concluded that leukoteriene-receptor antagonists and chromosomal anomalies are EMs. In addition, we implemented our LASSO methods using HAL for the estimation of the outcome expectation and propensity score. All of the covariates were included in the propensity score model as well as in the outcome model. Due to larger weights, a 5% truncation for the values of g n was used. The selected coefficients of the MSM and their estimated values are presented in Table 8. Three covariates (leukoteriene-receptor antagonists, warfarin one year before pregnancy, and chromosomal anomalies) were selected using the adaptive LASSO and two of them were significant (leukoteriene-receptor antagonists and chromosomal anomalies) using post-selection inference. Leukoteriene-receptor antagonists and chromosomal anomalies were thus selected as EMs in the association of taking ICS during pregnancy on birth weight. Although the naive linear model and our algorithm generate very similar sets of EMs, the coefficients of the selected EMs are different (compare Table 7 with Table 8). For example, the estimated coefficient of leukoteriene-receptor antagonist is around −0.17 in the adaptive LASSO while it is −0.365 using the linear model.

5 Discussion

In this paper, we proposed a doubly robust estimator for selecting effect modifiers (EMs) in an MSM for the CATE. We used the post selection inference method of Lee et al. [31] to produce post-selection confidence intervals.

Through simulation studies, we studied the performance of the proposed estimator. As well, we showed that our proposed estimator is doubly robust and performs well in a high dimensional setting but had a higher FCR along with an over-selection of non-EMs. We observed a slower convergence of our estimator when the outcome expectation model was misspecified. Work by Ju et al. [42] suggests that better performance might be obtained by incorporating outcome-inverse weighting in the penalty term when using HAL to estimate the propensity score. We also illustrated that the post-selection confidence interval produces good coverage proportions for the selected EMs. In a high dimensional case, we confirmed the observation of Zhao et al. [10] concerning the FCR which exceeded the nominal level in the presence of many noise covariates. Debiased Lasso [41] could be considered here in a high dimensional case as proposed in Zhao et al. (2017). In general, the overall performance of our estimator improved with the sample size. However, the blind usage of traditional methods like a regression with main terms and interactions between treatment and potential effect modifiers may produce biased results.

We also show theoretically that our estimator is doubly robust and also inherits the oracle properties of the adaptive LASSO. Linearity and sparsity are assumptions of Lemma 2 and they may be restrictive. However, by modeling the conditional average treatment effect on the linear scale, we are investigating effect modification on the absolute scale (difference between means) which is recommended [43]. If the linear model is too restrictive for some applications, we could increase the model capacity by adding higher order terms and interaction terms. Another option could be to use non-linear models or machine learning methods to model the pseudo-outcome ψ ̃ 0 ( V ) . Because of the difficulty for stakeholders to interpret a black-box marginal model [28], this approach may not be desireable when the goal is to discover effect modifiers and fit an interpretable model. Machine learning approaches may be more appropriate when the goal is identifying optimal treatment rules.

In our application, the results suggest that leukoteriene-receptor antagonists and chromosomal anomalies may modify the effect of ICS during pregnancy on birth weight for women with mild asthma. The estimated CATE is 0.18 lower for women taking leukoteriene-receptor antagonists. As leukoteriene-receptor antagonists are an addition to ICS, we can suppose that it is a marker for more severe asthma. In the presence of a chromosomal anomaly, the effect of ICS was estimated to be 0.78 lower. The linear regression with standard significance testing suggested the same but with different coefficient estimates. Such discrepancy may be due to the fact that the naive model doesn’t target MSM parameters and thus may not be able to model effect modification in the absence of confounding. In this finite sample setting, the regularization may possibly have shrunk the coefficient values relative to the truth. Our results point to the importance of using robust methodologies for selecting effect modifiers in well-defined causal models for estimating the conditional treatment effect.


Corresponding author: Mireille E. Schnitzer, Faculté de pharmacie, Université de Montréal, Pavillon Jean-Coutu, 2940 ch de la Polytechnique, Office #2236, Montreal, QC, Canada, E-mail:

Funding source: Natural Sciences and Engineering Research Council of CanadaHealth ResearchUniversité de Montréal

Award Identifier / Grant number: Unassigned

Award Identifier / Grant number: Unassigned

Award Identifier / Grant number: Unassigned

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This work was supported by the Natural Sciences and Engineering Research Council of Canada (Discovery Grant and Accelerator Supplement to MES), the Canadian Institutes of Health Research (New Investigator Salary Award to MES) and the Faculté de pharmacie at Université de Montréal (funding for AB and MES).

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Appendix

In the Appendix, we give the numerical results of the simulation study, the baseline characteristics of our pregnancy data, the results of our application and the proofs of the two lemmas (Table 9).

Figure 5: 
Percent coverage of the selective confidence interval associated to V
1 and V
3 for different sample size. Notation: Qcgc: Models for 





Q

̄




$\bar{Q}$



 and g are correctly specified, Qc: 





Q

̄




$\bar{Q}$



 is correctly specified, gc: g is correctly specified, HAL: 





Q

̄




$\bar{Q}$



 and g are estimated with HAL.
Figure 5:

Percent coverage of the selective confidence interval associated to V 1 and V 3 for different sample size. Notation: Qcgc: Models for Q ̄ and g are correctly specified, Qc: Q ̄ is correctly specified, gc: g is correctly specified, HAL: Q ̄ and g are estimated with HAL.

Table 1:

Simulation results (Data generating scenario 1).

Coef EM n = 1000 n = 10,000
β ̂ V %sel %Cov FCR β ̂ V %sel %Cov FCR
(1) Q ̄ & g model are correctly specified
V 1 T 0.46 98 96 5 0.49 100 95 6
V 2 F 0.00 21 0.00 12
V 3 T 0.98 100 95 0.99 100 95
V 4 F 0.00 21 0.00 13
(2) Q ̄ model is correctly specified
V 1 T 0.46 99 96 6 0.49 100 95 6
V 2 F 0.00 21 0.00 11
V 3 T 0.98 100 94 0.99 100 96
V 4 F 0.00 19 0.00 12
(3) g model is correctly specified
V 1 T 0.31 55 95 2 0.47 99 100 2
V 2 F 0.01 19 0.00 14
V 3 T 0.83 92 100 0.99 100 99
V 4 F 0.00 26 0.00 22
(4) Q ̄ & g model are estimated using HAL
V 1 T 0.46 99 95 6 0.49 100 95 6
V 2 F 0.00 21 0.00 12
V 3 T 0.98 100 94 1.00 100 95
V 4 F 0.00 22 0.00 13
(5) Naive linear model
V 1 T 0.69 95 83 19 0.69 100 11 65
V 2 F 0.15 12 88 0.15 67 33
V 3 T 1.35 100 56 1.36 100 0
V 4 F 0.01 37 96 0.00 47 95
(6) Linear model correctly specified
V 1 T 0.50 97 96 5 0.50 100 95 4
V 2 F 0.00 6 94 0.00 5 95
V 3 T 1.00 100 95 1.00 100 95
V 4 F 0.00 4 96 0.00 4 96
  1. Estimates taken over 1000 generated datasets. β ̂ V : average estimated value of the coefficients of the MSM, %Cov: percent coverage of the selective confidence interval × 100 (Standard CI for the linear model case), %sel: percent selection of variables × 100, FCR: False coverage rate × 100, EM: T (variable is an effect-modifier) and F (variable is not an effect-modifier). The true values of the coefficients are β V = (0.5, 0, 1, 0).

Table 2:

Simulation results (Data generating scenario 2).

Coef EM n = 1000 n = 10,000
β ̂ V %sel %Cov FCR β ̂ V %sel %Cov FCR
(1) Q & g model are correctly specified
V 1 T 0.47 99 96 5 0.49 100 95 5
V 2 F 0.00 20 0.00 13
V 3 T 0.98 100 95 1.00 100 95
V 4 F 0.00 23 0.00 12
(2) Q model is correctly specified
V 1 T 0.47 99 97 5 0.49 100 94 6
V 2 F 0.00 20 0.00 11
V 3 T 0.99 100 95 1.00 100 95
V 4 F 0.00 21. 0.00 11
(3) g model is correctly specified
V 1 T 0.32 55 99 2 0.47 99 99 2
V 2 F 0.01 19 0.00 14
V 3 T 0.85 94 98 0.99 100 99
V 4 F −0.01 24 0.00 21
(4) Q & g model are estimated using HAL
V 1 T 0.47 98 97 5 0.49 100 95 7
V 2 F 0.00 22 0.00 12
V 3 T 0.98 100 94 1.00 100 95
V 4 F 0.00 22 0.00 12
(6) Linear model correctly specified
V 1 T 0.50 89 96 5 0.50 100 95 5
V 2 F 0.00 6 94 0.00 6 94
V 3 T 1.00 100 94 1.00 100 95
V 4 F 0.00 4 97 0.00 4 96
  1. Estimates taken over 1000 generated datasets. β ̂ V : coefficients of the MSM, Cov: percent coverage of the selective confidence interval × 100, %sel: percent selection of variables × 100, FCR: False coverage rate × 100, EM: T (variable is an effect-modifier) and F (variable is not an effect-modifier). The true values of the coefficients are β V = (0.5, 0, 1, 0).

Table 3:

Simulation results (Data generating scenario 3).

Coef EM n = 1000 n = 10,000
β ̂ V %sel %Cov FCR β ̂ V %sel %Cov FCR
(1) Q & g model are correctly specified
V 1 T 0.44 94 97 5 0.49 100 96 5
V 2 F 0.00 23 0.00 16
V 3 T 0.97 100 95 1.00 100 97
V 4 F 0.00 23 0.00 17
(2) Q model is correctly specified
V 1 T 0.45 96 97 6 0.50 100 94 7
V 2 F 0.00 20 0.00 13
V 3 T 0.98 100 93 1.00 100 95
V 4 F 0.00 22 0.00 12
(3) g model is correctly specified
V 1 T 0.34 74 100 3 0.49 100 100 4
V 2 F 0.01 23 0.00 18
V 3 T 0.91 99 97 0.99 100 96
V 4 F 0.00 25 0.00 24
(4) Q & g model are estimated using HAL
V 1 T 0.45 95 95 6 0.49 100 95 5
V 2 F 0.00 24 0.00 16
V 3 T 0.98 100 94 1.00 100 96
V 4 F 0.00 23 0.00 16
(5) Naive linear model
V 1 T 0.60 89 93 10 0.59 100 63 43
V 2 F 0.10 76 92 0.10 35 65
V 3 T 1.21 100 81 1.21 100 58
V 4 F 0.01 38 96 −0.00 44 96
(6) Linear model correctly specified
V 1 T 0.50 98 96 5 0.50 100 95 5
V 2 F 0.00 4 96 0.00 5 95
V 3 T 1.00 100 95 1.00 100 95
V 4 F 0.00 5 95 0.00 5 95
  1. Estimates taken over 1000 generated datasets. β ̂ V : coefficients of the MSM, Cov: percent coverage of the selective confidence interval × 100, %sel: percent selection of variables × 100, FCR: False coverage rate × 100, EM: T (variable is an effect-modifier) and F (variable is not an effect-modifier). The true values of the coefficients are β V = (0.5, 0, 1, 0).

Table 4:

Simulation results for smaller sample size (n = 100).

Coef EM Scenario 1 Scenario 2 Scenario 3
β ̂ V %sel Cov FCR β ̂ V %sel Cov FCR β ̂ V %sel Cov FCR
(1) Q & g model are correctly specified
V 1 T 0.39 52 87 8 0.34 49 88 9 0.30 41 89 10
V 2 F −0.01 22 −0.01 25 0.02 24
V 3 T 0.85 86 94 0.78 80 96 0.78 71 93
V 4 F 0.01 28. 0.00 25 0.00 24
(2) Q model is correctly specified
V 1 T 0.38 53 91 7 0.36 50 88 8 0.29 41 89 10
V 2 F −0.03 27 0.00 21 0.01 20
V 3 T 0.83 85 98 0.79 8 97 0.76 72 93
V 4 F −0.02 25 0.00 27 0.00 21
(3) g model is correctly specified
V 1 T 0.24 20 97 9 0.24 25 98 6 0.26 25 91 9
V 2 F 0.04 16 0.04 1 0.04 26
V 3 T 0.51 29 90 0.59 45 95 0.68 47 88
V 4 F 0.01 21 0.02 23 0.00 25
(4) Q & g model are estimated using HAL
V 1 T 0.39 54 83 10 0.36 51 85 9 0.32 45 79 11
V 2 F 0.00 30 0.01 27 0.00 27
V 3 T 0.84 87 96 0.79 81 96 0.80 82 95
V 4 F 0.00 27 0.01 27 −0.02 24
  1. Estimates taken over 500 generated datasets. β ̂ V : coefficients of the MSM, Cov: percent coverage of the selective confidence interval × 100, %sel: percent selection of variables × 100, FCR: False coverage rate × 100, EM: T (variable is an effect-modifier) and F (variable is not an effect-modifier). The true values of the coefficients are β V = (0.5, 0, 1, 0).

Table 5:

Simulation results (Data generating scenario 1 with 50 noise covariates).

Coef EM n = 1000 n = 10,000
β ̂ V %sel %Cov FCR β ̂ V %sel %Cov FCR
(1) Estimates related to the potential EM that are not noise covariates
V 1 T 0.43 100 100 15 0.48 100 100 15
V 2 F 0.00 14 0.00 15
V 3 T 0.95 100 91 0.99 100 90
V 4 F 0.01 15 0.00 14
(2) Summary of the 50 potential EM that are noise covariates
Min −0.01 7.0 0.00 5
Q 1 0.00 12 0.00 11
Median 0.00 14 0.00 13
Q 3 0.00 16 0.00 15
Max 0.01 23 0.00 22
  1. Estimates taken over 100 generated datasets. β ̂ V : coefficients of the MSM, Cov: percent coverage of the selective confidence interval, %sel: percent selection of variables, FCR: False coverage rate, EM: T (variable is an effect-modifier) and F (variable is not an effect-modifier). The true values of the coefficients are β V = (0.5, 0, 1, 0, …, 0).

Table 6:

Baseline Characteristics of mothers in the cohort extraction (N = 4707).

Characteristics No ICS ICS
N (%) N (%)
Cohort size 2272 (100) 2435 (100)
Age
< 18 45 (1.9) 60 (2.4)
18–34 1958 (86.1) 2041 (83.8)
> 34 269 (11.8) 334(13.7)
Sex of the newborn 1149 (51.0) 1271 (52.0)
Welfare recipient 1126 (50.0) 1429 (59.0)
Urban residence 476 (18.0) 407 (20.0)
Hypertension 61 (3.0) 83 (3.0)
Diabetes 73 (3.0) 81 (3.0)
COPD 28 (1.0) 56 (2.0)
Cyanotic heart disease 7 (0.0) 8 (0.0)
Antiphospholipid syndrome 12 (1.0) 13 (1.0)
Uterine disorder 264 (12.0) 331 (14.0)
Epilepsy 18 (1.0) 23 (1.0)
Obesity 87 (4.0) 127 (5.0)
Lupus 1 (0.0) 2 (0.0)
Collagenous vascular disease 6 (0.0) 6 (0.0)
Cushing’s syndrome 4 (0.0) 4 (0.0)
Oral corticosteroids one year before pregnancy 234 (10.0) 281(12.0)
Oral SABA use one year before pregnancy 16 (1.0) 8 (0.0)
At least one dose of inhaled SABA taken per week 1523 (67.0) 1332 (55.0)
HIV 3 (0.0) 1 (0.0)
Cytomegalovirus infection 3 (0.0) 12 (0.0)
Leukotriene-receptor antagonists 33 (1.0) 30 (1.0)
Theophylline use one year before pregnancy 0 (0.0) 0 (0.0)
Intranasal corticosteroids 243 (11.0) 318 (13.0)
Folic acid one year before pregnancy 18 (1.0) 43 (2.0)
Teratogens taken one year before 0 (0.0) 0 (0.0)
Medication for epilepsy one year before pregnancy 29 (1.0) 48 (2.0)
Warfarin one year before pregnancy 7(0.0) 10 (0.0)
Use of beta-bloqueur one year before pregnancy 19 (1.0) 26 (1.0)
Asthma exacerbation one year before pregnancy 377 (17.0) 411 (17.0)
Hospitalization for asthma 1079 (47.0) 809 (33.0)
Chromosomal anomalies 6 (0.0) 4 (0.0)
Cumulative dose of ICS in days (mean (SD)) 51.6 (72.8) 54.0 (85.8)
One year cumulative dose of ICS before pregnancy (mean (SD)) 151 (32.0) 101.5 (126.3)
At least one emergency department visit for asthma 260 (7.0) 265 (19.0)
At least one hospitalization for asthma 5 (0.0) 8 (1.0)
Table 7:

Estimates of the coefficients associated with interaction terms using the naive linear model (n = 4707).

Variables Estimate ( β ̂ j ) STD p-Value
Intercept 3.153
CS:At least one dose of inhaled SABA taken per week −0.002 0.039 0.940
CS:Leukotriene-receptor antagonists −0.365 0.142 0.010*
CS:Intranasal corticosteroids 0.063 0.051 0.214
CS:Folic acid one year before pregnancy −0.129 0.159 0.415
CS:Medication for epilepsie −0.136 0.135 0.313
CS:Warfarin −0.386 0.277 0.164
CS:Beta-blockers −0.287 0.173 0.097
CS:Asthma exacerbation 0.062 0.069 0.368
CS:At least one hospitalization for asthma 0.017 0.036 0.624
CS:At least one emergency department visit for asthma 0.067 0.055 0.223
CS:COPD 0.141 0.130 0.280
CS:Cyanotic heart disease −0.345 0.292 0.237
CS:Oral corticosteroids one year before −0.081 0.081 0.319
CS:Obesity 0.053 0.080 0.508
CS:Uterine disorder −0.036 0.050 0.460
CS:Oral SABA use one year before −0.025 0.244 0.918
CS:Antiphospholipid syndrome 0.394 0.227 0.083
CS:Sex of new born −0.031 0.032 0.335
CS:Welfare recipient −0.043 0.033 0.187 1
CS:Rural/non-rural residence indicator 0.021 0.042 0.602
CS:Hypertension 0.028 0.098 0.774
CS:Diabetes −0.105 0.092 0.255
CS:Chromosomal anomalies −1.230 0.361 0.000 6*
CS:Cytomegalovirus infection 0.146 0.360 0.683
Table 8:

Estimates of the selected MSM coefficients using adaptive lasso (n = 4707) with 95% post selection interval for the selected variables.

Variables Estimate ( β ̂ j ) CI Low CI up
High adaptive LASSO for Q & g
Intercept 0.018
Leukotriene-receptor antagonists* −0.177 −0.502 −0.031
Warfarin −0.146 −0.745 0.311
Chromosomal anomalies* −0.777 −1.420 −0.285
  1. *: means interval excluded the null value.

Table 9:

Computation time in seconds for the simulation (run on a single dataset) and application.

Methods n = 1000 n = 4707 n = 10,000
Low-dimensional
Parametric regression for Q & g 0.16s 4.62s
Highly adaptive LASSO for Q & g 1.09s 9.06s
High-dimensional
Highly adaptive LASSO for Q & g 49.75s 600s
Data analysis
Highly adaptive LASSO for Q & g 115.2s

Proof of Lemma 1

Denote Q ̄ n (respectively g n ) an estimator of Q ̄ (respectively g). We have:

E P 0 ( D ( Q ̄ n , g n ) | V ) = E P 0 2 A 1 g n ( A | W ) ( Y Q ̄ n ( A , W ) ) + Q ̄ n ( 1 , W ) Q ̄ n ( 0 , W ) | V = E P 0 2 A 1 g n ( A | W ) Y Q ̄ n ( A , W ) | V + E P 0 { Q ̄ n ( 1 , W ) Q ̄ n ( 0 , W ) | V } + ψ 0 ( V ) ψ 0 ( V ) = ψ 0 ( V ) + E P 0 { Q ̄ n ( 1 , W ) Q ̄ n ( 0 , W ) } { Q ̄ 0 ( 1 , W ) Q ̄ 0 ( 0 , W ) } | V + E P 0 2 A 1 g n ( A | W ) ( Y Q ̄ n ( A , W ) ) | V = ψ 0 ( V ) + W { Q ̄ n ( 1 , W ) Q ̄ n ( 0 , W ) } { Q ̄ 0 ( 1 , W ) Q ̄ 0 ( 0 , W ) } + P 0 ( 1 | W ) g n ( 1 | W ) Q ̄ 0 ( 1 , W ) Q ̄ n ( 1 , W ) P 0 ( 0 | W ) g n ( 0 | W ) Q ̄ 0 ( 0 , W ) Q ̄ n ( 0 , W ) d P 0 ( W | V ) = ψ 0 ( V ) + W P 0 ( 1 | W ) g n ( 1 | W ) 1 Q ̄ 0 ( 1 , W ) Q ̄ n ( 1 , W ) + P 0 ( 0 | W ) g n ( 0 | W ) 1 Q ̄ 0 ( 0 , W ) Q ̄ n ( 0 , W ) d P 0 ( W | V )

Then E P 0 ( D ( Q ̄ n , g n ) | V ) ψ 0 ( V ) if g n (A| W ) or Q ̄ n ( A , W ) is consistently estimated.

Proof of Lemma 2

Let D n = D ( Q ̄ n , g n ) (respectively D 0 = D ( Q ̄ 0 , g 0 ) ) represent the estimated pseudo function (respectively the true pseudo-outcome). Our method minimizes the expected risk function below with respect to β:

D n j V ( j ) β j 2 + λ j = 1 p w ̂ j | β j |

where ω ̂ j = 1 / | β ̃ j | γ , j = 1, …, p, for some γ > 0.

Let ϵ n = D n − ∑ j V (j) β j be the residual of the penalized linear regression of D n on V . The proof follows essentially the one of Zou ([17]). We have to show that ϵ n T V / n follows a normal distribution with mean zero and a finite variance.

Indeed, one can write

ϵ n = ( D n D 0 ) + D 0 j V ( j ) β j .

ϵ n T V / n = n P n ( D n D 0 ) T V R 1 + n P n D 0 j V ( j ) β j T V R 2 .

with P n denotes the empirical measure. ϵ 0 = D 0 − ∑ j V (j) β j is the residual of the penalized linear regression of the oracle pseudo function D 0 on V . Therefore, if we assume 1 n V T V C with C a positive definite matrix, we have R 2 d N ( 0 , σ 2 C ) .

One can write

n P n ( D n D 0 ) T V n P n ( D n D 0 ) T V

Semenova and Chernozhukov ([20]) showed in Lemma A.3, given their Assumption 3.5, is that

n P n ( D n D 0 ) T V = o ( 1 )

Therefore, n P n ( D n D 0 ) T V = o ( 1 ) which yields the result.

References

1. Green, DP, Kern, HL. Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Publ Opin Q 2012;76:491–511. https://doi.org/10.1093/poq/nfs036.Search in Google Scholar

2. Chipman, HA, George, EI, McCulloch, RE. BART: Bayesian additive regression trees. Ann Appl Stat 2010;4:266–98. https://doi.org/10.1214/09-aoas285.Search in Google Scholar

3. Imai, K, Ratkovic, M. Estimating treatment effect heterogeneity in randomized program evaluation. Ann Appl Stat 2013;7:443–70. https://doi.org/10.1214/12-aoas593.Search in Google Scholar

4. Nie, X, Wager, S. Quasi-oracle estimation of heterogeneous treatment effects. 2017. arXiv:1712.04912.10.1093/biomet/asaa076Search in Google Scholar

5. Luo, W, Wu, W, Zhu, Y. Learning heterogeneity in causal inference using sufficient dimension reduction. J Causal Inference 2018;7:20180015. https://doi.org/10.1515/jci-2018-0015.Search in Google Scholar

6. Wager, S, Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. Ann Appl Stat 2018;112:1228–42. https://doi.org/10.1080/01621459.2017.1319839.Search in Google Scholar

7. Breiman, L. Random forests. Machine Learning, 2001;45:5–32.10.1023/A:1010933404324Search in Google Scholar

8. Powers, S, Qian, J, Jung, K, Schuler, A, Shah, N, Hastie, T, et al.. Some methods for heterogeneous treatment effect estimation in high dimensions. Stat Med 2018;2037:1767–87. https://doi.org/10.1002/sim.7623.Search in Google Scholar PubMed PubMed Central

9. Friedman, J. Multivariate adaptive regression splines. Ann Stat 1991;19:1–67. https://doi.org/10.1214/aos/1176347963.Search in Google Scholar

10. Zhao, Q, Small, DS, Ertefaie, A. Selective inference for effect modification via the lasso. 2018. arXiv:1705.08020.10.1111/rssb.12483Search in Google Scholar PubMed PubMed Central

11. Robinson, PM. Root-N-consistent semiparametric regression. Econometrica 1998;56:931–54. https://doi.org/10.2307/1912705.Search in Google Scholar

12. van der Laan, MJ, Rubin, D. Targeted maximum likelihood learning. Int J Biostat 2006;2. https://doi.org/10.2202/1557-4679.1043. 1016090934.Search in Google Scholar

13. van der Laan, MJ, Rose, S. Targeted learning: causal inference for observational and experimental data. In: Springer Series in Statistics. Springer, New York, NY; 2011.10.1007/978-1-4419-9782-1Search in Google Scholar

14. Scharfstein, DO, Rotnitzky, A, Robins, JM. Adjusting for nonignorable dropout using semiparametric nonresponse models, (with discussion and rejoinder). J Am Stat Assoc 1999;94:1096–1120. https://doi.org/10.1080/01621459.1999.10473862.Search in Google Scholar

15. Bang, H, Robins, JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–72. https://doi.org/10.1111/j.1541-0420.2005.00377.x.Search in Google Scholar PubMed

16. Benkeser, D, Carone, M, van der Laan, MJ, Gilbert, P. Doubly robust nonparametric inference on the average treatment effect. Biometrika 2017;104:863–80. https://doi.org/10.1093/biomet/asx053.Search in Google Scholar PubMed PubMed Central

17. Lee, S, Okui, R, Whang, YJ. Doubly robust uniform confidence band for the conditional average treatment effect function. J Appl Econom 2017;32:1207–25. https://doi.org/10.1002/jae.2574.Search in Google Scholar

18. Zheng, W, Luo, Z, van der Laan, MJ. Marginal structural models with counterfactual effect modifiers. Int J Biostat 2018;14:20180039. https://doi.org/10.1515/ijb-2018-0039.Search in Google Scholar PubMed PubMed Central

19. Kennedy, EH. Optimal doubly robust estimation of heterogeneous causal effects. 2020. arXiv:2004.14497v1.Search in Google Scholar

20. Semenova, V, Chernozhukov, V. Debiased machine learning of conditional average treatment effects and other causal functions. Econom J 2020;24:1–49. https://doi.org/10.1093/ectj/utaa027.Search in Google Scholar

21. van der Laan, MJ. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. In: U.C. Berkeley Division of Biostatistics Working Paper Series; 2013.Search in Google Scholar

22. Zhao, Y, Laber, EB, Ning, Y, Saha, S, Sands, B. Efficient augmentation and relaxation learning for individualized treatment rules using observational data. 2019. arXiv:1901.00663.Search in Google Scholar

23. Kennedy, EH, McHugh, MD, Small, DS. Non-parametric methods for doubly robust estimation of continuous treatment effects. J Roy Stat Soc B 2017;79:1229–45. https://doi.org/10.1111/rssb.12212.Search in Google Scholar PubMed PubMed Central

24. Zou, H. The adaptive LASSO and its oracle properties. J Am Stat Assoc 2006;101:1418–29. https://doi.org/10.1198/016214506000000735.Search in Google Scholar

25. Rubin, D. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974;66:688–701. https://doi.org/10.1037/h0037350.Search in Google Scholar

26. Cole, SR, Frangakis, CE. The consistency statement in causal inference: a definition or an assumption? Epidemiology 2009;20:3–5. https://doi.org/10.1097/ede.0b013e31818ef366.Search in Google Scholar

27. Hernan, MA, Robins, JM. Causal inference: what if. FL: Chapman and Hall-CRC; 2019.Search in Google Scholar

28. Zhao, Q, Hastie, T. Causal interpretations of black-box models. J Bus Econ Stat 2019;39:272–81. https://doi.org/10.1080/07350015.2019.1624293.Search in Google Scholar PubMed PubMed Central

29. Tibshirani, R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B 1996;58:267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.Search in Google Scholar

30. Benkeser, D, van der Laan, MJ. The highly adaptive LASSO estimator. In: 2016 IEEE international conference on data science and advanced analytics. IEEE; 2016:689–96 pp.10.1109/DSAA.2016.93Search in Google Scholar PubMed PubMed Central

31. Lee, JD, Sun, DL, Sun, Y, Taylor, JE. Exact post-selection inference, with application to the LASSO. Ann Stat 2016;44:907–27. https://doi.org/10.1214/15-aos1371.Search in Google Scholar

32. Rubin, D, van der Laan, MJ. A doubly robust censoring unbiased transformation. Int J Biostat 2007;3. https://doi.org/10.2202/1557-4679.1052. 22550646.Search in Google Scholar PubMed

33. Rubin, D, van der Laan, MJ. Extending marginal structural models through local, penalized, and additive learning. In: U.C. Berkeley Division of Biostatistics Working Paper Series; 2006.Search in Google Scholar

34. Yuan, M, Lin, Y. On the non-negative garrotte estimator. J Roy Stat Soc B 2007;69:143–61. https://doi.org/10.1111/j.1467-9868.2007.00581.x.Search in Google Scholar

35. Chernozhukov, V, Chetverikov, D, Demirer, M, Duflo, E, Hansen, C. Newey, W, et al.. Double/debiased machine learning for treatment and structural parameters. Econom J 2018;21:C1–C68. https://doi.org/10.1111/ectj.12097.Search in Google Scholar

36. Tibshirani, R, Taylor, J, Loftus, J, Reid, S. Selective inference: tools for post-selection inference. 2019. Available from: https://CRAN.R-project.org/package=selectiveInference. 2017b.Search in Google Scholar

37. Hejazi, NS, Coyle, JR, van der Laan, MJ. hal9001: the scalable highly adaptive lasso. 2020. Available from: https://github.com/tlverse/hal9001.10.21105/joss.02526Search in Google Scholar

38. Firoozi, F, Lemire, C, Beauchesne, MF, Forget, A, Blais, L. Development and validation of database indexes of asthma severity and control. Thorax 2007;62:581–7. https://doi.org/10.1136/thx.2006.061572.Search in Google Scholar PubMed PubMed Central

39. Cossette, B, Forget, A, Beauchesne, MF, Rey, E, Larivée, P, Battista, MC, et al.. Impact of maternal use of asthma-controller therapy on perinatal outcomes. Thorax 2013;68:724–30. https://doi.org/10.1136/thoraxjnl-2012-203122.Search in Google Scholar PubMed

40. Bahamyirou, A, Blais, L, Forget, A, Schnitzer, ME. Understanding and diagnosing the potential for bias when using machine learning methods with doubly robust causal estimators. Stat Methods Med Res 2018;28:1637–50. https://doi.org/10.1177/0962280218772065.Search in Google Scholar PubMed

41. Javanmard, A, Montanari, A. Confidence intervals and hypothesis testing for high-dimensional regression. J Mach Learn Res 2014;15:2869–909.Search in Google Scholar

42. Ju, C, Benkeser, D, van der Laan, MJ. Robust inference on the average treatment effect using the outcome highly adaptive lasso. Biometrics 2020; 76:109–18. https://doi.org/10.1111/biom.13121.Search in Google Scholar PubMed

43. VanderWeele, TJ, Knol, MJ. A tutorial on interaction. Epidemiology 2014;173:731–8. https://doi.org/10.1515/em-2013-0005.Search in Google Scholar

Received: 2020-05-22
Revised: 2021-10-08
Accepted: 2021-12-09
Published Online: 2022-01-04

© 2021 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 5.6.2023 from https://www.degruyter.com/document/doi/10.1515/ijb-2020-0073/html
Scroll to top button