Abstract
In causal inference, the research interest often lies in the regression causal effect, which is the mean difference in the potential outcomes conditional on the covariates. In this paper, we use sufficient dimension reduction to estimate a lower-dimensional linear combination of the covariates that is sufficient to model the regression causal effect. Compared with the existing applications of sufficient dimension reduction in causal inference, our approaches are more efficient in reducing the dimensionality of the covariates and avoid estimating the individual outcome regressions. The proposed approaches can be used in three ways to assist in modeling the regression causal effect: to conduct variable selection, to improve the estimation accuracy, and to detect heterogeneity. Their usefulness is illustrated by both simulation studies and a real data example.
1 Introduction
Causal inference has been widely applied for decades to draw cause-and-effect conclusions from observational studies, in which treatments are assigned to observations in a non-random fashion. In many cases, theories in causal inference are developed under the potential outcome framework [1]. That is, when the treatment assignment is binary, i.e., either treated or untreated, the outcome variable in the hypothetical complete data set has two components
Let T be the treatment assignment with support {0,1}. Since for each subject, only
under which the distribution of
Because the regression causal effect is only a part of the two-dimensional function
Then η must be estimated in the first step. However, this estimation would not be needed had we known that the regression causal effect is a constant, in which case we could estimate
In addition to fitting the regression function, another common interest about the regression causal effect is to find a pre-assumed low-dimensional transformation of the covariates, denoted by
where f is a free and unknown function and the error term ϵ satisfies
In the literature, estimation of
When
In this paper, we use sufficient dimension reduction to propose a new and model-free estimator of
2 Sufficient dimension reduction
Sufficient dimension reduction is a family of methods that aims to reduce the dimension of covariates prior to subsequent modeling. When the regression of a response variable W on the p-dimensional covariates X is of interest, it assumes the existence of
where f is a free and unknown function and ϵ satisfies
Existing methods for estimating the central mean subspace include ordinary least squares [15], principal Hessian directions [16], iterative Hessian transformations [14], minimal average variance estimation [17], and other semiparametric methods [18], [19]. A method is called Fisher-consistent if it recovers a subspace of
where β spans
where β spans
Because ordinary least squares is built upon
A separate issue in sufficient dimension reduction is to estimate the dimension d of
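To make two of the moment-based methods listed above concrete, the following is a minimal sketch of the ordinary least squares direction and of principal Hessian directions for a generic response W. The function names, the simulated data, and the sample size are illustrative assumptions and are not taken from the paper. Ordinary least squares tends to capture linear trends in the regression function, while principal Hessian directions capture curvature.

```python
import numpy as np

def ols_direction(X, w):
    """Ordinary least squares candidate vector: Sigma^{-1} Cov(X, W)."""
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / len(X)
    cov_xw = Xc.T @ (w - w.mean()) / len(X)
    return np.linalg.solve(sigma, cov_xw)

def phd_directions(X, w, d):
    """Principal Hessian directions: the d eigenvectors with largest absolute
    eigenvalue of Sigma^{-1} E[(W - EW)(X - EX)(X - EX)^T] Sigma^{-1}."""
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / len(X)
    sigma_wxx = (Xc * (w - w.mean())[:, None]).T @ Xc / len(X)
    sigma_inv = np.linalg.inv(sigma)
    M = sigma_inv @ sigma_wxx @ sigma_inv
    M = (M + M.T) / 2  # symmetrize against rounding error
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(-np.abs(vals))
    return vecs[:, order[:d]]

# Illustrative data (assumed): a linear signal in x1 and a quadratic signal in x2.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 5))
w_lin = X[:, 0] + 0.1 * rng.standard_normal(5000)
w_quad = X[:, 1] ** 2 + 0.1 * rng.standard_normal(5000)

b_ols = ols_direction(X, w_lin)               # should align with the first coordinate
b_phd = phd_directions(X, w_quad, d=1)[:, 0]  # should align with the second coordinate
```

In this toy setting each method recovers the direction it is designed for; combining candidates from several such methods is one way to cover both linear and nonlinear structure.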
3 The central causal effect subspace
Before going into more detail, we introduce notation and some regularity conditions. Let
For any real vector v, denote its Euclidean norm by
When
In the literature, multiple papers have studied the application of sufficient dimension reduction in causal inference. Ghosh [24] applied it to the propensity score and introduced
Because it is the regression causal effect, rather than the outcome regressions or the propensity score, that serves as the primary interest in causal inference, it must be
Because
which is observable from the data and can substitute
This observation is the key to our theoretical development. The proof of (9) is straightforward and is omitted. Its weaker version
Theorem 1.
For
The proof of this theorem can be found in Appendix A.1. In practice, the propensity score is unknown and needs to be estimated. Throughout the article, we assume that a consistent estimator
For ease of asymptotic study, we additionally assume
(C1) Assume
This asymptotic linearity assumption holds for logistic regression, the covariate balancing propensity score, and the super learner, among others, so it is fairly general. The case where (C1) is violated is discussed in § 8.
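As an illustration of plugging in an estimated propensity score, the sketch below fits a logistic regression by Newton-Raphson and forms an observable surrogate for the individual causal effect. The surrogate used here, Y(T − π(X)) / [π(X)(1 − π(X))], is the standard inverse-propensity transformed outcome, whose conditional mean given X equals the regression causal effect under unconfoundedness and common support; it is an assumption of this sketch and is not claimed to be the exact expression in (9). All function names and the simulated model are hypothetical.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25):
    """Logistic regression of t on X (with intercept) by Newton-Raphson."""
    Z = np.column_stack([np.ones(len(X)), X])
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ coef))
        grad = Z.T @ (t - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        coef = coef + np.linalg.solve(hess, grad)
    return coef

def propensity(X, coef):
    """Fitted propensity scores P(T = 1 | X) from logistic coefficients."""
    Z = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Z @ coef))

def surrogate_outcome(y, t, pi):
    """Inverse-propensity transformed outcome (assumed form, see lead-in)."""
    return y * (t - pi) / (pi * (1.0 - pi))

# Simulated example (assumed model): true regression causal effect is 1 + x1.
rng = np.random.default_rng(1)
n = 20000
X = rng.standard_normal((n, 3))
pi_true = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))
t = rng.binomial(1, pi_true)
y0 = X[:, 1] + 0.5 * rng.standard_normal(n)
y1 = y0 + 1.0 + X[:, 0]
y = np.where(t == 1, y1, y0)   # only one potential outcome is observed

coef = fit_logistic(X, t)
w = surrogate_outcome(y, t, propensity(X, coef))
slope = np.polyfit(X[:, 0], w, 1)[0]   # regress the surrogate on x1
```

In this simulation the sample mean of the surrogate should be close to the average causal effect, and its regression on x1 should recover a slope near one, even though neither outcome regression is fitted.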
Using
Step 0. Let
Step 1. Estimate
Step 2. Estimate
Step 3. Let
In Step 3, one needs to determine the rank of
We next develop the asymptotic normality of the matrix estimator
Theorem 2.
Suppose the unconfoundedness assumption (1), the common support condition (7), and (C1) hold, and the sample observations are independent. As
The proof of this theorem can be found in Appendix A.2. As mentioned in § 1, an estimator of the central causal effect subspace can be used in three ways: to perform variable selection for the regression causal effect, to improve the estimation accuracy of the regression causal effect, and to detect the heterogeneity of the regression causal effect. We next discuss these in detail.
4 A sparse estimation
To enhance the interpretability of the central causal effect subspace and its estimator, we now conduct sparse sufficient dimension reduction. That is, in addition to (4) with
This assumption is commonly adopted in variable selection. It means that all the components of
To incorporate the sparsity structure into the ensemble moment-based estimator, we follow Chen, Zou, and Cook [29] to convert the eigen-decomposition of
Step 4. Let
subject to
In (13), θ is the tuning parameter that determines how sparse the resulting estimate is. Chen et al. [29] suggested selecting θ by minimizing a Bayesian information criterion, which we slightly modify to be
where
To minimize (13), Chen et al. [29] suggested an algorithm that alternates between the local quadratic approximation and spectral decomposition until convergence. Chen et al. [29] also showed the selection consistency and the oracle property of the resulting sparse estimator. These properties carry over directly to our case; the proof is omitted.
Theorem 3.
Let
By Theorem 3, the sparse ensemble moment-based estimator consistently selects the active set for the regression causal effect, and is asymptotically as accurate as the oracle estimator. A simulation study (see simulation study 1 of the Supplementary Material) showed that, in finite samples, it consistently outperforms the ordinary ensemble moment-based estimator when the sparsity assumption (12) holds. This differs from the commonly observed phenomenon in variable selection, where sparse estimators are always suboptimal to their ordinary counterparts because of larger finite-sample bias. However, it is reasonable in the sufficient dimension reduction scenario, as the estimation accuracy is measured not for the coefficients of individual covariates in the active set but for the entire central mean subspace, and this accuracy improves once the estimation error associated with the irrelevant covariates is removed.
The sparse ensemble moment-based estimator inherits the advantage that it avoids fitting the individual outcome regressions. This is shared by the variable selection procedure in Tian et al. [6], not by the others mentioned in § 1.
5 Estimation of the regression causal effect
Equation (9) suggests estimating the regression causal effect by equivalently estimating
As an illustration, we next use local linear regression to estimate the regression causal effect. Based on the scatter plot of the reduced covariates and this estimate, one may also adopt appropriate parametric models to further improve the estimation. Following Step 4 in § 4, we have
Step 5. For any
Here,
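The local linear step can be sketched as follows, assuming a Gaussian product kernel with a single bandwidth h; in the setting of this section, the rows of u would be the reduced covariates and w an observable surrogate for the individual causal effect. The function name, the kernel, and the bandwidth are illustrative assumptions and are not prescribed by the paper.

```python
import numpy as np

def local_linear(u, w, u0, h):
    """Local linear estimate of E[W | U = u0].

    u  : (n, d) array of reduced covariates
    w  : (n,) array of responses
    u0 : (d,) evaluation point
    h  : bandwidth of the Gaussian kernel
    """
    diff = u - u0                                        # centered at u0
    k = np.exp(-0.5 * np.sum((diff / h) ** 2, axis=1))   # kernel weights
    design = np.column_stack([np.ones(len(u)), diff])
    sw = np.sqrt(k)
    # Weighted least squares: minimize sum_i k_i (w_i - a - b^T diff_i)^2.
    coef, *_ = np.linalg.lstsq(design * sw[:, None], w * sw, rcond=None)
    return coef[0]                                       # intercept = fitted value at u0

# Illustrative check (assumed data): a univariate reduced covariate with a sine trend.
rng = np.random.default_rng(2)
n = 2000
u = rng.standard_normal((n, 1))
w = np.sin(u[:, 0]) + 0.2 * rng.standard_normal(n)
est = local_linear(u, w, u0=np.array([0.5]), h=0.3)
```

Because the smoothing is carried out over the reduced covariates rather than all of X, the bandwidth only has to cope with a d-dimensional design, which is the practical gain from dimension reduction in this step.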
In the literature, a common strategy to estimate the regression causal effect is to treat it as the difference between the individual outcome regression functions, and estimate the latter within each treatment group. Unfortunately, this strategy cannot be carried over if we use the reduced covariates from the central causal effect subspace in place of X. The reason is that these reduced covariates may not be sufficient to predict the individual outcomes, so the unconfoundedness assumption (1) would fail and the individual outcome regression would not be estimable from the observed data. For example, in Model (2) where
6 Detecting heterogeneous causal effect
Based on the asymptotic normality result in Theorem 2, all the aforementioned order-determination methods in § 2 can be applied to detect whether the central causal effect subspace is zero-dimensional, which corresponds to a homogeneous regression causal effect. As an example, we employ the hypothesis testing procedure proposed in Bura and Yang [21].
The test is based on the observation that when
in distribution, where
To estimate the
Given the estimates
Compared with the parametric and nonparametric tests in Crump et al. [11], our test enjoys the advantages of both: it can detect a
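The logic of a zero-dimension test can be sketched numerically. If a matrix estimator M_hat satisfies √n vec(M_hat) → N(0, Λ) when the true matrix is zero, then n‖M_hat‖_F² converges to a weighted sum of independent χ²₁ variables whose weights are the eigenvalues of Λ. The sketch below approximates the p-value by Monte Carlo; the function names, the Monte Carlo approximation, and the inputs `M_hat` and `Lambda_hat` are assumptions for illustration, not the exact procedure of the paper.

```python
import numpy as np

def weighted_chi2_pvalue(stat, weights, n_draws=20000, seed=0):
    """Monte Carlo tail probability P(sum_j weights_j * chi2_1 >= stat)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    draws = rng.chisquare(1, size=(n_draws, len(weights))) @ weights
    return float(np.mean(draws >= stat))

def zero_dimension_test(M_hat, Lambda_hat, n):
    """Test H0: M = 0, assuming sqrt(n) vec(M_hat) -> N(0, Lambda) under H0.

    M_hat      : estimated candidate matrix
    Lambda_hat : estimated asymptotic covariance of vec(M_hat)
    n          : sample size
    """
    stat = n * np.sum(M_hat ** 2)               # n times squared Frobenius norm
    weights = np.linalg.eigvalsh(Lambda_hat)
    weights = np.clip(weights, 0.0, None)       # guard against tiny negative eigenvalues
    return stat, weighted_chi2_pvalue(stat, weights)

# Sanity checks on hypothetical inputs.
p_single = weighted_chi2_pvalue(3.841, [1.0])   # close to 0.05 for one chi2_1 term
stat0, p0 = zero_dimension_test(np.zeros((2, 2)), np.eye(4), n=100)
```

A small p-value would indicate a nonzero-dimensional subspace, that is, a heterogeneous regression causal effect.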
7 Simulation studies
We use the simulated models to illustrate the effectiveness of the sparse ensemble moment-based estimator in estimating the regression causal effect and in variable selection, and that of the proposed
7.1 Estimating the regression causal effect
We consider the following four models. In each model, the treatment assignment T is generated independently of the outcomes conditional on the propensity score, so the unconfoundedness assumption (1) holds.
Model 1.
Model 2.
Model 3.
Model 4.
For
In Models 1 and 2, the regression causal effect is linear, although the individual outcome regression functions are more complex in Model 2 with non-monotone structure. Theoretically, the proposed methods will perform consistently in both models, while all the existing methods that rely on individual outcome regressions are expected to be competent for Model 1. In Models 3 and 4, both the regression causal effect and the outcome regression functions are nonlinear. In conjunction with the various propensity scores, these models represent a variety of cases in practice.
From the dimension reduction point of view, the ensemble space
As mentioned in § 4, we use the sparse ensemble moment-based estimator in estimating
To measure the overall deviation of an estimator
where
Table 1. Accuracy of regression causal effect estimation. The number in the top (bottom) of each cell is the sample average (standard deviation) of the deviation between the regression causal effect and its estimate over 200 replicates, multiplied by 100. “Oracle” stands for the oracle estimator, “S-ENS” for the proposed method based on the sparse ensemble moment-based estimator, “LZG”, “GCL”, “GCQ” and “RF” for the semiparametric, the linear, the quadratic, and the random forest G-computations, “GEL”, “GEQ”, “WML”, and “WMQ” for G-estimation and Wallace and Moodie’s approach with the regression causal effect fitted by linear model and quadratic model, respectively.
| Model | Oracle | S-ENS | LZG | GCL | GCQ | RF | GEL | GEQ | WML | WMQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 6.6 | 6.6 | 8.9 | 8.8 | 11.4 | 29.0 | 9.9 | 12.4 | 9.9 | 12.4 |
| | (4.4) | (4.4) | (3.8) | (3.8) | (3.1) | (3.3) | (2.3) | (2.2) | (2.3) | (2.3) |
| 2 | 8.9 | 16.8 | 15.5 | 52.6 | 14.1 | 55.7 | 16.6 | 15.9 | 16.7 | 15.9 |
| | (2.7) | (4.1) | (2.3) | (5.8) | (0.8) | (2.5) | (3.7) | (2.8) | (3.6) | (2.8) |
| 3 | 18.7 | 20.9 | 21.2 | 57.3 | 19.5 | 28.2 | 71.6 | 19.6 | 71.5 | 19.6 |
| | (4.9) | (6.9) | (5.3) | (6.5) | (3.4) | (2.2) | (6.8) | (3.1) | (6.8) | (3.2) |
| 4 | 19.5 | 19.9 | 53.8 | 125.6 | 87.7 | 96.4 | 125 | 87.6 | 125 | 87.6 |
| | (3.2) | (2.9) | (3.5) | (5) | (6.6) | (12.7) | (5.2) | (6.4) | (5.2) | (6.3) |
From Table 1, the linear G-computation is consistent only in Model 1. It is slightly improved by the linear G-estimation, as the latter correctly specifies both the regression causal effect and the propensity score in Models 1 and 2. The quadratic G-computation is consistent in Models 1–3, as is the quadratic G-estimation. The dynamic weighted least squares approaches perform almost identically to the G-estimations, which conforms to the results in Wallace and Moodie [9]. All these methods fail in Model 4, where they mis-specify the parametric models for the regression causal effect. On the other hand, the random forest estimator, which is nonparametric, lacks effectiveness in most models due to the limited sample size.
By contrast, both the proposed estimator and the semiparametric G-computation are consistent in all the models. In particular, the proposed estimator outperforms the linear G-computation in Model 1, for which the latter adopts parsimonious and appropriate parametric models. This is not surprising, as the proposed estimator uses additional sparsity structure in the model. Compared with the semiparametric G-computation, the proposed estimator is substantially superior in Model 4, where the outcome regression functions are complex. Referring to the discussion in § 5, this conforms to our theoretical expectation. Compared with the oracle estimator, the proposed estimator is less effective in Model 2, indicating that the cost of estimating the central causal effect subspace can be non-negligible.
7.2 Variable selection
We now examine the variable selection consistency of the sparse ensemble moment-based estimator, by evaluating its true positive rate and false positive rate of selecting the active set of covariates, when applied to Models 1–4.
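The two rates can be computed from an estimated basis matrix as sketched below, assuming that a covariate counts as selected when its row in the estimated basis is nonzero, which is how row-wise sparsity is usually read. The function name, the tolerance, and the toy matrix are hypothetical.

```python
import numpy as np

def selection_rates(beta_hat, active, tol=1e-8):
    """True and false positive rates of the covariates selected by beta_hat.

    beta_hat : (p, d) estimated basis; row j is treated as selected when its
               Euclidean norm exceeds tol
    active   : indices of the truly active covariates (assumed known, as in
               a simulation); must be neither empty nor all of 0..p-1
    """
    p = beta_hat.shape[0]
    selected = np.linalg.norm(beta_hat, axis=1) > tol
    truth = np.zeros(p, dtype=bool)
    truth[list(active)] = True
    tpr = np.sum(selected & truth) / np.sum(truth)
    fpr = np.sum(selected & ~truth) / np.sum(~truth)
    return float(tpr), float(fpr)

# Toy example: covariates 0 and 2 are active; the estimate also picks covariate 3.
beta_hat = np.array([[0.7], [0.0], [-0.7], [0.1], [0.0]])
tpr, fpr = selection_rates(beta_hat, active=[0, 2])
```

Here both active covariates are selected and one of the three inactive covariates is selected by mistake, so the true positive rate is 1 and the false positive rate is 1/3.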
Persson et al. [13] proposed a variable selection procedure that estimates two active sets, each for an individual outcome regression. Naturally, the union of the two sets can be treated as an estimate of the active set for the regression causal effect, although their estimate can contain redundant covariates, for example, in Models 3 and 4. The difference Lasso approach [4] and the virtual twins method [10] first impute the missing outcomes using nonparametric techniques and then use the imputed
Table 2. Accuracy of variable selection methods. Each cell in the column TPR (FPR) is the average true (false) positive rate of the variable selection method, over 200 simulation samples. “S-ENS” stands for the sparse ensemble moment-based estimator, “VT” for the virtual twins method, “dLasso” for the difference Lasso approach, and “PHWD” for the method based on Persson et al. [13].
| Model | S-ENS TPR | S-ENS FPR | VT TPR | VT FPR | dLasso TPR | dLasso FPR | PHWD TPR | PHWD FPR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.000 | 0.000 | 1.000 | 0.007 | 1.000 | 0.388 | 1.000 | 0.433 |
| 2 | 1.000 | 0.008 | 1.000 | 0.000 | 1.000 | 0.485 | 1.000 | 0.434 |
| 3 | 1.000 | 0.099 | 1.000 | 0.185 | 0.530 | 0.290 | 1.000 | 0.443 |
| 4 | 1.000 | 0.000 | 1.000 | 0.000 | 0.702 | 0.306 | 1.000 | 0.361 |
Due to the use of the Lasso method, the difference Lasso approach favors a linear regression causal effect. This is supported by the results in Table 2, which show that the method performs poorly in Models 3 and 4. The approach based on Persson et al. [13] has desirable sensitivity in all the models, but its specificity is worrisome by nature. By contrast, both the virtual twins method and the sparse ensemble moment-based estimator consistently select the exact active set in all the models, with the former slightly outperformed by the latter in Model 3.
7.3 Testing the heterogeneity
We now evaluate the proposed
Model 5.
Model 6.
Model 7.
The distribution of
We perform the proposed
As mentioned in § 1, Crump et al. [11] proposed both a normal test and a
Percentage of correct decision made by each test. “SDR
| Test / Model | 1 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- |
| SDR | 93.3 | 93.1 | 99.9 | 98.7 | 96.7 | 99.7 |
| r-Normal | 100 | 58.2 | 100 | 89.1 | 23.9 | 8.8 |
| r-$\chi^{2}$ | 100 | 53.8 | 100 | 92.5 | 27.2 | 10.0 |
| Normal | 100 | 79.4 | 100 | 91.3 | 4.7 | 4.0 |
| $\chi^{2}$ | 100 | 76.5 | 100 | 94.5 | 5.2 | 5.4 |
From Table 3, the proposed
To take a closer look at the performance of the tests, in Figure 1, we draw the box-plot of p-values for each model and each test, except for the normal tests as they perform similarly to Crump et al.’s
![Figure 1](/document/doi/10.1515/jci-2018-0015/asset/graphic/j_jci-2018-0015_fig_001.jpg)
Figure 1. Boxplot of p-values among 1000 replications. The left, middle, and right boxes represent p-values of the robust and the ordinary $\chi^{2}$ tests in Crump et al. [11], and the proposed $\chi^{2}$ test, respectively.
7.4 Data analysis
We analyzed the data from the Health Evaluation and Linkage to Primary Care study, publicly available with the approval of the Institutional Review Board of Boston University Medical Center and the Department of Health and Human Services. The data set contains 453 patients recruited from a detoxification unit, who possibly spent at least one night on the street or in a shelter within six months before entering the study, in which case the patient is marked as homeless. Our interest is to estimate the causal effect of the homeless experience on patients’ physical health condition, measured at study entry by the SF-36 physical component score, with higher scores indicating better functioning.
To make the unconfoundedness assumption plausible, we included all the covariates collected in the data that do not have many missing values, some of which were transformed to favor the linearity condition (5) and the constant variance condition (6). They are: age at baseline, a scale indicating depressive symptoms, the square root of the number of friends, the square of a total score of inventory of drug use consequences, the square root of a sex risk score, gender, the average and the maximum number of drinks consumed per day in the past month, and the number of times hospitalized for medical problems. All nine covariates were standardized to have zero mean and unit variance.
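The preprocessing described above, transforming selected covariates and then standardizing every column, can be sketched as follows. The variable names and the simulated values are hypothetical placeholders and do not come from the study data.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical raw covariates standing in for three of the nine variables.
rng = np.random.default_rng(0)
age = rng.normal(36, 8, size=200)
friends = rng.poisson(4, size=200)
drug_score = rng.uniform(0, 45, size=200)

# Square-root and square transforms are applied before standardization.
X = np.column_stack([age, np.sqrt(friends), drug_score ** 2])
Z = standardize(X)
```

Transforming before standardizing matters: the transforms change the shape of each marginal distribution, whereas standardization only changes location and scale.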
By applying the proposed test, we detected that the regression causal effect is heterogeneous with p-value 0.04. The sequential tests [21] further suggested that
Thus, gender is the dominating factor for the causal effect of homeless experience, and the number of friends and the sex risk also have an effect. Cross-validation [17], [5] showed that both
To illustrate the sufficiency and effectiveness of the univariate reduced covariate, we generated a pseudo

Homeless data with reduced covariates. The left (right) panel is the scatter plot of the imputed
8 Discussion
When a set of appropriate parametric models is not available for the propensity score
In these cases, it is easily seen that the ensemble moment-based estimator is still consistent but its asymptotic normality (10) fails, so the inferential results must be adjusted. For the variable selection consistency, the order of the tuning parameter θ in Theorem 3 needs to be adjusted according to the convergence order of
As mentioned earlier, an alternative that avoids estimating the propensity score is to estimate the regression causal effect as the difference between the outcome regression functions, the latter being estimated semiparametrically [5]. Following the literature, one may think of constructing a doubly-robust estimator that combines the two approaches. However, in contrast to the case where the parameter of interest is the average causal effect and both the propensity score and the outcome regression functions are estimated parametrically, such an estimator will always inherit the drawbacks of the outcome regression-based estimator, i.e., the estimation of the nuisance functional parameters mentioned in § 1 and the redundant directions in
Funding source: Social Sciences and Humanities Research Council of Canada
Award Identifier / Grant number: 430-2016-00163
Funding source: Natural Sciences and Engineering Research Council of Canada
Award Identifier / Grant number: RGPIN-2017-04064
Funding statement: Dr. Zhu’s research was supported by Award Number 430-2016-00163 from the Social Sciences and Humanities Research Council and by Grant Number RGPIN-2017-04064 from the Natural Sciences and Engineering Research Council of Canada.
Appendix A
A.1 Proof of Theorem 1
A.2 Proof of Theorem 2
Proof.
We first show the asymptotic normality of
which we denote by
Because
and that by the central limit theorem,
References
1. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701. doi:10.1037/h0037350
2. Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–512. doi:10.1016/0270-0255(86)90088-6
3. Snowden JM, Rose S, Mortimer KM. Implementation of G-computation on a simulated data set: demonstration of a causal inference technique. Am J Epidemiol. 2011;173:731–8. doi:10.1093/aje/kwq472
4. Ghosh D, Zhu Y, Coffman DL. Penalized regression procedures for variable selection in the potential outcomes framework. Stat Med. 2015;34:1645–58. doi:10.1002/sim.6433
5. Luo W, Zhu Y, Ghosh D. On estimating regression-based causal effects using sufficient dimension reduction. Biometrika. 2017;104:51–65. doi:10.1093/biomet/asw068
6. Tian L, Alizadeh AA, Gentles AJ, Tibshirani R. A simple method for estimating interactions between a treatment and a large number of covariates. J Am Stat Assoc. 2014;109:1517–32. doi:10.1080/01621459.2014.951443
7. Abrevaya J, Hsu Y-C, Lieli RP. Estimating conditional average treatment effects. J Bus Econ Stat. 2015;33:485–505. doi:10.1080/07350015.2014.975555
8. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. Springer; 2004. p. 189–326. doi:10.1007/978-1-4419-9076-1_11
9. Wallace MP, Moodie EE. Doubly-robust dynamic treatment regimen estimation via weighted least squares. Biometrics. 2015;71:636–44. doi:10.1111/biom.12306
10. Foster JC, Taylor JMG, Ruberg SJ. Subgroup identification from randomized clinical trial data. Stat Med. 2011;30:2867–80. doi:10.1002/sim.4322
11. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Nonparametric tests for treatment effect heterogeneity. Rev Econ Stat. 2008;90:389–405. doi:10.3386/t0324
12. Imai K, Ratkovic M. Estimating treatment effect heterogeneity in randomized program evaluation. Ann Appl Stat. 2013;7:443–70. doi:10.1214/12-AOAS593
13. Persson E, Häggström J, Waernbaum I, de Luna X. Data-driven algorithms for dimension reduction in causal inference. Comput Stat Data Anal. 2017;105:280–92. doi:10.1016/j.csda.2016.08.012
14. Cook RD, Li B. Dimension reduction for conditional mean in regression. Ann Stat. 2002;30:455–74. doi:10.1214/aos/1021379861
15. Li K-C, Duan N. Regression analysis under link violation. Ann Stat. 1989;17:1009–52. doi:10.1214/aos/1176347254
16. Li K-C. On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. J Am Stat Assoc. 1992;87:1025–39. doi:10.1080/01621459.1992.10476258
17. Xia Y, Tong H, Li WK, Zhu L-X. An adaptive estimation of dimension reduction space. J R Stat Soc, Ser B, Stat Methodol. 2002;64:363–410. doi:10.1142/9789812836281_0023
18. Luo W, Li B, Yin X. On efficient dimension reduction with respect to a statistical functional of interest. Ann Stat. 2014;42:382–412. doi:10.1214/13-AOS1195
19. Ma Y, Zhu L. On estimation efficiency of the central mean subspace. J R Stat Soc, Ser B, Stat Methodol. 2014;76:885–901. doi:10.1111/rssb.12044
20. Hall P, Li K-C. On almost linearity of low dimensional projections from high dimensional data. Ann Stat. 1993;21:867–89. doi:10.1214/aos/1176349155
21. Bura E, Yang J. Dimension estimation in sufficient dimension reduction: a unifying approach. J Multivar Anal. 2011;102:130–42. doi:10.1016/j.jmva.2010.08.007
22. Zhu L, Miao B, Peng H. On sliced inverse regression with high-dimensional covariates. J Am Stat Assoc. 2006;101:630–42. doi:10.1198/016214505000001285
23. Luo W, Li B. Combining eigenvalues and variation of eigenvectors for order determination. Biometrika. 2016;103:875–87. doi:10.1093/biomet/asw051
24. Ghosh D. Propensity score modelling in observational studies using dimension reduction methods. Stat Probab Lett. 2011;81:813–20. doi:10.1016/j.spl.2011.03.002
25. Hu Z, Follmann DA, Wang N. Estimation of mean response via the effective balancing score. Biometrika. 2014;101:613–24. doi:10.1093/biomet/asu022
26. Huang M-Y, Chan KCG. Joint sufficient dimension reduction and estimation of conditional and average treatment effects. Biometrika. 2017;104:583–96. doi:10.1093/biomet/asx028
27. Imai K, Ratkovic M. Covariate balancing propensity score. J R Stat Soc B. 2014;76:243–63. doi:10.1111/rssb.12027
28. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:1–21. doi:10.2202/1544-6115.1309
29. Chen X, Zou C, Cook RD. Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann Stat. 2010;38:3696–723.
Supplemental Material
The online version of this article offers supplementary material (https://doi.org/10.1515/jci-2018-0015).
© 2019 Walter de Gruyter GmbH, Berlin/Boston
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.