1 Introduction
A common goal of epidemiologic research is to study the association between a certain exposure A and a certain outcome Y. Typically, it is desirable to control for covariates L in the analysis. This, for instance, is the case when the covariates are potential confounders for the exposure-outcome association, or when the covariates are potential mediators and the aim is to study the direct exposure effect. A common tool for covariate control is the restricted mean model
The model in eq. [1] can be decomposed into two parts:
To protect the main model against bias due to misspecification of the outcome nuisance model, doubly robust (DR) estimators have been proposed (e.g. Robins et al., 1992; Robins and Rotnitzky, 2001; Bang and Robins, 2005; Tchetgen Tchetgen et al., 2010). These estimators combine an outcome nuisance model with an exposure nuisance model, and are unbiased for the parameters in the main model if either nuisance model is correctly specified, not necessarily both. Thus, DR estimators give the researcher two chances instead of one to make valid inference on the parameters of main interest.
Despite their obvious appeal, DR estimators are not used on a regular basis in applied epidemiologic research. One reason for this could be the lack of up-to-date software. To remedy this deficiency we have implemented an
Three broad classes of estimation methods are implemented by the
In this paper we describe the
2 O-estimation
2.1 Theory
We assume that data consist of iid observations of
2.2 Example 1
To demonstrate how the
Suppose that we wish to use these data to study whether there is a direct effect of sex on wages, not mediated through education level. Such a direct effect is clearly of substantive interest, since it would be an indication of sex discrimination. To eliminate the mediated effect we wish to control for education level. However, to avoid bias we must then additionally control for covariates that can be confounders for the mediator and the outcome (see Valeri and VanderWeele, 2013 and references therein). It is obvious that age can be such a mediator-outcome confounder, since age is likely to be associated with both education level and wages. Native language may also be a mediator-outcome confounder, by being associated with education level and wages through ethnicity and socio-economic status. We thus use a model for the mean of
> library(car)
> library(drgee)
> fit <- drgee(oformula=wages~education + age + language,
+   exposure="sex", estimation.method="o", data=SLID)
By setting the
> summary(fit)
which gives us the output
Call: drgee(exposure = "sex", oformula = wages ~ education + age + language,
    data = SLID, estimation.method = "o")
Outcome: wages
Exposure: sexMale
Covariates: education,age,languageFrench,languageOther
Main model: wages ~ sexMale
Outcome nuisance model: wages ~ education + age + languageFrench + languageOther
Outcome link function: identity

        Estimate Std. Error z value Pr(>|z|)
sexMale   3.4554     0.2091   16.53   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
3987 complete observations used
Only the results for the main parameters are shown in the output. Since
2.3 Example 2
The linear and interaction-free model in Example 1 is simple, but may not be entirely realistic. Wage distributions are often right-skewed, and therefore a linear model may not fit the data very well. Furthermore, inference for direct effects may be misleading if significant exposure-mediator interactions are omitted from the model (see Valeri and VanderWeele, 2013 and references therein).
Therefore, suppose that we want to use a log-linear model instead, including an interaction between sex and education. We then replace the main model [7] with
> fit <- drgee(oformula=wages~education+age+language,
+   exposure="sex", iaformula=~education, olink="log",
+   estimation.method="o", data=SLID)
The
Call: drgee(exposure = "sex", oformula = wages ~ education + age + language,
    iaformula = ~education, olink = "log", data = SLID,
    estimation.method = "o")
Outcome: wages
Exposure: sexMale
Covariates: education,age,languageFrench,languageOther
Main model: wages ~ sexMale + sexMale:education
Outcome nuisance model: wages ~ education + age + languageFrench + languageOther
Outcome link function: log

                   Estimate Std. Error z value Pr(>|z|)
sexMale            0.581088   0.062848   9.246  < 2e-16 ***
sexMale:education -0.026175   0.004511  -5.802 6.54e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
3987 complete observations used
We observe that both the main effect of sex and the sex-education interaction are highly significant.
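To make the log-link coefficients concrete, the following sketch converts the estimates reported above into male-to-female mean wage ratios at a few education levels. It assumes that, under the log link, exp(b_sex + b_int × education) is the conditional mean wage ratio given the covariates. The names `b_sex`, `b_int` and `wage_ratio` are illustrative and not part of the package.

```r
# Point estimates copied from the summary output above (log link)
b_sex <- 0.581088   # sexMale
b_int <- -0.026175  # sexMale:education

# Hypothetical helper: male-to-female mean wage ratio at a given education
# level, assuming the ratio equals exp(b_sex + b_int * education)
wage_ratio <- function(education) exp(b_sex + b_int * education)

education_years <- c(8, 12, 16, 20)
ratios <- wage_ratio(education_years)
print(data.frame(education = education_years, ratio = round(ratios, 3)))

# Education level at which the estimated ratio equals 1
break_even <- -b_sex / b_int
print(round(break_even, 1))
```

Because the interaction estimate is negative, the estimated ratio decreases with education; under the fitted model it reaches 1 at roughly 22 years of education.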
3 E-estimation
In E-estimation we use the main model [5], as in O-estimation. However, in contrast to O-estimation, E-estimation leaves the covariate part
3.1 Theory: E-estimation when g is the identity or log link function
When g is the identity or log link function, we use a model for the mean of the exposure conditional on the covariates:
3.2 Example 3
Continuing Example 2 (Section 2.3), suppose that we want to combine the main model [9] with the exposure nuisance model
> fit <- drgee(outcome="wages",
+   eformula=sex~education + age + language,
+   iaformula=~education, olink="log", elink="logit",
+   estimation.method="e", data=SLID)
By setting the
Call: drgee(outcome = "wages", eformula = sex ~ education + age + language,
    iaformula = ~education, olink = "log", elink = "logit",
    data = SLID, estimation.method = "e")
Outcome: wages
Exposure: sexMale
Covariates: education,age,languageFrench,languageOther
Main model: wages ~ sexMale + sexMale:education
Outcome link function: log
Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther
Exposure link function: logit

                   Estimate Std. Error z value Pr(>|z|)
sexMale            0.370139   0.064773   5.714  1.1e-08 ***
sexMale:education -0.010613   0.004738  -2.240   0.0251 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
3987 complete observations used
We observe that the obtained E-estimates are quite different from the O-estimates (see Section 2.3). This indicates that at least one of the nuisance models [10] and [13] is misspecified.
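The sensitivity of E-estimation to the exposure nuisance model can be illustrated with a small base-R sketch on simulated data. It is not the package's implementation: it covers only the simplest case (identity link, scalar binary exposure, no interactions) and assumes the estimating equation sum((A - e) * (Y - beta * A)) = 0, where e is the fitted exposure mean. The data-generating model and the helper `e_estimate` are hypothetical.

```r
set.seed(1)
expit <- function(x) 1 / (1 + exp(-x))

# Simulated data (illustrative only): the true exposure effect is 1.5,
# and both exposure and outcome depend on the covariate through L^2
n <- 20000
L <- rnorm(n)
A <- rbinom(n, 1, expit(-1 + L^2))
Y <- 1.5 * A + L^2 + rnorm(n)

# Hypothetical helper: solve sum((A - e) * (Y - beta * A)) = 0 for beta
e_estimate <- function(Y, A, e) sum((A - e) * Y) / sum((A - e) * A)

# Exposure nuisance model, correctly specified (depends on L^2)
e_ok <- fitted(glm(A ~ I(L^2), family = binomial))
beta_e_ok <- e_estimate(Y, A, e_ok)

# Exposure nuisance model, misspecified (linear in L)
e_bad <- fitted(glm(A ~ L, family = binomial))
beta_e_bad <- e_estimate(Y, A, e_bad)

print(round(c(correct = beta_e_ok, misspecified = beta_e_bad), 3))
```

No outcome nuisance model is fitted at any point: with the correct exposure model the estimate should land near 1.5, whereas the misspecified exposure model leaves the confounding through L^2 uncorrected.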
3.3 Theory: E-estimation when g is the logit link function
When g is the logit link function and main model [5] holds,
3.4 Example 4
To perform E-estimation with logit outcome link function, we need both outcome and exposure to be binary. Suppose that we recode
> SLID$highWage <- ifelse(SLID$wages <= 14, 0, 1)
We can then use the logistic main model
> fit <- drgee(outcome = "highWage", eformula = sex~education + age + language,
+   iaformula = ~education, olink = "logit", elink = "logit",
+   estimation.method = "e", data = SLID)
> summary(fit)
Call: drgee(outcome = "highWage", eformula = sex ~ education + age + language,
    iaformula = ~education, olink = "logit", elink = "logit",
    data = SLID, estimation.method = "e")
Outcome: highWage
Exposure: sexMale
Covariates: education,age,languageFrench,languageOther
Main model: highWage ~ sexMale + sexMale:education
Outcome link function: logit
Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther
Exposure link function: logit

                  Estimate Std. Error z value Pr(>|z|)
sexMale            1.93520    0.30980   6.247  4.2e-10 ***
sexMale:education -0.06361    0.02287  -2.782  0.00541 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
3987 complete observations used
Again, both main effect and interaction are highly significant.
4 DR-estimation
In DR-estimation, we combine a main model with an outcome nuisance model and an exposure nuisance model to construct an estimator
4.1 Theory: DR-estimation when g is the identity or log link function
Let
4.2 Example 5
Continuing Example 2 (Section 2.3) and Example 3 (Section 3.2), suppose that we want to combine the main model [9] with the outcome nuisance model [10] and the exposure nuisance model [13]. To use DR-estimation based on these models we type:
> fit <- drgee(oformula=wages~education+age+language,
+   eformula=sex~education+age+language,
+   iaformula=~education, olink="log", elink="logit",
+   estimation.method="dr", data=SLID)
By setting the
> summary(fit)
Call: drgee(oformula = wages ~ education + age + language,
    eformula = sex ~ education + age + language, iaformula = ~education,
    olink = "log", elink = "logit", data = SLID, estimation.method = "dr")
Outcome: wages
Exposure: sexMale
Covariates: education,age,languageFrench,languageOther
Main model: wages ~ sexMale + sexMale:education
Outcome nuisance model: wages ~ education + age + languageFrench + languageOther
Outcome link function: log
Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther
Exposure link function: logit

                  Estimate Std. Error z value Pr(>|z|)
sexMale            0.57752    0.06333   9.119  < 2e-16 ***
sexMale:education -0.02591    0.00455  -5.696 1.23e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
3987 complete observations used
We observe that the DR estimates are very similar to the O-estimates in Example 2 (Section 2.3), but quite different from the E-estimates in Example 3 (Section 3.2). This indicates that the exposure nuisance model may not be well specified. However, it does not prove that the outcome nuisance model is correct; we demonstrate in Section 6.1 that all methods may agree well even though both nuisance models are misspecified.
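The double robustness property itself can be checked numerically with a base-R sketch on simulated data. This is a simplified illustration for the identity link with a scalar binary exposure, not the package's code: it assumes the estimating equation sum((A - e) * (Y - beta * A - q)) = 0, where e is the fitted exposure mean and q is the fitted covariate part of the outcome model. The data-generating model and the helper `dr_estimate` are hypothetical.

```r
set.seed(2)
expit <- function(x) 1 / (1 + exp(-x))

# Simulated data (illustrative only): the true exposure effect is 1.5
n <- 20000
L <- rnorm(n)
A <- rbinom(n, 1, expit(-1 + L^2))
Y <- 1.5 * A + L^2 + rnorm(n)

# Hypothetical helper: DR estimate for the identity link. It uses the vectors
# A and Y defined above and solves
#   sum((A - e) * (Y - beta * A - q)) = 0
# with e from the exposure model and q the covariate part of the outcome model
dr_estimate <- function(outcome_fit, exposure_fit) {
  e <- fitted(exposure_fit)
  q <- fitted(outcome_fit) - coef(outcome_fit)[["A"]] * A
  sum((A - e) * (Y - q)) / sum((A - e) * A)
}

out_ok  <- lm(Y ~ A + I(L^2))                    # correct outcome model
out_bad <- lm(Y ~ A + L)                         # misspecified outcome model
exp_ok  <- glm(A ~ I(L^2), family = binomial)    # correct exposure model
exp_bad <- glm(A ~ L, family = binomial)         # misspecified exposure model

dr <- c(both_correct   = dr_estimate(out_ok,  exp_ok),
        outcome_wrong  = dr_estimate(out_bad, exp_ok),
        exposure_wrong = dr_estimate(out_ok,  exp_bad),
        both_wrong     = dr_estimate(out_bad, exp_bad))
print(round(dr, 3))
```

In this setup the first three estimates should be close to 1.5 and only the last one should be clearly off, which is the pattern one expects from a doubly robust estimator.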
4.3 Theory: DR-estimation when g is the logit link function
When g is the logit link function, DR-estimation can be performed by the
4.4 Example 6
Continuing Example 4 (Section 3.4), suppose that we want to combine main model [15], the exposure nuisance model [16] and the logistic outcome nuisance model
> fit <- drgee(oformula=highWage~education+age+language,
+   eformula=sex~education+age+language, iaformula=~education,
+   olink="logit", elink="logit", estimation.method="dr", data=SLID)
> summary(fit)
Call: drgee(oformula = highWage ~ education + age + language,
    eformula = sex ~ education + age + language, iaformula = ~education,
    olink = "logit", elink = "logit", data = SLID, estimation.method = "dr")
Outcome: highWage
Exposure: sexMale
Covariates: education,age,languageFrench,languageOther
Main model: highWage ~ sexMale + sexMale:education
Outcome nuisance model: highWage ~ education + age + languageFrench + languageOther
Outcome link function: logit
Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther
Exposure link function: logit

                  Estimate Std. Error z value Pr(>|z|)
sexMale             2.9050     0.4015   7.236 4.62e-13 ***
sexMale:education  -0.1341     0.0295  -4.547 5.44e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
3987 complete observations used
We observe that the DR estimates are quite different from the E-estimates in Example 4 (Section 3.4). This indicates that the exposure nuisance model [16] may not be well specified.
5 Estimation with clustered data
5.1 Theory
When data are clustered, the parameter estimates obtained in Sections 2–4 are still consistent. However, their standard errors must be corrected for within-cluster correlations. Suppose that data contain m independent clusters with
5.2 Example 7
To demonstrate estimation with clustered data, we use the dataset
the outcome nuisance model
and the exposure nuisance model
we can perform DR-estimation with cluster-corrected standard errors by setting the argument
> library(geepack)
> data(ohio)
> fit <- drgee(oformula=resp~age, eformula=smoke~age,
+   olink="logit", elink="logit", estimation.method="dr",
+   data=ohio, clusterid="id")
> summary(fit)
Call: drgee(oformula = resp ~ age, eformula = smoke ~ age, olink = "logit",
    elink = "logit", data = ohio, estimation.method = "dr",
    clusterid = "id")
Outcome: resp
Exposure: smoke
Covariates: age
Main model: resp ~ smoke
Outcome nuisance model: resp ~ age
Outcome link function: logit
Exposure nuisance model: smoke ~ age
Exposure link function: logit

      Estimate Std. Error z value Pr(>|z|)
smoke   0.2721     0.1781   1.528    0.127

(Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models)
2148 complete observations used
Cluster-robust Std. errors using 537 clusters defined by levels of id
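The idea behind cluster-robust standard errors can be sketched in base R: the estimating functions are summed within each cluster before the "meat" of the sandwich variance is formed. The sketch below does this for an ordinary least squares fit on simulated clustered data. It is an illustration under these assumptions, not the package's internal code, and all variable names are hypothetical.

```r
set.seed(3)

# Simulated clustered data (illustrative only): 200 clusters of size 4, with a
# shared cluster-level random intercept inducing within-cluster correlation
m <- 200
k <- 4
id <- rep(seq_len(m), each = k)
x <- rep(rnorm(m), each = k)                      # cluster-level covariate
y <- 1 + 0.5 * x + rep(rnorm(m), each = k) + rnorm(m * k)

fit <- lm(y ~ x)
X <- model.matrix(fit)
U <- X * residuals(fit)       # estimating function, one row per observation

bread <- solve(crossprod(X))  # inverse of the summed derivative matrix

# Naive meat: treats all observations as independent
meat_naive <- crossprod(U)

# Cluster meat: sum the estimating functions within clusters first
U_cluster <- rowsum(U, group = id)
meat_cluster <- crossprod(U_cluster)

se_naive   <- sqrt(diag(bread %*% meat_naive   %*% bread))
se_cluster <- sqrt(diag(bread %*% meat_cluster %*% bread))
print(round(rbind(naive = se_naive, cluster = se_cluster), 4))
```

The point estimates are identical in both cases; only the standard errors change. With positively correlated observations and a cluster-level covariate, as here, the cluster-corrected standard error for the slope is expected to be larger than the naive one.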
6 Simulation studies
To further demonstrate the capabilities of the
6.1 Simulation study 1
In the first simulation we generated data from the following model:
Under this model, we generated 1,000 samples with 500 observations each. Each sample was analyzed with O-estimation, E-estimation and DR-estimation. For each estimation method we carried out four analyses. In the first analysis,
In the third analysis we used the misspecified exposure nuisance model
In the fourth analysis we used both misspecified nuisance models [20] and [21]. For each estimation method, we calculated the mean (over the 1,000 samples) estimate of
Table 1. Comparison of estimation methods. I: both nuisance models correctly specified, II: outcome nuisance model misspecified, III: exposure nuisance model misspecified, IV: both nuisance models misspecified. O: O-estimation, E: E-estimation, DR: DR-estimation.

| Analysis | Method | Mean estimate | Mean standard error | Empirical standard error | Empirical coverage probability of 95% CI |
|---|---|---|---|---|---|
| I | O | 1.497 | 0.106 | 0.107 | 0.941 |
| | E | 1.503 | 0.167 | 0.173 | 0.946 |
| | DR | 1.497 | 0.107 | 0.109 | 0.939 |
| II | O | 0.522 | 0.200 | 0.205 | 0.004 |
| | E | 1.503 | 0.167 | 0.173 | 0.946 |
| | DR | 1.503 | 0.167 | 0.173 | 0.946 |
| III | O | 1.497 | 0.106 | 0.107 | 0.941 |
| | E | 0.519 | 0.200 | 0.205 | 0.003 |
| | DR | 1.497 | 0.106 | 0.108 | 0.940 |
| IV | O | 0.522 | 0.200 | 0.205 | 0.004 |
| | E | 0.519 | 0.200 | 0.205 | 0.003 |
| | DR | 0.519 | 0.200 | 0.205 | 0.003 |
In the first analysis, all three mean estimates are close to the true value 1.5. This demonstrates that all three methods give unbiased estimates of
6.2 Simulation study 2
In the second simulation we generated data from the following model:
Under this model, we generated 1,000 samples with 500 observations each. Each sample was analyzed with O-estimation, E-estimation and DR-estimation. For each estimation method we carried out four analyses. In the first analysis, both outcome and exposure nuisance models were correct. In the second analysis, we used the misspecified outcome nuisance model
In the third analysis, we used the misspecified exposure nuisance model
In the fourth analysis we used both misspecified nuisance models [22] and [23]. We calculated the same summary statistics as in the first simulation. The results are shown in Table 2.
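For readers who want to reproduce this kind of summary, the following base-R sketch shows how the four statistics (mean estimate, mean estimated standard error, empirical standard error and empirical coverage of 95% confidence intervals) can be computed over simulated samples. The data-generating model here is a simple hypothetical linear one, analyzed with a correctly specified `lm` fit and model-based standard errors; it is not the simulation design used in this paper.

```r
set.seed(4)

n_sim <- 1000   # number of simulated samples
n_obs <- 500    # observations per sample
beta_true <- 1.5

# One replication: simulate data, fit a correctly specified linear model, and
# return the exposure estimate with its estimated standard error
one_replication <- function() {
  L <- rnorm(n_obs)
  A <- rbinom(n_obs, 1, plogis(L))
  Y <- beta_true * A + L + rnorm(n_obs)
  coef(summary(lm(Y ~ A + L)))["A", c("Estimate", "Std. Error")]
}

res <- t(replicate(n_sim, one_replication()))
est <- res[, "Estimate"]
se  <- res[, "Std. Error"]

# Does the 95% Wald confidence interval cover the true value?
covered <- abs(est - beta_true) <= qnorm(0.975) * se

summary_stats <- c(mean_estimate = mean(est),
                   mean_se       = mean(se),
                   empirical_se  = sd(est),
                   coverage      = mean(covered))
print(round(summary_stats, 3))
```

When the estimator is unbiased and its standard error is well calibrated, the mean estimate is close to the true value, the mean estimated and empirical standard errors agree, and the coverage is close to 0.95.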
Table 2. Comparison of estimation methods. I: both nuisance models correctly specified, II: outcome nuisance model misspecified, III: exposure nuisance model misspecified, IV: both nuisance models misspecified. O: O-estimation, E: E-estimation, DR: DR-estimation. Each method has one row per main model parameter.

| Analysis | Method | Mean estimate | Mean estimated standard error | Empirical standard error | Empirical coverage probability of 95% CI |
|---|---|---|---|---|---|
| I | O | 1.528 | 0.269 | 0.266 | 0.961 |
| | O | 1.020 | 0.283 | 0.283 | 0.940 |
| | E | 1.534 | 0.275 | 0.272 | 0.952 |
| | E | 1.034 | 0.330 | 0.337 | 0.948 |
| | DR | 1.538 | 0.280 | 0.280 | 0.958 |
| | DR | 1.039 | 0.386 | 0.394 | 0.950 |
| II | O | 0.777 | 0.241 | 0.240 | 0.158 |
| | O | 1.278 | 0.248 | 0.251 | 0.819 |
| | E | 1.534 | 0.275 | 0.272 | 0.952 |
| | E | 1.034 | 0.330 | 0.337 | 0.948 |
| | DR | 1.538 | 0.281 | 0.277 | 0.960 |
| | DR | 1.038 | 0.372 | 0.379 | 0.951 |
| III | O | 1.528 | 0.269 | 0.266 | 0.961 |
| | O | 1.020 | 0.283 | 0.283 | 0.940 |
| | E | 0.720 | 0.245 | 0.243 | 0.119 |
| | E | 1.502 | 0.338 | 0.358 | 0.711 |
| | DR | 1.531 | 0.276 | 0.273 | 0.961 |
| | DR | 1.052 | 0.379 | 0.395 | 0.949 |
| IV | O | 0.777 | 0.241 | 0.240 | 0.158 |
| | O | 1.278 | 0.248 | 0.251 | 0.819 |
| | E | 0.720 | 0.245 | 0.243 | 0.119 |
| | E | 1.502 | 0.338 | 0.358 | 0.711 |
| | DR | 0.785 | 0.248 | 0.245 | 0.187 |
| | DR | 1.057 | 0.333 | 0.348 | 0.939 |
The results in Table 2 again demonstrate that the DR estimator is indeed doubly robust, i.e. that it is unbiased as long as at least one of the nuisance models is correctly specified. For the main effect
7 Discussion
In this paper we have summarized the theory behind O-estimation, E-estimation and DR estimation in restricted mean models. We have also described, through practical examples, how the
There are other
Epidemiology is a rapidly evolving field, and it is desirable that applied epidemiologists use the best methods available when analyzing data. However, in our experience epidemiologists often resort to suboptimal standard methods, due to the lack of up-to-date software. We believe that the
In this Appendix we briefly review some basic theory for estimating equations; we refer to Newey and McFadden (1994) for a more detailed exposition.
Suppose that we are interested in a parameter
for
The inner element on the right-hand side is referred to as the “meat”, and the outer elements are referred to as the “bread”. By substituting
Note that the asymptotic properties of the estimator
References
Bang, H., and Robins, J. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–973.
Fox, J., and Weisberg, S. (2011). An R Companion to Applied Regression. 2nd Edition. Thousand Oaks, CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
Højsgaard, S., Halekoh, U., and Yan, J. (2006). The R package geepack for generalized estimating equations. Journal of Statistical Software, 15:1–11.
Newey, W., and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.
Robins, J., Mark, S., and Newey, W. (1992). Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48:479–495.
Robins, J., and Rotnitzky, A. (2001). Comment on ‘inference for semiparametric models: Some questions and an answer,’ by Bickel and Kwon. Statistica Sinica, 11:920–936.
Tchetgen Tchetgen, E., Robins, J., and Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97:171–180.
Tchetgen Tchetgen, E., and Rotnitzky, A. (2011). Double-robust estimation of an exposure-outcome odds ratio adjusting for confounding in cohort and case-control studies. Statistics in Medicine, 30:335–347.
Valeri, L., and VanderWeele, T. (2013). Mediation analysis allowing for exposure–mediator interactions and causal interpretation: Theoretical assumptions and implementation with SAS and SPSS macros. Psychological Methods, 18:137.
Zetterqvist, J., and Sjölander, A. (2015). drgee: doubly robust generalized estimating equations. http://CRAN.R-project.org/package=drgee, version 1.1.3.
