## Abstract

A common goal of epidemiologic research is to study the association between a certain exposure and a certain outcome, while controlling for important covariates. This is often done by fitting a restricted mean model for the outcome, as in generalized linear models (GLMs) and in generalized estimating equations (GEEs). If the covariates are high-dimensional, then it may be difficult to well specify the model. This is an important concern, since model misspecification may lead to biased estimates. Doubly robust estimation is an estimation technique that offers some protection against model misspecification. It utilizes two models, one for the outcome and one for the exposure, and produces unbiased estimates of the exposure-outcome association if either model is correct, not necessarily both. Despite its obvious appeal, doubly robust estimation is not used on a regular basis in applied epidemiologic research. One reason for this could be the lack of up-to-date software. In this paper we describe a new `R` package, `drgee`, which carries out doubly robust estimation in restricted mean models. The package is constructed to be user-friendly and fast, to facilitate routine use of doubly robust estimation. The paper is structured into theory sections and example sections. The former are intended to serve as a brief but self-consistent tutorial in doubly robust estimation. The latter illustrate the use of the `drgee` package through practical examples. We have used publically available data throughout the paper, so that the reader can easily replicate all examples.

## 1 Introduction

A common goal of epidemiologic research is to study the association between a certain exposure *A* and a certain outcome *Y*. Typically, it is desirable to control for covariates *L* in the analysis. This, for instance, is the case when the covariates are potential confounders for the exposure-outcome association, or when the covariates are potential mediators and the aim is to study the direct exposure effect. A common tool for covariate control is the restricted mean model

where *Y* is a scalar outcome and *g* is a suitable link function. Special cases include the linear, log-linear, and logistic models, which are typically used for continuous, “count”, and binary outcomes, respectively. The restricted mean model can be fitted by solving a set of GEEs. Alternatively, one may assume that *Y* has an exponential family distribution with canonical link function *g*, and fit the model by solving the GLM maximum likelihood score equations.

The model in eq. [1] can be decomposed into two parts:

and

The part in eq. [2] quantifies the conditional association between *A* and *Y*, given *L*. This part is usually of main interest; we thus refer to it as the “main model”. The parameter in the main model,

To protect the main model against bias bue to misspecification of the outcome nuisance model, doubly robust (DR) estimators have been proposed (e.g. Robins et al., 1992; Robins and Rotnitzky, 2001; Bang and Robins, 2005; Tchetgen Tchetgen et al., 2010). These estimators combine an outcome nuisance model with an exposure nuisance model, and are unbiased for the parameters in the main model if either nuisance model is correctly specified, not necessarily both. Thus, DR estimators give the researcher two chances instead of one to make valid inference on the parameters of main interest.

Despite their obvious appeal, DR estimators are not used on a regular basis in applied epidemiologic research. One reason for this could be the lack of up-to-date software. To remedy this deficiency we have implemented an `R` package, `drgee`, which carries out DR-estimation in linear, log-linear, and logistic restricted mean models (Zetterqvist and Sjölander, 2015). To facilitate routine use we have made an effort to make the `R` package as user-friendly and fast as possible. In particular, we have made sure that the package scales well for large data, and that it has an input/output interface which is similar to the standard model interface in `R`.

Three broad classes of estimation methods are implemented by the `drgee` package. The first class is DR-estimation, which requires both a nuisance model for the outcome and a nuisance model for the exposure. The second class only requires a nuisance model for the exposure; we refer to this class as “E-estimation”. Our definition of E-estimation covers both the original E-estimation in linear and log-linear models proposed by Robins et al. (1992) (often referred to as “G-estimation” in the causal infererence literature), and “retrospective maximum likelihood” in logistic models proposed by Tchetgen Tchetgen and Rotnitzky (2011). The third class is standard GEE estimation, which only requires a nuisance model for the outcome; to be consistent in jargon we refer to this class as “O-estimation”.

In this paper we describe the `R` package drgee. The paper is organized as follows. In Sections 2–Section 4 we review the theory of O-, E-, and DR-estimation, respectively, and illustrate through practical examples how these can be carried out with the `drgee` package. These sections are divided into “theory” and “example” subsections. The theory subsections are intended to serve as a brief but self-consistent tutorial in O-, E-, and DR-estimation; readers who want to get an immediate idea of how the `drgee` package works may skip the more difficult theory subsections at a first reading. An appeal of standard GEEs is their ability to handle clustered data. The `drgee` package can handle clustered data as well, which we illustrate in Section 5. In Section 6 we compare the three estimation methods, and illustrate the doubly robustness property of the DR estimator through a simulation study. In Section 7 we provide concluding remarks.

## 2 O-estimation

### 2.1 Theory

We assume that data consist of iid observations of *L*, and *L*. We consider the model

where *g* is a link function, *A* with *q*-dimensional parameter. Typically, both

and the outcome nuisance model

Our main interest lies in the main model and

When both models [5] and [6] are correct, we have that

It then follows from standard theory for estimating equations (see Appendix) that a consistent and asymptotically normal estimator

for

### 2.2 Example 1

To demonstrate how the `drgee` package can be used for O-estimation, we use the dataset `SLID` from the `car` package (Fox and Weisberg, 2011). This publically available dataset contains data from the 1994 wave of Canadian Survey of Labour and Income dynamics, with variables `wages`, `sex`, `education`, `age` and `language` for composite hourly wage (Canadian dollar), sex, years of education, age and native language respectively. The variables `sex` and `language` are factors with levels ("`Female`", "`Male`") and ("`English`", "`French`", "`Other`"), respectively. We refer to the `car` package for a more thorough documentation of the dataset.

Suppose that we wish to use these data study if there is a direct effect of sex on wages, not mediated through education level. Such a direct effect is clearly of substantive interest, since it would be an indication of sex discrimination. To eliminate the mediated effect we wish to control for education level. However, to avoid bias we must then additionally control for covariates that can be confounders for the mediator and the outcome (see Valeri and VanderWeele, 2013 and references therein). It is obvious that age can be such a mediator-outcome confounder, since age is likely to be associated with both education level and wages. Native language may also be a mediator-outcome confounder, by being associated with education level and wages through ethnicity and socio-economic status. We thus use a model for the mean of `wages` conditional on `sex`, `education`, `age` and `language`:

where the factors `sex` and `language` are recoded as dummy variables, with reference levels "`Female`" and "`English`", respectively. In this model,

The nuisance parameter is

To use O-estimation based on these models we type:

> library(car) > fit <- drgee(oformula=wages~education + age + language, + exposure="sex", estimation.method="o", data=SLID)

By setting the `estimation.method` argument to "`o`" we tell the `drgee` function to use O-estimation. The `oformula` argument specifies the outcome nuisance model and the `exposure` argument specifies the exposure. There is no need to explicitly specify the outcome, since this is identified by the `drgee` function as the response variabel in `oformula`. To summarize the results we type:

> summary(fit)

which gives us the output

Call: drgee(exposure = "sex", oformula = wages ~ education + age + language, data = SLID, estimation.method = "o") Outcome: wages Exposure: sexMale Covariates: education,age,languageFrench,languageOther Main model: wages ~ sexMale Outcome nuisance model: wages ~ education + age + languageFrench + languageOther Outcome link function: identity Estimate Std. Error z value Pr(>|z|) sexMale 3.4554 0.2091 16.53 <2e-16 *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 3987 complete observations used

Only the result for the main parameters are shown in the output. Since `drgee` only uses complete observations, 3428 incomplete observations in the original dataset `SLID` were not used in the calculations. The estimate of

### 2.3 Example 2

The linear and interaction-free model in Example 1 is simple, but may not be entirely realistic. Wage distributions are often right skewed, and therefore a linear model may not fit data very well. Furthermore, inference for direct effects may be misleading if significant exposure-mediator interactions are omitted from the model (see Valeri and VanderWeele, 2013 and references therein).

Therefore, suppose that we want to use a log-linear model instead, including an interaction between sex and education. We then replace the main model [7] with

in which `1,education`), and the outcome nuisance model [8] with

To use O-estimation based on these models we type:

> fit <- drgee(oformula=wages~education+age+language, + exposure="sex", iaformula=~education, olink="log", + estimation.method="o", data=SLID)

The `olink` argument specifies the link function *g*. It defaults to "`identity`" which gives a linear model. By setting the `olink` argument equal to "`log`" we tell the `drgee` function to fit a log-linear model instead. The `iaformula` argument specifies

Call: drgee(exposure = "sex", oformula = wages ~ education + age + language, iaformula = ~education, olink = "log", data = SLID, estimation.method = "o") Outcome: wages Exposure: sexMale Covariates: education,age,languageFrench,languageOther Main model: wages ~ sexMale + sexMale:education Outcome nuisance model: wages ~ education + age + languageFrench + languageOther Outcome link function: log Estimate Std. Error z value Pr(>|z|) sexMale 0.581088 0.062848 9.246 < 2e-16 *** sexMale:education –0.026175 0.004511 –5.802 6.54e-09 *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 3987 complete observations used

We observe that both the main effect of sex and the sex-education interaction are highly significant.

## 3 E-estimation

In E-estimation we use the main model [5], as in O-estimation. However, in contrast to O-estimation, E-estimation leaves the covariate part `drgee` package implements E-estimation when *g* is the identity, log or logit link function. We discuss the identity and log link functions in Section 3.1. The logit function requires special treatment, and is discussed in Section 3.3.

### 3.1 Theory: E-estimation when *g* is the identity or log link function

When *g* is the identity or log link function, we use a model for the mean of the exposure conditional on the covariates:

In this model, *h* the identity, log or logit link function, *L*. We will refer to eq. [11] as the exposure nuisance model. To find an E-estimator of

When the main model [5] is correct with true parameter *g* is the identity link function, then

where the main model [5] was used in the last equality. If *g* is the log link function, then

where the main model [5] was used in the second equality. Since *A* and *L* and since *L*, we have (by the law of total expectation) that

The last expression equals 0 when the exposure nuisance model [11] is correct. This motivates the estimating function

When both the main model [5] and the exposure nuisance model [11] are correct, we have that

It then follows from standard theory for estimating equations (see Appendix) that a consistent and asymptotically normal estimator

for

### 3.2 Example 3

Continuing Example 2 (Section 2.3), suppose that we want to combine the main model [9] with the exposure nuisance model

To use E-estimation based on these models we would type:

> fit <- drgee(outcome="wages", + eformula=sex~education + age + language, + iaformula=~education, olink="log", elink="logit", + estimation.method="e", data=SLID)

By setting the `estimation.method` argument to "`e`" we tell the `drgee` function to use E-estimation. We then use an `eformula` argument to specify the exposure nuisance model. Now there is no need to explicitly specify the exposure, since this is identified by the `drgee` function as the response variable in `eformula`. The elink argument specifies the link function *h* for the exposure nuisance model. Since no outcome nuisance model is used in E-estimation, the `oformula` argument may be omitted. However, we then need to specify the `outcome` through the outcome argument. Summarizing the results gives:

Call: drgee(outcome = "wages", eformula = sex ~ education + age + language, iaformula = ~education, olink = "log", elink = "logit", data = SLID, estimation.method = "e") Outcome: wages Exposure: sexMale Covariates: education,age,languageFrench,languageOther Main model: wages ~ sexMale + sexMale:education Outcome link function: log Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther Exposure link function: logit Estimate Std. Error z value Pr(>|z|) sexMale 0.370139 0.064773 5.714 1.1e-08 *** sexMale:education –0.010613 0.004738 –2.240 0.0251 * --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 3987 complete observations used

We observe that the obtained E-estimates are quite different from the O-estimates (see Section 2.3). This indicates that at least one of the nuisance models [10] and [13] is misspecified.

### 3.3 Theory: E-estimation when *g* is the logit link function

When *g* is the logit link function and main model [5] holds, *A* and *Y* are binary we can instead use the symmetry of the odds ratio to perform E-estimation. In this case, the main model [5] is equivalent with the model

Combining this with the exposure nuisance model

where *L*, we arrive at the model

which has the same structural form as the model [4] used in O-estimation when *g* is the logit link function. Therefore, we can use the estimating function *A* and *Y* “switch place” in the estimating function. It then follows from standard theory for estimating equations (see Appendix) that a consistent and asymptotically normal estimator

for

### 3.4 Example 4

To perform E-estimation with logit outcome link function, we need both outcome and exposure to be binary. Suppose that we recode `wages` in the `SLID` dataset as a binary variable:

> SLID$highWage<-ifelse(SLID$wages< = 14,0,1)

We can then use the logistic main model

and the logistic exposure nuisance model

To use E-estimation based on these models we type:

> fit <- drgee(outcome = "highWage", eformula = sex~education + age + language, + iaformula = ~education, olink = "logit", elink = "logit", + estimation.method = "e", data = SLID) > summary(fit) Call: drgee(outcome = "highWage", eformula = sex ~ education + age + language, iaformula = ~education, olink = "logit", elink = "logit", data = SLID, estimation.method = "e") Outcome: highWage Exposure: sexMale Covariates: education,age,languageFrench,languageOther Main model: highWage ~ sexMale + sexMale:education Outcome link function: logit Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther Exposure link function: logit Estimate Std. Error z value Pr(>|z|) sexMale 1.93520 0.30980 6.247 4.2e-10 *** sexMale:education –0.06361 0.02287 –2.782 0.00541 ** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 3987 complete observations used

Again, both main effect and interaction are highly significant.

## 4 DR-estimation

In DR-estimation, we combine a main model with an outcome nuisance model and an exposure nuisance model to construct an estimator *g* is the logit link function separately.

### 4.1 Theory: DR-estimation when *g* is the identity or log link function

Let *g* is the identity or log link function, the results in Section 3 imply that

We then have that

The last expression equals 0 when at least one of the nuisance models [6] and [11] is correct. This motivates the estimating function

When the main model [5] is correct and either the outcome nuisance model [6] or the exposure nuisance model [14] is correct, we have that

It then follows from standard theory for estimating equations (see Appendix) that a consistent and asymptotically normal estimator

for

### 4.2 Example 5

Continuing Example 2 (Section 2.3) and Example 3 (Section 3.2), suppose that we want to combine the main model [9] with the outcome nuisance model [10] and the exposure nuisance model [13]. To use DR-estimation based on these models we type:

> fit <- drgee(oformula=wages~education+age+language, + eformula=sex~education+age+language, + iaformula=~education, olink="log", elink="logit", + estimation.method="dr", data=SLID)

By setting the `estimation.method` argument to "`dr`" we tell the `drgee` function to use DR-estimation. The `outcome` and `exposure` arguments may be omitted, since the `drgee` function identifies the outcome and exposure as the response variables in `oformula` and `eformula`, respectively. Summarizing the results gives:

> summary(fit) Call: drgee(oformula = wages ~ education + age + language, eformula = sex ~ education + age + language, iaformula = ~education, olink = "log", elink = "logit", data = SLID, estimation.method = "dr") Outcome: wages Exposure: sexMale Covariates: education,age,languageFrench,languageOther Main model: wages ~ sexMale + sexMale:education Outcome nuisance model: wages ~ education + age + languageFrench + languageOther Outcome link function: log Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther Exposure link function: logit Estimate Std. Error z value Pr(>|z|) sexMale 0.57752 0.06333 9.119 < 2e-16 *** sexMale:education –0.02591 0.00455 –5.696 1.23e-08 *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 3987 complete observations used

We observe that the DR estimates are very similar to the O-estimates in Example 2 (Section 2.3), but quite different from the E-estimates in Example 3 (Section 3.2). This indicates that the exposure nuisance model may not be well specified. However, it does not prove that the outcome nuisance model is correct; we demonstrate in Section 6.1 that all methods may agree well even though both nuisance models are misspecified.

### 4.3 Theory: DR-estimation when *g* is the logit link function

When *g* is the logit link function, DR-estimation can be performed by the `drgee` package if the exposure *A* is binary and *h* is the logit link function. In this case the main model [5] is equivalent to the odds ratio model

By combining this model with the outcome nuisance model

and the exposure nuisance model [14] we can construct the estimating function

where

when the main model [18] is correct and either the outcome nuisance model [19] or the exposure nuisance model [14] is correct. It then follows from standard theory for estimating equations (see Appendix) that a consistent and asymptotically normal estimator

for *L*.

### 4.4 Example 6

Continuing Example 4 (Section 3.4), suppose that we want to combine main model [15], the exposure nuisance model [16] and the logistic outcome nuisance model

To use DR-estimation based on these models we type:

> fit <- drgee(oformula=highWage~education+age+language, + eformula=sex~education+age+language, iaformula=~education, + olink="logit", elink="logit", estimation.method="dr", data=SLID) > summary(fit) Call: drgee(oformula = highWage ~ education + age + language, eformula = sex ~ education + age + language, iaformula = ~education, olink = “logit", elink = “logit", data = SLID, estimation.method = “dr") Outcome: highWage Exposure: sexMale Covariates: education,age,languageFrench,languageOther Main model: highWage ~ sexMale + sexMale:education Outcome nuisance model: highWage ~ education + age + languageFrench + languageOther Outcome link function: logit Exposure nuisance model: sexMale ~ education + age + languageFrench + languageOther Exposure link function: logit Estimate Std. Error z value Pr(>|z|) sexMale 2.9050 0.4015 7.236 4.62e-13 *** sexMale:education –0.1341 0.0295 –4.547 5.44e-06 *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 3987 complete observations used

We observe that the DR estimates are quite different from the E-estimates in Example 4 (Section 3.4). This indicates that the exposure nuisance model [16] may not be well specified.

## 5 Estimation with clustered data

### 5.1 Theory

When data are clustered, the parameter estimates obtained in Sections 2–4 are still consistent. However, their standard errors must be corrected for within-cluster correlations. Suppose that data contain *m* independent clusters with *i* for

### 5.2 Example 7

To demonstrate estimation with clustered data, we use the dataset `ohio` in `geepack` package (Højsgaard et al., 2006). This dataset contains data to study the health effects of air pollution on children. The children were examined annually at ages 7–10. Four variables are included in the dataset: `resp`, `smoke`, `age` and `id`. The variables `resp` (binary), `smoke` (binary) and age (continuous) indicate wheezing status, maternal smoking and age, respectively, at each examination. `id` is the child’s identification number. Suppose that we are interested in the association between maternal smoking and wheezing status conditional on age. Since all subjects are from the same city, we assume that the effect of air pollution is similar for all subjects in the sample. Assuming the main model

the outcome nuisance model

and the exposure nuisance model

we can perform DR-estimation with cluster-corrected standard errors by setting the argument `clusterid` to "`id`":

> library(geepack) > data(ohio) > fit <- drgee(oformula=resp~age,eformula=smoke~age, + olink="logit",elink="logit", estimation.method="dr", + data=ohio, clusterid="id") > summary(fit) Call: drgee(oformula = resp ~ age, eformula = smoke ~ age, olink = "logit", elink ="logit", data = ohio, estimation.method = "dr", clusterid="id") Outcome: resp Exposure: smoke Covariates: age Main model: resp ~ smoke Outcome nuisance model: resp ~ age Outcome link function: logit Exposure nuisance model: smoke ~ age Exposure link function: logit Estimate Std. Error z value Pr(>|z|) smoke 0.2721 0.1781 1.528 0.127 (Note: The estimated parameters quantify the conditional exposure-outcome association, given the covariates included in the nuisance models) 2148 complete observations used Cluster-robust Std. errors using 537 clusters defined by levels of id

## 6 Simulation studies

To further demonstrate the capabilities of the `drgee` package and to compare the estimation methods, we carried out two simulation studies.

### 6.1 Simulation study 1

In the first simulation we generated data from the following model:

Under this model, we generated 1,000 samples with 500 observations each. Each sample was analyzed with O-estimation, E-estimation and DR-estimation. For each estimation method we carried out four analyses. In the first analysis,

In the third analysis we used the misspecified exposure nuisance model

In the fourth analysis we used both misspecified nuisance models [20] and [21]. For each estimation method, we calculated the mean (over the 1,000 samples) estimate of

Mean estimate | Mean standard error | Empirical standard error | Empirical coverage probability of 95% CI | ||

I | 1.497 | 0.106 | 0.107 | 0.941 | |

1.503 | 0.167 | 0.173 | 0.946 | ||

1.497 | 0.107 | 0.109 | 0.939 | ||

II | 0.522 | 0.200 | 0.205 | 0.004 | |

1.503 | 0.167 | 0.173 | 0.946 | ||

1.503 | 0.167 | 0.173 | 0.946 | ||

III | 1.497 | 0.106 | 0.107 | 0.941 | |

0.519 | 0.200 | 0.205 | 0.003 | ||

1.497 | 0.106 | 0.108 | 0.940 | ||

IV | 0.522 | 0.200 | 0.205 | 0.004 | |

0.519 | 0.200 | 0.205 | 0.003 | ||

0.519 | 0.200 | 0.205 | 0.003 |

In the first analysis, all three mean estimates are close to the true value 1.5. This demonstrates that all three methods give unbiased estimates of

### 6.2 Simulation study 2

In the second simulation we generated data from the following model:

Under this model, we generated 1,000 samples with 500 observations each. Each sample was analyzed with O-estimation, E-estimation and DR-estimation. For each estimation method we carried out four analyses. In the first analysis, both outcome and exposure nuisance models were correct. In the second analysis, we used the misspecified outcome nuisance model

In the third analysis, we used the misspecified exposure nuisance model

In the fourth analysis we used both misspecified nuisance models [22] and [23]. We calculated the same summary statistics as in the first simulation. The results are shown in Table 2.

Mean estimate | Mean estimated standard error | Empirical standard error | Empirical coverage probability of 95% CI | ||

I | 1.528 | 0.269 | 0.266 | 0.961 | |

1.020 | 0.283 | 0.283 | 0.940 | ||

1.534 | 0.275 | 0.272 | 0.952 | ||

1.034 | 0.330 | 0.337 | 0.948 | ||

1.538 | 0.280 | 0.280 | 0.958 | ||

1.039 | 0.386 | 0.394 | 0.950 | ||

II | 0.777 | 0.241 | 0.240 | 0.158 | |

1.278 | 0.248 | 0.251 | 0.819 | ||

1.534 | 0.275 | 0.272 | 0.952 | ||

1.034 | 0.330 | 0.337 | 0.948 | ||

1.538 | 0.281 | 0.277 | 0.960 | ||

1.038 | 0.372 | 0.379 | 0.951 | ||

III | 1.528 | 0.269 | 0.266 | 0.961 | |

1.020 | 0.283 | 0.283 | 0.940 | ||

0.720 | 0.245 | 0.243 | 0.119 | ||

1.502 | 0.338 | 0.358 | 0.711 | ||

1.531 | 0.276 | 0.273 | 0.961 | ||

1.052 | 0.379 | 0.395 | 0.949 | ||

IV | 0.777 | 0.241 | 0.240 | 0.158 | |

1.278 | 0.248 | 0.251 | 0.819 | ||

0.720 | 0.245 | 0.243 | 0.119 | ||

1.502 | 0.338 | 0.358 | 0.711 | ||

0.785 | 0.248 | 0.245 | 0.187 | ||

1.057 | 0.333 | 0.348 | 0.939 |

The results in Table 2 again demonstrate that the DR estimator is indeed doubly robust, i.e. that it is unbiased as long as at least one of the nuisance models is correctly specified. For the main effect

## 7 Discussion

In this paper we have summarized the theory behind O-estimation, E-estimation and DR estimation in restricted mean models. We have have also described, through practical examples, how the `drgee` package in `R` can be used to perform these three types of estimations. Finally, we have carried out a simulation study to compare the estimation methods, and to demonstrate the doubly robustness property of the DR estimator.

There are other `R` packages available for doubly robust estimation, e.g. `tmle`, `ltmle`, `iWeigReg`, and `multiPIM`. However, these packages target a different parameter, namely the marginal (over covariates) exposure effect, whereas our package targets the conditional (on covariates) exposure effect.

Epidemiology is a rapidly evolving field, and it is desirable that applied epidemiologists use the best methods available when analyzing data. However, in our experience epidemiologists often resort to suboptimal standard methods, due to the lack of up-to-date software. We believe that the `drgee` package fills an important gap between theory and practice, and that it will facilitate the use of DR-estimation in the epidemiologic field.

**Funding statement: ****Funding**: This work was funded by the Swedish Research Council (grant/award no.: 340-2012-6007).

## Appendix: Theory for estimating equations

In this Appendix we briefly review some basic theory for estimating equations, we refer to Newey and McFadden (1994) for a more detailed exposition.

Suppose that we are interested in a parameter *O* and that there is a vector valued function

for *O*. The asymptotic covariance of

The inner element on the right-hand side is referred to as the “meat”, and the outer elements are refererred to as the “bread”. By substituting

Note that the asymptotic properties of the estimator

## References

Bang, H., and Robins, J. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–973.Search in Google Scholar

Fox, J., and Weisberg, S. (2011). An R Companion to Applied Regression. 2nd Edition. Thousand Oaks, CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.Search in Google Scholar

Højsgaard, S., Halekoh, U., and Yan, J. (2006). The R package geepack for generalized estimating equations. Journal of Statistical Software, 15:1–11.Search in Google Scholar

Newey, W., and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.Search in Google Scholar

Robins, J., Mark, S., and Newey, W. (1992). Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48:479–495.Search in Google Scholar

Robins, J., and Rotnitzky, A. (2001). Comment on ‘inference for semiparametric models: Some questions and an answer,’ by Bickel and Kwon. Statistica Sinica, 11:920–936.Search in Google Scholar

Tchetgen Tchetgen, E., Robins, J., and Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97:171–180.Search in Google Scholar

Tchetgen Tchetgen, E., and Rotnitzky, A. (2011). Double-robust estimation of an exposure-outcome odds ratio adjusting for confounding in cohort and case-control studies. Statistics in Medicine, 30:335–347.Search in Google Scholar

Valeri, L., and VanderWeele, T. (2013). Mediation analysis allowing for exposure–mediator interactions and causal interpretation: Theoretical assumptions and implementation with SAS and SPSS macros. Psychological Methods, 18:137.Search in Google Scholar

Zetterqvist, J., and Sjölander, A. (2015). drgee: doubly robust generalized estimating equations. http://CRAN.R-project.org/package=drgee, version 1.1.3.Search in Google Scholar

**Published Online:**2015-7-25

**Published in Print:**2015-12-1

©2015 by De Gruyter