1 Introduction
In many public health, biological, and biomedical systems, the mechanism that explains how an intervention or exposure affects the outcome of interest is unknown, even after a causal association between the exposure and the outcome is established. It is sometimes hypothesized that there exists a mediator that connects the exposure and the outcome, sitting on the causal pathway between them. In observational studies, identifying a plausible, ideally pre-specified, mediator can strengthen the causal inference drawn from the findings. For example, in an evaluation of the effectiveness of the ongoing, multi-billion dollar President’s Emergency Plan for AIDS Relief (PEPFAR) in preventing HIV and reducing its incidence in sub-Saharan Africa, the causal inference would be strengthened if it could be shown that a substantial proportion of the reduction in disease incidence over time was mediated by increased programmatic coverage in the region, thus making exogenous time trends a less likely explanation for any observed decline.
Several methods have been proposed to assess whether mediation exists and to quantify its magnitude [1, 2, 3, 4]. [5] described a sequence of hypothesis tests to assess the evidence in the data for mediation by a specific covariate. They assumed a linear model for the relationship between the outcome and the exposure, both marginally and conditionally on the mediator. They also assumed a linear model for the relationship between the exposure and the mediator. Within the counterfactual framework, the building blocks of mediation analysis are the natural direct effect (NDE), defined as the effect on the outcome of increasing the value of the exposure by one unit while holding the mediator at a fixed level, and the natural indirect effect (NIE), which is the effect on the outcome when the exposure is held fixed but the mediator value is changed as it would have changed had the exposure value been increased by one unit [6, 7, 8]. The sum of the NDE and NIE is the total effect (TE). Under this framework, estimation methods for the TE, NDE and NIE were developed for various statistical models. Examples beyond the linear model include logistic regression [9], zero-inflated regression models [10] and high-dimensional mediators in linear regression with normal errors [11].
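For reference, the verbal definitions above are commonly formalized as follows. The notation is an assumption introduced here for illustration: Y(a, m) denotes the outcome that would be observed under exposure level a and mediator value m, M(a) denotes the mediator value that would be observed under exposure level a, and a* is a reference exposure level (for a unit change, a = a* + 1). Conditioning on confounders is omitted for brevity.

```latex
\begin{align*}
\mathrm{NDE} &= E\{Y(a, M(a^*))\} - E\{Y(a^*, M(a^*))\},\\
\mathrm{NIE} &= E\{Y(a, M(a))\} - E\{Y(a, M(a^*))\},\\
\mathrm{TE}  &= E\{Y(a, M(a))\} - E\{Y(a^*, M(a^*))\} = \mathrm{NDE} + \mathrm{NIE}.
\end{align*}
```

In this form, the NDE holds the mediator at the value it would naturally take under the reference exposure, and the two effects add up to the TE by construction.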
One way to estimate mediation is through the product method [5]. Another widely used method for assessing mediation is the difference method [5, 12, 13]. It quantifies the difference in estimates obtained from separate exposure-outcome relationship models, with and without the mediator. The mediation proportion, defined as the change in the effect of the exposure due to mediation by the mediator relative to the total effect, is a main parameter of interest when performing mediation analysis. An analogous measure in surrogacy analysis, termed the proportion of treatment effect (PTE), aids researchers in deciding whether an intermediate marker can be used as a surrogate for a final outcome of interest. Quantifying the PTE raises statistical questions closely related to those that arise in studying the mediation proportion. For the case in which the intermediate and the final outcomes are both binary, confidence intervals for the PTE were developed in [14]. A time-to-event final outcome with surrogate biomarkers was considered by [15], who used a data duplication algorithm to estimate the covariance between estimators obtained from separate models. The PTE measure is still actively used and studied in surrogacy research [e.g., 16].
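As an illustration of the two approaches described above, the following minimal sketch computes the difference-method and product-method estimates under a simple linear model fitted by ordinary least squares, where the two coincide algebraically. The simulated data, coefficient values, and function names are assumptions made for this example only.

```python
import random

def ols_simple(x, y):
    """Intercept and slope of the least-squares fit y ~ 1 + x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def ols_two(x, m, y):
    """Slopes (b_x, b_m) of the least-squares fit y ~ 1 + x + m."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    smm = sum((b - mm) ** 2 for b in m)
    sxm = sum((a - mx) * (b - mm) for a, b in zip(x, m))
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    smy = sum((b - mm) * (c - my) for b, c in zip(m, y))
    det = sxx * smm - sxm ** 2
    return (smm * sxy - sxm * smy) / det, (sxx * smy - sxm * sxy) / det

# Hypothetical data: exposure x, mediator m, outcome y (true proportion 0.4).
random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.4 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]

_, total = ols_simple(x, y)          # exposure effect without the mediator
direct, b_m = ols_two(x, m, y)       # exposure effect with the mediator
_, a = ols_simple(x, m)              # exposure effect on the mediator

diff_estimate = total - direct       # difference method
prod_estimate = a * b_m              # product method
mediation_proportion = diff_estimate / total
print(diff_estimate, prod_estimate, mediation_proportion)
```

The equality of the two estimates here is specific to least squares with an identity link; it should not be expected under the logit link or the Cox model.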
Methods for variance estimation, statistical testing and confidence interval construction for mediation parameters have been suggested by past authors. For the difference method, [14] suggested using results from the linear model in the binary outcome setup to approximate the covariance between the two estimators. Other approximations were described and compared in [4]. For the product method, the variance can be calculated using either the delta method [17] or Goodman’s exact variance-of-product formula [18]. However, since the finite-sample distribution of the product method estimator was found to be non-normal, the bootstrap was generally recommended [2, 19], at least for small or medium sample sizes. For the simulation-based mediation approach proposed by [20], a quasi-Bayesian Monte Carlo method or the bootstrap can be used [20, 21].
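The two variance formulas mentioned for the product method can be sketched as follows, assuming the two coefficient estimates are independent. The numbers and function names are hypothetical. The second function uses the exact variance of a product of two independent random variables with estimates plugged in; a variant that subtracts the last term to obtain an unbiased estimator also appears in the literature.

```python
def delta_method_variance(a, se_a, b, se_b):
    """First-order (delta method) variance of the product a * b."""
    return a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2

def exact_product_variance(a, se_a, b, se_b):
    """Exact variance of a product of independent variables, plug-in version."""
    return a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2 + se_a ** 2 * se_b ** 2

# Hypothetical estimates: a (exposure -> mediator), b (mediator -> outcome).
a, se_a = 0.50, 0.10
b, se_b = 0.40, 0.08

v_delta = delta_method_variance(a, se_a, b, se_b)
v_exact = exact_product_variance(a, se_a, b, se_b)
print(v_delta, v_exact)
```

The two formulas differ only by the second-order term, which is negligible when both standard errors are small relative to the coefficients.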
While the mediation proportion is a popular measure in mediation analysis [e.g., 22, 23, 24, 25], statistical inference for this parameter is not sufficiently developed. The NIE and NDE are well-defined concepts; in practice, however, researchers are often primarily interested in the mediation proportion, as exemplified by the aforementioned papers. In this paper, we provide a framework for mediation analysis in generalized linear models (GLMs). We combine a generalized estimating equations (GEE) approach with a data duplication algorithm to formulate valid statistical inference under minimal assumptions on the marginal and conditional distributions of the outcome. We discuss situations in which these assumptions should hold, and assess robustness to departures from these assumptions in extensive simulation studies. This paper further provides methods for statistical inference in mediation analysis using the difference method, including confidence intervals for the mediation proportion and hypothesis tests. Our investigation of these aspects is extended beyond GLMs to inference about the mediation proportion for the Cox model.
The remainder of this paper is organized as follows. In Section 2, we formulate the models needed for the estimation of the mediation proportion in GLMs. In Section 3, we consider the
2 The models
Assume
where
In this paper, we consider a mediator,
Let
The definitions of the NDE and NIE, as given by [7], use the counterfactual framework. Let
Under certain identifiability conditions, to estimate these effects, the mediation formula, given by [28, 29] and [6], can be used. Extensive work has been published on nonparametric and non-linear models [e.g., 6, 20, 21].
In GLMs, alternative definitions for the NDE, NIE and TE have been proposed. For example, for a binary outcome following a logistic model, [30] defined the TE on the odds ratio scale as
and have shown that it can be decomposed into a product of an NDE and an NIE. Therefore, on the log odds ratio scale, the TE decomposes into the sum of the NDE and the NIE. See [30] and [31] for further details. While these definitions depend on the value of the confounders, they show that, for a rare outcome, under the logistic regression outcome model and a linear mediator-exposure regression model, estimates of the NDE and NIE can be obtained using estimates of the regression model parameters. This result also extends to the log link function in GLMs, without a rare outcome assumption.
In this paper, we consider the causal effects on the link function scale and assume no exposure-mediator interaction. That is, as discussed by [32], under the assumptions given below, the TE, NDE and NIE are
Throughout this paper, we assume that, after adjusting for measured confounders, there is no unmeasured confounding of the estimates of the exposure-outcome relationship, the mediator-outcome relationship or the exposure-mediator relationship. We also assume that confounders of the mediator-outcome relationship are unaffected by the exposure. Alternative identifiability assumptions have been given, e.g., in [6], which we will not consider further in this paper.
The mediation proportion, defined as
Under the aforementioned no-confounding assumptions, considering a unit change (i.e.,
In this paper, we focus on the difference method, for which we will develop asymptotic properties. Under
If
The question of mediation can also be investigated when the available data is survival data. A counterfactual framework for mediation analysis of survival data has been previously provided [33, 34, 35, 36]. [15] considered this question for the Cox model in the context of the PTE. First, as in [15], we define the following two models for the hazard function at time
where
3 Further results on -linkability
In this section, we consider the issue of when the full model (2) and the marginal model (3) both hold with the same function
We consider the three common link functions: identity, log and logit. For each of these functions we give a general condition for the distribution of
3.1 Identity link function
Under the identity link function, models (2) and (3) simplify to
We now show that
where
3.2 Log link function
Under the log link function,
and we have
Therefore, in the log link case,
3.3 Logit link function
The issue of whether the logistic regression model holds for both the conditional and marginal models has been discussed in the literature [37, 38, 39]. The logit link function, defined as
4 Inference for the mediation proportion
For simplicity of presentation, we assume throughout this section that
Asymptotic confidence intervals for
where
While
Assume now we have estimates
with
Past authors concentrated on methods for testing that the mediation proportion is at least some fraction
An alternative test statistic is based upon a test for the difference between the effect estimates in the marginal and conditional models. That is, on
Then, the null hypothesis is rejected if
4.1 The data duplication algorithm
A main challenge when conducting inference for the mediation proportion
The augmented data used by the data duplication algorithm. For each original observation
| i | j | Intercept | Intercept | |||||||
| 1 | 1 | 1 | 0 | 0 | 0 | |||||
| 1 | 2 | 0 | 1 | 0 | 0 | 0 | 2 | |||
| 2 | 1 | 1 | 0 | 0 | 0 | |||||
| 2 | 2 | 0 | 1 | 0 | 0 | 0 | ||||
The following pseudo model is fitted to the duplicated data using GEE [27],
where
where
Then, the estimating equations given by eq. (11) are identical to the estimating equations for fitting models (2) and (3) separately, because
5 Simulation study
The simulation studies and data analysis were conducted in R. The code is available upon request from the first author. In addition, we have developed a SAS macro and an R package that are publicly available (Appendix A.1). In the simulation studies, we considered several issues regarding the performance of the methodology we presented throughout the paper. We first present results concerning
Throughout these simulation studies, we assume that there are no confounders in the model.
5.1 -linkability for the logit link function and of the Cox model
In order to assess the magnitude of the bias when assuming
We chose the model parameter values in the following way. First, we chose
where
For the Cox model, we simulated the data similarly to the logit link function simulations. First, we simulated
In order to assess

Relative bias of the mediation proportion estimator under the logistic model as a function of the mediation proportion
Citation: The International Journal of Biostatistics 13, 2; 10.1515/ijb-2017-0006

Relative bias of the mediation proportion estimator under the Cox model as a function of the mediation proportion
Considering the
5.2 Estimation and inference performance
For the Cox model and the logit link function, data were simulated as described above. For the identity link function, data were simulated from the model
Under the simple linear model (identity link), estimates obtained from the difference and product methods are algebraically identical [41]. This result does not extend to the logit link function or to the Cox model [3, 42]. We therefore compared the difference and product methods. Let
Percent relative bias and efficiency of estimators of the mediation proportion under the logistic link function.
| 0.005 | 0.1 | 0.1 | -0.23 | 1.00 | -0.82 | 1.00 | 2.84 | 1.14 |
| 0.5 | 0.01 | 1.00 | 0.04 | 1.00 | -0.14 | 1.00 | ||
| 0.7 | -0.01 | 1.00 | 0.02 | 1.00 | -0.04 | 1.00 | ||
| 0.3 | 0.3 | -0.07 | 1.00 | -0.20 | 1.00 | 0.12 | 0.99 | |
| 0.5 | -0.02 | 1.00 | -0.07 | 1.00 | -0.26 | 0.99 | ||
| 0.7 | -0.00 | 1.00 | -0.02 | 1.00 | -0.10 | 1.00 | ||
| 0.5 | 0.5 | -0.02 | 1.00 | -0.08 | 1.00 | -0.33 | 1.00 | |
| 0.7 | -0.01 | 1.00 | -0.02 | 1.00 | -0.11 | 1.00 | ||
| 0.01 | 0.1 | 0.1 | -0.55 | 0.99 | -1.00 | 1.02 | 6.00 | 1.13 |
| 0.5 | 0.02 | 1.00 | -0.08 | 1.00 | 0.22 | 0.99 | ||
| 0.7 | 0.02 | 1.00 | 0.04 | 1.00 | -0.12 | 1.00 | ||
| 0.3 | 0.3 | -0.11 | 1.00 | -0.42 | 0.99 | 0.88 | 1.01 | |
| 0.5 | -0.03 | 1.00 | -0.12 | 1.00 | -0.04 | 0.99 | ||
| 0.7 | -0.01 | 1.00 | -0.04 | 1.00 | 0.15 | 1.00 | ||
| 0.5 | 0.5 | -0.04 | 1.00 | -0.14 | 1.00 | -0.35 | 1.00 | |
| 0.7 | -0.01 | 1.00 | -0.04 | 1.00 | -0.20 | 1.00 | ||
| 0.1 | 0.1 | 0.1 | -1.44 | 0.97 | 11.52 | 0.97 | 46.42 | 1.33 |
| 0.5 | 0.27 | 1.00 | -0.63 | 0.99 | 1.45 | 0.96 | ||
| 0.7 | -0.18 | 1.00 | -0.32 | 1.00 | -0.24 | 0.98 | ||
| 0.3 | 0.3 | -0.93 | 0.99 | -0.86 | 0.97 | 11.26 | 0.97 | |
| 0.5 | -0.30 | 1.00 | -1.05 | 0.98 | 2.99 | 0.94 | ||
| 0.7 | -0.10 | 1.00 | -0.36 | 0.99 | -1.20 | 0.97 | ||
| 0.5 | 0.5 | -0.30 | 1.00 | -0.35 | 1.00 | 2.71 | 1.01 | |
| 0.7 | -0.09 | 1.00 | -0.41 | 1.00 | 0.62 | 0.99 | ||
Type
| Identity link function ( | |||||||
| 0.00 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | ||
| 0.01 | 0.04 | 0.05 | 0.05 | 0.05 | 0.05 | ||
| 0.01 | 0.05 | 0.04 | 0.05 | 0.06 | 0.06 | ||
| 0.28 | 0.60 | 0.91 | 0.90 | 0.90 | 0.88 | ||
| 0.02 | 0.07 | 0.37 | 0.38 | 0.79 | 0.79 | ||
| 0.02 | 0.05 | 0.14 | 0.15 | 0.35 | 0.36 | ||
| 0.28 | 0.52 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.07 | 0.18 | 0.90 | 0.91 | 1.00 | 1.00 | ||
| 0.03 | 0.10 | 0.47 | 0.49 | 0.89 | 0.89 | ||
| 0.54 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.17 | 0.36 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.06 | 0.18 | 0.83 | 0.84 | 0.99 | 1.00 | ||
| Logit link function ( | |||||||
| 0.03 | 0.06 | 0.04 | 0.05 | 0.04 | 0.04 | ||
| 0.03 | 0.05 | 0.05 | 0.06 | 0.05 | 0.05 | ||
| 0.04 | 0.05 | 0.04 | 0.04 | 0.05 | 0.05 | ||
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.09 | 0.13 | 0.34 | 0.36 | 0.74 | 0.74 | ||
| 0.06 | 0.09 | 0.15 | 0.16 | 0.35 | 0.36 | ||
| 0.85 | 0.89 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.31 | 0.40 | 0.86 | 0.87 | 1.00 | 1.00 | ||
| 0.13 | 0.17 | 0.42 | 0.44 | 0.87 | 0.87 | ||
| 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.66 | 0.73 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.25 | 0.32 | 0.74 | 0.76 | 0.99 | 0.99 | ||
| Cox model with | |||||||
| 0.03 | 0.05 | 0.04 | 0.04 | 0.04 | 0.04 | ||
| 0.03 | 0.05 | 0.04 | 0.05 | 0.06 | 0.06 | ||
| 0.03 | 0.05 | 0.06 | 0.06 | 0.06 | 0.06 | ||
| 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.09 | 0.14 | 0.33 | 0.35 | 0.75 | 0.76 | ||
| 0.05 | 0.09 | 0.15 | 0.16 | 0.32 | 0.33 | ||
| 0.84 | 0.88 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.33 | 0.43 | 0.85 | 0.86 | 1.00 | 1.00 | ||
| 0.15 | 0.19 | 0.46 | 0.47 | 0.87 | 0.88 | ||
| 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.67 | 0.75 | 1.00 | 1.00 | 1.00 | 1.00 | ||
| 0.25 | 0.33 | 0.78 | 0.80 | 1.00 | 1.00 | ||
From estimation we move to hypothesis testing. The two test statistics compared were described in Section 4, where the variance estimators used in the test statistics were obtained by the data duplication algorithm described in Section 4.1. Results are presented in Table 3. In terms of type
Empirical coverage rates (CR) and lengths (LEN) of confidence intervals for the mediation proportion under the identity and logit link functions and the Cox model
| Identity link function ( | ||||||||
| CR | 0.90 | 0.96 | 0.95 | 0.94 | 0.95 | 0.94 | ||
| LEN | 0.24 | 0.13 | 0.12 | 0.10 | 0.06 | 0.05 | ||
| CR | 0.98 | 0.95 | 0.96 | 0.96 | 0.95 | 0.96 | ||
| LEN | 0.85 | 0.25 | 0.15 | 0.33 | 0.11 | 0.07 | ||
| CR | 0.98 | 0.96 | 0.95 | 0.96 | 0.95 | 0.96 | ||
| LEN | 1.42 | 0.41 | 0.24 | 0.56 | 0.18 | 0.11 | ||
| CR | 0.93 | 0.95 | 0.94 | 0.96 | 0.97 | 0.95 | ||
| LEN | 0.65 | 0.20 | 0.14 | 0.25 | 0.09 | 0.06 | ||
| CR | 0.96 | 0.96 | 0.94 | 0.96 | 0.95 | 0.96 | ||
| LEN | 0.93 | 0.28 | 0.17 | 0.37 | 0.12 | 0.07 | ||
| CR | 0.97 | 0.96 | 0.96 | 0.95 | 0.95 | 0.97 | ||
| LEN | 1.51 | 0.43 | 0.26 | 0.59 | 0.19 | 0.11 | ||
| CR | 0.94 | 0.94 | 0.96 | 0.96 | 0.96 | 0.95 | ||
| LEN | 1.11 | 0.32 | 0.20 | 0.43 | 0.14 | 0.09 | ||
| CR | 0.96 | 0.95 | 0.95 | 0.96 | 0.96 | 0.95 | ||
| LEN | 1.65 | 0.47 | 0.27 | 0.63 | 0.20 | 0.12 | ||
| Logit link function ( | ||||||||
| CR | 0.95 | 0.95 | 0.87 | 0.96 | 0.93 | 0.80 | ||
| LEN | 0.13 | 0.07 | 0.04 | 0.08 | 0.05 | 0.03 | ||
| CR | 0.97 | 0.96 | 0.94 | 0.96 | 0.95 | 0.96 | ||
| LEN | 0.49 | 0.26 | 0.15 | 0.33 | 0.19 | 0.11 | ||
| CR | 0.96 | 0.95 | 0.96 | 0.95 | 0.94 | 0.95 | ||
| LEN | 0.84 | 0.43 | 0.25 | 0.57 | 0.31 | 0.18 | ||
| CR | 0.96 | 0.95 | 0.92 | 0.95 | 0.95 | 0.94 | ||
| LEN | 0.39 | 0.20 | 0.11 | 0.26 | 0.13 | 0.08 | ||
| CR | 0.96 | 0.95 | 0.93 | 0.96 | 0.96 | 0.95 | ||
| LEN | 0.55 | 0.29 | 0.17 | 0.37 | 0.20 | 0.12 | ||
| CR | 0.96 | 0.94 | 0.96 | 0.95 | 0.95 | 0.94 | ||
| LEN | 0.88 | 0.47 | 0.26 | 0.59 | 0.32 | 0.19 | ||
| CR | 0.95 | 0.95 | 0.94 | 0.96 | 0.96 | 0.94 | ||
| LEN | 0.65 | 0.34 | 0.20 | 0.45 | 0.24 | 0.14 | ||
| CR | 0.97 | 0.94 | 0.95 | 0.96 | 0.95 | 0.93 | ||
| LEN | 0.95 | 0.49 | 0.29 | 0.66 | 0.34 | 0.20 | ||
| Cox model with | ||||||||
| CR | 0.93 | 0.95 | 0.92 | 0.93 | 0.81 | 0.73 | ||
| LEN | 0.12 | 0.08 | 0.07 | 0.05 | 0.06 | 0.04 | ||
| CR | 0.96 | 0.95 | 0.96 | 0.94 | 0.94 | 0.95 | ||
| LEN | 0.48 | 0.34 | 0.26 | 0.18 | 0.15 | 0.10 | ||
| CR | 0.98 | 0.95 | 0.95 | 0.96 | 0.97 | 0.94 | ||
| LEN | 0.83 | 0.56 | 0.43 | 0.30 | 0.25 | 0.18 | ||
| CR | 0.95 | 0.96 | 0.95 | 0.95 | 0.93 | 0.91 | ||
| LEN | 0.38 | 0.26 | 0.19 | 0.13 | 0.11 | 0.08 | ||
| CR | 0.97 | 0.96 | 0.95 | 0.95 | 0.95 | 0.94 | ||
| LEN | 0.57 | 0.38 | 0.29 | 0.20 | 0.17 | 0.12 | ||
| CR | 0.96 | 0.94 | 0.95 | 0.96 | 0.95 | 0.95 | ||
| LEN | 0.87 | 0.60 | 0.45 | 0.32 | 0.26 | 0.18 | ||
| CR | 0.96 | 0.95 | 0.94 | 0.96 | 0.94 | 0.94 | ||
| LEN | 0.66 | 0.44 | 0.34 | 0.24 | 0.20 | 0.14 | ||
| CR | 0.96 | 0.96 | 0.95 | 0.95 | 0.95 | 0.95 | ||
| LEN | 0.94 | 0.63 | 0.49 | 0.34 | 0.28 | 0.20 | ||
Ratio between mean estimated
| Identity link function ( | ||||
| 0.989 | 0.994 | 0.985 | ||
| 1.036 | 1.014 | 0.963 | ||
| 0.960 | 0.962 | 1.098 | ||
| 1.031 | 0.959 | 0.967 | ||
| 1.103 | 0.889 | 0.989 | ||
| 0.988 | 0.975 | 0.955 | ||
| 1.064 | 0.959 | 1.008 | ||
| 1.010 | 0.980 | 1.100 | ||
| Logit link function ( | ||||
| 0.996 | 0.968 | 1.031 | ||
| 1.070 | 0.977 | 1.023 | ||
| 1.012 | 0.976 | 0.940 | ||
| 1.002 | 0.954 | 1.057 | ||
| 0.954 | 1.016 | 1.080 | ||
| 0.958 | 0.902 | 0.995 | ||
| 0.993 | 0.983 | 1.008 | ||
| 0.996 | 1.011 | 0.959 | ||
| Cox model with | ||||
| 1.089 | 1.035 | 0.951 | ||
| 1.011 | 1.013 | 0.945 | ||
| 1.099 | 1.044 | 1.068 | ||
| 1.053 | 0.892 | 0.906 | ||
| 0.957 | 0.990 | 0.887 | ||
| 0.999 | 0.982 | 0.966 | ||
| 0.945 | 0.938 | 1.036 | ||
| 0.961 | 1.057 | 0.963 | ||
For the difference method, we also compared the confidence intervals based on the asymptotic variance with confidence intervals constructed using the bootstrap, both by estimating the bootstrap variance and assuming normality, and by using the quantiles of the bootstrap samples. The results, presented in Web Appendix C, show that the asymptotic approach is comparable to both bootstrap procedures in terms of attaining the nominal coverage. Both versions of the bootstrap confidence intervals were wider than the asymptotic confidence intervals. Furthermore, the bootstrap is time consuming, especially for large data sets, whereas our method, implemented in publicly available software, is as fast as a single GEE model fit. For
Throughout this section, we presented in parallel results for the identity and logit link functions and the Cox model. There was very strong agreement between the results for the logit link function for binary data and those for the Cox model, as one might expect given the close relationship between the logistic regression model and the Cox model in epidemiology and public health evaluations.
In addition to the scenarios described above, we conducted simulations for the identity link function with error distributions other than the normal. We considered symmetric distributions with tails heavier than those of the normal distribution as well as skewed distributions. As predicted by GEE theory, the performance of the mediation proportion estimator, the statistical tests and the confidence intervals changed only slightly. Details are given in Web Appendix D.
6 Illustrative example
We illustrate the use of our methodology in an analysis of the etiology of pre-menopausal breast cancer using data from the Nurses’ Health Studies (NHS and NHSII) [43, 44]. It was previously found that high mammographic density (MD) is a risk factor for breast cancer [45]. The goal here is to investigate whether, and to what extent, the effects of more distal risk factors for pre-menopausal breast cancer are mediated by high MD. A detailed description of this study is given in [46]. In this nested case-control study, controls were matched to cases by current age, menopausal status, current hormone use, month, time of day and fasting status at blood collection, and luteal day (for NHSII samples only). There were 559 pre-menopausal cases and 1727 controls. Since the disease is rare, as shown in the previous section,
We only considered risk factors with significant total effects: personal history of benign breast disease (HBBD), family history of breast cancer (FH), adolescent somatotype (ASM), body mass index at age 18 (BMI18), age at first birth (AFB), age at menarche (AM) and height (HT). In each of the analyses, we included potential confounders for the risk factor-MD, MD-outcome and risk factor-outcome relationships, based upon subject matter considerations following [46]. For example, we did not adjust for current (adult) BMI in the analysis of BMI18, as the latter affects the former. In addition, some of the risk factors studied may have been confounders in analysis of mediation via MD of another risk factor. The set of confounders used in at least one analysis included current age, fasting status, blood collection time of the day, mammography batch (NHS batch 1, NHS batch 2 or NHSII), current BMI, BMI18, ASM, HBBD, parity, AFB, and AM. As in most observational studies, residual confounding may bias our results.
Since our method assumes no exposure-mediator interaction, we fitted, for each risk factor, a logistic regression that included the risk factor, mediator, potential confounders and an interaction term involving the risk factor and the mediator. Then we tested for interaction using a standard Wald test.
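A standard Wald test of a single interaction coefficient can be sketched as follows. The coefficient and standard error below are hypothetical values chosen for illustration, not estimates from the study, and the function name is an assumption.

```python
import math

def wald_test(estimate, std_error):
    """Two-sided Wald test of H0: coefficient = 0, using the normal reference."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical interaction coefficient and its standard error.
z_stat, p_val = wald_test(0.196, 0.100)
print(round(z_stat, 2), round(p_val, 3))
```

A small p-value would suggest that the no-interaction assumption is questionable for that risk factor.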
Table 6 presents the estimated mediation proportions, confidence intervals, p-values, the estimated risk factor effects, and the p-value for risk factor-mediator interaction. MD was a significant mediator (at the 5% significance level) for HBBD, ASM and BMI18, regardless of whether the test was based on
There was no evidence for risk factor-mediator interaction for all but one risk factor studied here. For HT, the test for the interaction term was significant. However, the point estimate for the proportion of the effect of HT mediated through MD is very close to zero. Thus, the fact that this assumption is violated is unlikely to be of substantive importance. In the supplementary materials of [46], it was reported that when taking the interaction into account using the method of [30], the mediation proportion remained very small, although positive.
The results suggest that, if the non-confounding assumptions needed for causal interpretation of observed associations are met, MD mediates the effect of at least some pre-menopausal breast cancer risk factors, with evidence for a large mediation proportion for BMI18 and ASM and some mediation of HBBD, but not for the other risk factors.
Mediation analysis for pre-menopausal breast cancer incidence with mammographic density as the mediator in the NHS and NHSII studies (
| Risk factor | 95% CI | |||||||
| Personal history of benign breast disease | 0.95 | 0.35 (1.42) | 0.001 | 0.25 (1.28) | 0.30 | 0.10-0.51 | 0.004 | |
| Family history of breast cancer | 0.59 | 0.42 (1.52) | 0.01 | 0.42 (1.52) | 0.004 | -0.10-0.11 | 0.94 | 0.94 |
| Adolescent somatotype | 0.20 | -0.34 (0.72) | 0.02 | -0.12 (0.88) | 0.63 | 0.05-1.20 | 0.03 | |
| BMI at age 18 | 0.20 | -0.23 (0.79) | 0.02 | -0.05 (0.95) | 0.78 | 0.06-1.50 | 0.03 | |
| Age at first birth | 0.17 | 0.15 (1.17) | 0.03 | 0.15 (1.16) | 0.03 | -0.09-0.15 | 0.31 | 0.30 |
| Age at menarche Per 2 year increase | 0.22 | -0.16 (0.86) | 0.03 | -0.18 (0.84) | -0.16 | -0.36-0.04 | 0.12 | 0.04 |
| Height Per 3 inch increase | 0.03 | 0.13 (1.14) | 0.03 | 0.14 (1.14) | -0.01 | -0.14-0.11 | 0.82 | 0.82 |
Adjusted for age, fasting status, blood collection time of the day, mammography batch (NHS batch 1, NHS batch 2 or NHSII), current and at age 18 BMI, adolescent somatotype, history of BBD, parity, age at first birth, and age at menarche
7 Discussion
In this paper, we have provided methodology for estimation and inference for the mediation proportion in GLMs and the Cox model using the difference method. Our methodology for GLMs uses a data duplication algorithm with GEE and allows for the consistent estimation of the covariance of the estimates.
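To illustrate why the covariance between the two estimates matters, the sketch below computes a delta-method confidence interval for the mediation proportion and a Wald statistic for the difference between the two exposure coefficients. It assumes the mediation proportion is written as one minus the ratio of the conditional to the marginal exposure coefficient; the numerical inputs and all names are hypothetical.

```python
import math

def mediation_proportion_ci(beta_marg, beta_cond, var_marg, var_cond, cov, z=1.96):
    """Delta-method CI for p = 1 - beta_cond / beta_marg."""
    p_hat = 1.0 - beta_cond / beta_marg
    d_marg = beta_cond / beta_marg ** 2      # dp / d beta_marg
    d_cond = -1.0 / beta_marg                # dp / d beta_cond
    var_p = (d_marg ** 2 * var_marg + d_cond ** 2 * var_cond
             + 2.0 * d_marg * d_cond * cov)
    se = math.sqrt(var_p)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

def difference_wald(beta_marg, beta_cond, var_marg, var_cond, cov):
    """Wald statistic for H0: beta_marg - beta_cond = 0."""
    return (beta_marg - beta_cond) / math.sqrt(var_marg + var_cond - 2.0 * cov)

# Hypothetical estimates from the marginal and conditional models.
p_hat, se_p, (lower, upper) = mediation_proportion_ci(0.5, 0.3, 0.0025, 0.0030, 0.0024)
z_diff = difference_wald(0.5, 0.3, 0.0025, 0.0030, 0.0024)
print(p_hat, se_p, lower, upper, z_diff)
```

Because the two estimates come from the same subjects and are typically strongly positively correlated, ignoring the covariance term would substantially overstate the variance in this example.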
Strictly speaking, the validity of the difference method relies on the assumption that the marginal model, the one that does not include the mediator, and the conditional model, the one that does, hold simultaneously. However, we demonstrate in this paper that
Despite its popularity, the difference method for estimating the mediation proportion has been criticized due to what appeared to be undesirable finite-sample properties [4, 40]. However, when considering binary outcomes, the covariance (or correlation) between estimates from the marginal and conditional models was typically estimated using approximations from the linear model [1]. We have now developed methodology for a valid covariance estimator and showed that testing for mediation using a test based on the difference yields a valid statistical test, even in finite samples.
An alternative to the difference method is the product method. In terms of finite sample properties, we have shown the two methods are comparable under
The causal structure and the underlying confounding assumptions are important to consider when our methods are used in applications. Confounding may occur due to exposure-mediator confounders, exposure-outcome confounders or mediator-outcome confounders. We refer the readers to [19], and references therein, for relevant discussions on assumptions needed and analysis conducted in order to avoid, or at least minimize, potential bias due to confounding when conducting mediation analysis. The difference method does not allow for mediator-exposure interaction, and alternative methods to allow for this interaction were previously developed [34, 38, 49].
In practice, mediation analysis is often conducted for well-established exposures or risk factors, or when the total effect is significant. As suggested by our simulation results, when the total effect was small, mediation analysis was less likely to provide adequate results. On the other hand, an analysis that only considers significant total effects should take into account that it was performed conditionally on the results of a first stage analysis. The properties of such conditional inference can be considered in future research.
In our implementation of the GEE methodology, we propose using the independence working correlation matrix, which has the useful property that the coefficient estimates are identical whether the two models are fitted separately or fitted together using the data duplication algorithm. Under other working correlation matrices, this property no longer holds, but efficiency may be gained.
In conclusion, the general framework for mediation analysis in GLMs developed in this paper, along with the methodology established, will allow researchers to investigate mediation under various outcome scenarios and to quantify results based on rigorously derived and empirically studied estimators and hypothesis tests.
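The property described above can be checked numerically in the identity-link case, where GEE with an independence working correlation reduces to least squares on the stacked data and the robust covariance is the cluster sandwich estimator. The sketch below is an illustration under assumptions: simulated data, no confounders, and a stacked design with a separate intercept and exposure column for each copy of an observation. It is not the authors' SAS macro or R package.

```python
import random

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def ols(rows, ys):
    """Least-squares coefficients and the matrix X'X."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [[sum(r[a] * y for r, y in zip(rows, ys))] for a in range(k)]
    return [v[0] for v in solve(xtx, xty)], xtx

random.seed(7)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.4 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]

# Separate fits: marginal model (no mediator) and conditional model.
b_marg, _ = ols([[1.0, xi] for xi in x], y)
b_cond, _ = ols([[1.0, xi, mi] for xi, mi in zip(x, m)], y)

# Duplicated data: each subject contributes one row per model.
clusters = [([1.0, xi, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, xi, mi])
            for xi, mi in zip(x, m)]
rows = [r for pair in clusters for r in pair]
ys = [yi for yi in y for _ in range(2)]
theta, A = ols(rows, ys)

# Cluster-robust sandwich A^{-1} B A^{-1}, with subjects as clusters.
k = 5
B = [[0.0] * k for _ in range(k)]
for (r1, r2), yi in zip(clusters, y):
    e1 = yi - sum(t * v for t, v in zip(theta, r1))
    e2 = yi - sum(t * v for t, v in zip(theta, r2))
    score = [c1 * e1 + c2 * e2 for c1, c2 in zip(r1, r2)]
    for u in range(k):
        for v in range(k):
            B[u][v] += score[u] * score[v]
a_inv_b = solve(A, B)
cov = solve(A, [list(col) for col in zip(*a_inv_b)])

cov_marg_cond = cov[1][3]                        # Cov of the two exposure effects
var_diff = cov[1][1] + cov[3][3] - 2.0 * cov[1][3]
print(theta[1], theta[3], cov_marg_cond, var_diff)
```

The stacked fit reproduces the separately fitted exposure coefficients, while the sandwich matrix additionally supplies their covariance, which separate fits cannot provide.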
The SAS macro
References
- [1]↑
Freedman LS, Schatzkin A. Sample size for studying intermediate endpoints within intervention trials or observational studies. Am. J. Epidemiol. 1992;136:1148–1159.
- [3]↑
MacKinnon DP, Dwyer JH. Estimating mediated effects in prevention studies. Eval. Rev. 1993;17:144–158.
- [4]↑
MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol. Meth. 2002;7:83.
- [5]↑
Baron RM, Kenny DA. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. J. Personality Social Psychol. 1986;51:1173.
- [6]↑
Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation effects. Stat. Sci. 2010b:51–71.
- [7]↑
Pearl J. Direct and indirect effects. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2001:411–420.
- [8]↑
Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology 1992:143–155.
- [9]↑
Huang B, Sivaganesan S, Succop P, Goodman E. Statistical assessment of mediational effects for logistic mediational models. Stat. Med. 2004;23:2713–2728.
- [10]↑
Wang W, Albert JM. Estimation of mediation effects for zero-inflated regression models. Stat. Med. 2012;31:3118–3132.
- [11]↑
Huang Y-T, Pan W-C. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics 2015.
- [12]↑
Alwin DF, Hauser RM. The decomposition of effects in path analysis. Am. Sociological Rev. 1975:37–47.
- [13]↑
Judd CM, Kenny DA. Process analysis: estimating mediation in treatment evaluations. Eval. Rev. 1981;5:602–619.
- [14]↑
Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Stat. Med. 1992;11:167–178.
- [15]↑
Lin D, Fleming T, De Gruttola V. Estimating the proportion of treatment effect explained by a surrogate marker. Stat. Med. 1997;16:1515–1527.
- [16]↑
Parast L, McDermott MM, Tian L. Robust estimation of the proportion of treatment effect explained by surrogate marker information. Stat. Med. 2015.
- [17]↑
Sobel ME. Asymptotic confidence intervals for indirect effects in structural equation models. Soc. Method. 1982;13:290–312.
- [19]↑
VanderWeele T. Explanation in causal inference: methods for mediation and interaction. Oxford University Press, 2015.
- [20]↑
Imai K, Keele L, Tingley D. A general approach to causal mediation analysis. Psychol. Meth. 2010a;15:309.
- [21]↑
Tingley D, Yamamoto T, Hirose K, Keele L, Imai K. Mediation: R package for causal mediation analysis. 2014.
- [22]↑
Carney RM, Howells WB, Blumenthal JA, Freedland KE, Stein PK, Berkman LF, Watkins LL, Czajkowski SM, Steinmeyer B, Hayano J, et al. Heart rate turbulence, depression, and survival after acute myocardial infarction. Psychosomatic Med. 2007;69:4–9.
- [23]↑
Lyall K, Ashwood P, Van de Water J, Hertz-Picciotto I. Maternal immune-mediated conditions, autism spectrum disorders, and developmental delay. J. Autism Dev. Disorders 2014;44:1546–1555.
- [24]↑
Reisner SL, Greytak EA, Parsons JT, Ybarra ML. Gender minority social stress in adolescence: disparities in adolescent bullying and substance use by gender identity. J. Sex Res. 2015;52:243–256.
- [25]↑
Roberts AL, Rosario M, Corliss HL, Koenen KC, Austin SB. Childhood gender nonconformity: A risk indicator for childhood abuse and posttraumatic stress in youth. Pediatrics 2012;129:410–417.
- [27]↑
Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986:13–22.
- [29]↑
Pearl J. The causal mediation formula–a guide to the assessment of pathways and mechanisms. Prev. Sci. 2012;13:426–436.
- [30]↑
VanderWeele TJ, Vansteelandt S. Odds ratios for mediation analysis for a dichotomous outcome. Am. J. Epidemiol. 2010;172:1339–1348.
- [31]↑
Valeri L, Lin X, VanderWeele TJ. Mediation analysis when a continuous mediator is measured with error and the outcome follows a generalized linear model. Stat. Med. 2014;33:4875–4890.
- [32]↑
Tchetgen Tchetgen EJ. Inverse odds ratio-weighted estimation for causal mediation analysis. Stat. Med. 2013;32:4567–4580.
- [33]↑
Lange T, Hansen JV. Direct and indirect effects in a survival context. Epidemiology 2011;22:575–581.
- [34]↑
Lange T, Vansteelandt S, Bekaert M. A simple unified approach for estimating natural direct and indirect effects. Am. J. Epidemiol. 2012;176:190–195.
- [35]↑
Tchetgen Tchetgen EJ. On causal mediation analysis with a survival outcome. Int. J. Biostat. 2011;7:1–38.
- [36]↑
VanderWeele TJ. Causal mediation analysis with survival data. Epidemiology (Cambridge, Mass.) 2011;22:582.
- [37]↑
Ritz J, Spiegelman D. Equivalence of conditional and marginal regression models for clustered and longitudinal data. Stat. Meth. Med. Res. 2004;13:309–323.
- [38]↑
Valeri L, VanderWeele TJ. Mediation analysis allowing for exposure–mediator interactions and causal interpretation: Theoretical assumptions and implementation with SAS and SPSS macros. Psychol. Meth. 2013;18:137.
- [39]↑
Jiang Z, VanderWeele TJ. When is the difference method conservative for assessing mediation? Am. J. Epidemiol. 2015;182:105–8.
- [40]↑
Freedman LS. Confidence intervals and statistical power of the “validation” ratio for surrogate or intermediate endpoints. J. Stat. Plann. Inference 2001;96:143–153.
- [41]↑
MacKinnon DP, Warsi G, Dwyer JH. A simulation study of mediated effect measures. Multivariate Behav. Res. 1995;30:41–62.
- [42]↑
Tein J-Y, MacKinnon DP. Estimating mediated effects with survival data. In: New developments in psychometrics. Springer, 2003:405–412.
- [43]↑
Belanger CF, Hennekens CH, Rosner B, Speizer FE. The nurses’ health study. Am. J. Nursing 1978;78:1039–1040.
- [44]↑
Wolf AM, Hunter DJ, Colditz GA, Manson JE, Stampfer MJ, Corsano KA, Rosner B, Kriska A, Willett WC. Reproducibility and validity of a self-administered physical activity questionnaire. Int. J. Epidemiol. 1994;23:991–999.
- [45]↑
McCormack VA, dos Santos Silva I. Breast density and parenchymal patterns as markers of breast cancer risk: a meta-analysis. Cancer Epidemiol. Biomarkers Prev. 2006;15:1159–1169.
- [46]↑
Rice MS, Bertrand KA, VanderWeele TJ, Rosner BA, Liao X, Adami H-O, Tamimi RM. Mammographic density and breast cancer risk: a mediation analysis. Breast Cancer Res. 2016;18:94.
- [47]↑
Spiegelman D, Hertzmark E. Easy SAS calculations for risk or prevalence ratios and differences. Am. J. Epidemiol. 2005;162:199–200.
- [48]↑
Stone CJ. The dimensionality reduction principle for generalized additive models. Ann. Stat. 1986:590–606.
- [49]↑
Steen J, Loeys T, Moerkerke B, Vansteelandt S. medflex: An R package for flexible mediation analysis using natural effect models. J. Stat. Softw. 2017;76:1–46.
Supplemental Material
The online version of this article offers supplementary material (


