Published by De Gruyter November 29, 2019

Causal Mediation Analysis in the Presence of a Misclassified Binary Exposure

Zhichao Jiang and Tyler VanderWeele
From the journal Epidemiologic Methods

Abstract

Mediation analysis is popular in examining the extent to which the effect of an exposure on an outcome is through an intermediate variable. When the exposure is subject to misclassification, the effects estimated can be severely biased. In this paper, when the mediator is binary, we first study the bias on traditional direct and indirect effect estimates in the presence of conditional non-differential misclassification of a binary exposure. We show that in the absence of interaction, the misclassification of the exposure will bias the direct effect towards the null but can bias the indirect effect in either direction. We then develop an EM algorithm approach to correcting for the misclassification, and conduct simulation studies to assess the performance of the correction approach. Finally, we apply the approach to National Center for Health Statistics birth certificate data to study the effect of smoking status on the preterm birth mediated through pre-eclampsia.

1 Introduction

Causal mediation analysis investigates the role of an intermediate variable in explaining the relationship between an exposure variable and an outcome of interest. Mediation analysis has been widely used in various fields and was made popular by the seminal work of Baron and Kenny (1986). Recently, mediation analysis has been advanced by applying the counterfactual framework (e. g. Robins and Greenland 1992; Pearl 2001; VanderWeele and Vansteelandt 2009, 2010; Imai, Keele, and Tingley 2010). The use of the counterfactual framework allows the total effect to be decomposed into a direct effect and an indirect effect even in the presence of interactions and non-linearities (VanderWeele and Vansteelandt 2009, 2010).

In practice, variables may not be measured accurately, which leads to the problem of measurement error. For binary variables, this is usually referred to as misclassification. As demonstrated by Aigner (1973), the classification error of a binary regressor may have a non-zero mean and is negatively correlated with the true underlying variable. This is different from the measurement error of a continuous variable which is usually assumed to be centered and uncorrelated with the underlying variable. As a result, for misclassification problems, the traditional ordinary least squares estimators will lead to incorrect conclusions. To deal with this problem, several papers propose methods to correct for the misclassification (e. g. Freeman 1983; Savoca 2000).

In the context of mediation analysis, when a binary mediator is misclassified, Valeri and VanderWeele (2014) deduce the asymptotic bias for the estimators using the observed mediator and propose correction approaches, allowing for the presence of exposure-mediator interaction. In the non-parametric setting, VanderWeele, Valeri, and Ogburn (2012) investigate the sign of the bias due to the misclassification of a binary mediator. There are also many results on the measurement error of a continuous mediator in various settings (e. g. Hoyle and Kenny 1999; le Cessie et al. 2012; Valeri, Lin, and VanderWeele 2014). However, no one has considered work on misclassification of a binary exposure in the context of mediation, though this is a common problem. For example, in our data illustration, the exposure is the smoking status. Because the smoking status of study participants is self-reported, the exposure is typically subject to misclassification, which may jeopardize our inference.

The present work makes two main contributions. First, when the mediator is binary, we study the implications of using a misclassified exposure in the non-parametric setting. The traditional non-differential misclassification assumption requires that the classification error is independent of other variables. In contrast, we allow the classification error to depend on the covariates. Thus, our assumption of misclassification is weaker than the traditional non-differential misclassification assumption. We derive the relationships between the naive estimators using the observed exposure and the true natural direct and indirect effects.

Second, we propose an approach to correcting the bias for estimating natural direct and indirect effects. The approach requires a logistic model for the exposure, and pre-specified sensitivity and specificity parameters for the misclassification. We can use this approach in sensitivity analysis by varying the sensitivity and specificity parameters in a pre-specified range. To evaluate the performance of the approach, we conduct a simulation study with different values of the sensitivity and specificity parameters.

The paper is organized as follows. We introduce the counterfactual framework, notation and confounding assumptions in Section 2. In Section 3, we present the results for biases when the exposure is misclassified and the mediator is binary in the non-parametric setting. We propose an approach to correcting the bias for estimating direct and indirect effects in Section 4 and conduct a simulation study in Section 5. In Section 6, we apply the proposed approach to a perinatal epidemiological study. Finally, we conclude with a discussion in Section 7. In the online supplementary materials, we present all the proofs and  details for the simulation study and the real data example.

2 Mediation analysis based on the counterfactual framework

Let $A$ denote a binary exposure, $M$ denote a mediator, $Y$ denote an outcome of interest, and $C$ denote a vector of covariates. Let $Y_a$ and $M_a$ denote the values of the outcome and mediator that would have been observed if the exposure $A$ had been set to level $a$. Let $Y_{am}$ denote the value of the outcome that would have been observed if the exposure and the mediator had been set to levels $a$ and $m$, respectively. Let $Y_{aM_{a'}}$ denote the value of the outcome that would have been observed if the exposure and the mediator had been set to $a$ and $M_{a'}$, respectively. The average total effect, conditional on $C=c$, comparing exposure levels 1 with 0, is defined by $\mathrm{TE}_c = E(Y_1 - Y_0 \mid C=c)$, which compares the average outcome in subgroup $C=c$ if $A$ had been set to 1 with the average outcome in subgroup $C=c$ if $A$ had been set to 0. The natural direct effect, conditional on $C=c$, comparing the effect of the exposure levels 1 and 0 while fixing the mediator to the level it would have naturally been under some reference condition for the exposure, $A = 0$, is defined by $\mathrm{NDE}_c = E(Y_{1M_0} - Y_{0M_0} \mid C=c)$. The natural indirect effect, conditional on $C=c$, comparing the effect of the mediator at levels $M_1$ and $M_0$ while fixing the exposure at level 1, is defined by $\mathrm{NIE}_c = E(Y_{1M_1} - Y_{1M_0} \mid C=c)$ (Robins and Greenland 1992; Pearl 2001).

Let $X_1 \perp X_2 \mid X_3$ denote that $X_1$ is independent of $X_2$ conditional on $X_3$. To identify natural direct and indirect effects, the following four confounding assumptions are sufficient. Conditioning on covariates $C$, there is no unmeasured confounding of (i) the exposure-outcome relationship ($Y_a \perp A \mid C$), (ii) the mediator-outcome relationship ($Y_{am} \perp M \mid C$), (iii) the exposure-mediator relationship ($M_a \perp A \mid C$), and (iv) there are no mediator-outcome confounders affected by the exposure ($Y_{am} \perp M_{a'} \mid C$). Under these assumptions, we have the mediation formulas (Pearl 2001; VanderWeele 2015):

(1) $\mathrm{NDE}_c = \sum_m \{E(Y \mid A=1, M=m, C=c) - E(Y \mid A=0, M=m, C=c)\}\, P(M=m \mid A=0, C=c),$
(2) $\mathrm{NIE}_c = \sum_m E(Y \mid M=m, A=1, C=c)\, \{P(M=m \mid A=1, C=c) - P(M=m \mid A=0, C=c)\}.$

For a binary outcome, we are often interested in the natural direct and indirect effects on the odds ratio scale (VanderWeele 2015), defined as

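(3) $\mathrm{OR}_c^{\mathrm{NDE}} = \dfrac{P(Y_{1M_0}=1 \mid C=c)\, P(Y_{0M_0}=0 \mid C=c)}{P(Y_{1M_0}=0 \mid C=c)\, P(Y_{0M_0}=1 \mid C=c)},$
(4) $\mathrm{OR}_c^{\mathrm{NIE}} = \dfrac{P(Y_{1M_1}=1 \mid C=c)\, P(Y_{1M_0}=0 \mid C=c)}{P(Y_{1M_1}=0 \mid C=c)\, P(Y_{1M_0}=1 \mid C=c)}.$

Under the confounding assumptions (i)–(iv), we can identify $\mathrm{OR}_c^{\mathrm{NDE}}$ and $\mathrm{OR}_c^{\mathrm{NIE}}$ based on the following formula for $a, a' = 0, 1$:

$P(Y_{aM_{a'}}=1 \mid C=c) = \sum_{m=0,1} P(Y=1 \mid A=a, M=m, C=c)\, P(M=m \mid A=a', C=c).$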
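As a rough illustration of how the identification formula above turns fitted models into odds-ratio-scale effects, the following sketch evaluates it for a binary mediator and a binary outcome under logistic models without covariates. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from any data in this paper.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def p_counterfactual(a, a_prime, beta, theta):
    """P(Y_{a M_{a'}} = 1) = sum_m P(Y=1 | A=a, M=m) * P(M=m | A=a').

    beta  = (b0, b1):          logit P(M=1 | A=a) = b0 + b1*a
    theta = (t0, t1, t2, t3):  logit P(Y=1 | A=a, M=m) = t0 + t1*a + t2*m + t3*a*m
    """
    p_m1 = expit(beta[0] + beta[1] * a_prime)
    total = 0.0
    for m in (0, 1):
        p_y1 = expit(theta[0] + theta[1] * a + theta[2] * m + theta[3] * a * m)
        total += p_y1 * (p_m1 if m == 1 else 1.0 - p_m1)
    return total

def odds(p):
    return p / (1.0 - p)

def odds_ratio_effects(beta, theta):
    """Return (OR^NDE, OR^NIE, OR^TE) as in definitions (3) and (4)."""
    p00 = p_counterfactual(0, 0, beta, theta)
    p10 = p_counterfactual(1, 0, beta, theta)
    p11 = p_counterfactual(1, 1, beta, theta)
    return odds(p10) / odds(p00), odds(p11) / odds(p10), odds(p11) / odds(p00)

# Hypothetical coefficients (assumed values, for illustration only).
beta = (-1.0, 0.8)
theta = (-2.0, 0.6, 0.9, 0.0)
or_nde, or_nie, or_te = odds_ratio_effects(beta, theta)
print(or_nde, or_nie, or_te)
```

By construction the total effect odds ratio factors as the product of the direct and indirect odds ratios, which gives a quick sanity check on any implementation.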

3 Results on direct and indirect effects with a binary mediator when the exposure is misclassified

In this section, we derive the relationships between the true natural direct and indirect effects and the naive estimated effects with a correctly measured binary mediator. We do not provide a closed form expression of the bias for cases with a continuous mediator but will propose an approach to correcting for bias with both binary and continuous mediators in the next section.

Let $A^*$ denote the observed exposure. We assume that $A^*$ is non-differentially misclassified conditional on the covariates $C$, i.e. $P(A^* \mid A, M, Y, C) = P(A^* \mid A, C)$. This conditional non-differential misclassification assumption allows the misclassification probability to depend on the covariates. Therefore, it is weaker than the traditional non-differential misclassification assumption, which assumes that the misclassification probability does not depend on the covariates, i.e. $P(A^* \mid A, M, Y, C) = P(A^* \mid A)$.

For a binary mediator, we do not need any parametric assumptions and we can write (1) and (2) as

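(5) $\mathrm{NDE}_c = (\eta_{11c} - \eta_{01c})\,\delta_{0c} + (\eta_{10c} - \eta_{00c})(1 - \delta_{0c}),$
(6) $\mathrm{NIE}_c = (\eta_{11c} - \eta_{10c})(\delta_{1c} - \delta_{0c}),$

where

$\delta_{ac} = P(M=1 \mid A=a, C=c), \quad \eta_{amc} = E(Y \mid A=a, M=m, C=c).$

Because $\delta_{ac}$ and $\eta_{amc}$ are defined in terms of the true exposure $A$, they cannot be calculated from the observed data. We then define the observed analogue of $\delta_{ac}$ by replacing $A$ with $A^*$: $\delta^*_{ac} = P(M=1 \mid A^*=a, C=c)$.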

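The naive estimators use the observed exposure instead of the true exposure to calculate the direct and indirect effects, and we denote them by $\mathrm{NDE}_c^{na}$ and $\mathrm{NIE}_c^{na}$, respectively. From (1) with $A$ replaced by $A^*$, we have

(7) $\mathrm{NDE}_c^{na} = (SN_c + SP_c - 1)\left\{(\eta_{11c} - \eta_{01c})\dfrac{\delta_{1c}\delta_{0c}}{\delta^*_{1c}} + (\eta_{10c} - \eta_{00c})\dfrac{(1-\delta_{1c})(1-\delta_{0c})}{1-\delta^*_{1c}}\right\},$

where $SN_c = P(A^*=1 \mid A=1, C=c)$ and $SP_c = P(A^*=0 \mid A=0, C=c)$. We show the detailed proof of (7) in Appendix S1.1. Furthermore, if $\eta_{11c} - \eta_{01c} - \eta_{10c} + \eta_{00c} = 0$, which is an average no-interaction assumption on the additive scale conditional on $C=c$, then we can simplify (7) as

$\mathrm{NDE}_c^{na} = \mathrm{NDE}_c \cdot \dfrac{\delta_{1c}\delta_{0c} - \delta^*_{1c}(\delta_{1c} + \delta_{0c} - 1)}{\delta^*_{1c}(1-\delta^*_{1c})}\,(SN_c + SP_c - 1).$

In Appendix S1.1, we show that

$\left|\dfrac{\delta_{1c}\delta_{0c} - \delta^*_{1c}(\delta_{1c} + \delta_{0c} - 1)}{\delta^*_{1c}(1-\delta^*_{1c})}\,(SN_c + SP_c - 1)\right| \le 1.$

Therefore, $|\mathrm{NDE}_c^{na}| \le |\mathrm{NDE}_c|$, i.e. the naive estimator is always biased towards the null for the true natural direct effect under the average no-interaction assumption conditional on $C=c$. For the natural indirect effect, we show in Appendix S1.2 that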

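(8) $\mathrm{NIE}_c^{na} = \{E(Y \mid A^*=1, M=1, C=c) - E(Y \mid A^*=1, M=0, C=c)\}(\delta^*_{1c} - \delta^*_{0c})$
(9) $= \left\{(\eta_{11c} - \eta_{01c})\dfrac{\delta_{1c}\,SN_c}{\delta^*_{1c}} + \eta_{01c} - (\eta_{10c} - \eta_{00c})\dfrac{(1-\delta_{1c})\,SN_c}{1-\delta^*_{1c}} - \eta_{00c}\right\}(\delta_{1c} - \delta_{0c})(SN_c + SP_c - 1).$

Equation (9) is more complex than (7) for $\mathrm{NDE}_c^{na}$, and does not have further implications without additional assumptions. Under the average no-interaction assumption conditional on $C=c$, we show in Appendix S1.2 that

(10) $\mathrm{NIE}_c^{na} = (SN_c + SP_c - 1)\,\mathrm{NIE}_c + (\eta_{11c} - \eta_{01c})\,\dfrac{(\delta_{1c} - \delta_{0c})^2\, SN_c (1 - SN_c)(SN_c + SP_c - 1)}{\delta^*_{1c}(1-\delta^*_{1c})}.$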

Unlike the natural direct effect, the naive estimator will not always estimate the true natural indirect effect towards the null. For example, suppose that $\eta_{11c} = \eta_{10c} = 0.6$, $\eta_{01c} = \eta_{00c} = 0.4$, $SN_c = SP_c = 3/4$, $\delta_{1c} = 3/4$, $\delta_{0c} = 1/4$. Then there is no interaction, $\mathrm{NIE}_c = 0$ and $\mathrm{NIE}_c^{na} = 0.02 > 0$; if instead $\eta_{11c} = \eta_{01c} = 0.6$, $\eta_{10c} = \eta_{00c} = 0.4$, and the other parameters remain unchanged, then $\mathrm{NIE}_c = 0.1$ and $\mathrm{NIE}_c^{na} = 0.05 < 0.1$. Therefore, when the exposure is misclassified, the naive estimator is not always conservative for the true natural indirect effect. This is different from the conclusion in VanderWeele et al. (2012) concerning measurement error on the mediator. They find that the bias of the naive estimator can be only in one direction when the mediator is measured with error, while in our case, when the exposure is misclassified, the bias for the indirect effect can be in either direction.
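The two numerical examples above can be checked by brute force. The sketch below does not rely on the closed forms (7)–(10); instead it builds the joint distribution of the true exposure, the observed exposure and the mediator, and evaluates the naive estimands directly from their definitions. The text does not state the exposure prevalence for this example, so the value $P(A=1)=0.5$ used here is an assumption; the function names are illustrative.

```python
def true_effects(eta, delta):
    """NDE and NIE from (5) and (6); eta[a][m] = E(Y|A=a,M=m), delta[a] = P(M=1|A=a)."""
    nde = (eta[1][1] - eta[0][1]) * delta[0] + (eta[1][0] - eta[0][0]) * (1 - delta[0])
    nie = (eta[1][1] - eta[1][0]) * (delta[1] - delta[0])
    return nde, nie

def naive_effects(eta, delta, pi, sn, sp):
    """Naive NDE and NIE: formulas (1) and (2) with A replaced by the observed A*."""
    def joint(a, s, m):
        # P(A=a, A*=s, M=m) under non-differential misclassification
        p_a = pi if a == 1 else 1 - pi
        p_s1 = sn if a == 1 else 1 - sp          # P(A*=1 | A=a)
        p_s = p_s1 if s == 1 else 1 - p_s1
        p_m = delta[a] if m == 1 else 1 - delta[a]
        return p_a * p_s * p_m

    def e_y(s, m):
        # E(Y | A*=s, M=m), averaging eta over the true exposure
        den = joint(0, s, m) + joint(1, s, m)
        return (eta[0][m] * joint(0, s, m) + eta[1][m] * joint(1, s, m)) / den

    def delta_star(s):
        # P(M=1 | A*=s)
        num = joint(0, s, 1) + joint(1, s, 1)
        return num / (num + joint(0, s, 0) + joint(1, s, 0))

    d0, d1 = delta_star(0), delta_star(1)
    nde_na = (e_y(1, 1) - e_y(0, 1)) * d0 + (e_y(1, 0) - e_y(0, 0)) * (1 - d0)
    nie_na = (e_y(1, 1) - e_y(1, 0)) * (d1 - d0)
    return nde_na, nie_na

delta = [0.25, 0.75]                  # delta[a] = P(M=1 | A=a)
pi, sn, sp = 0.5, 0.75, 0.75          # pi = P(A=1) is an assumed value

eta_first = [[0.4, 0.4], [0.6, 0.6]]   # eta11 = eta10 = 0.6, eta01 = eta00 = 0.4
eta_second = [[0.4, 0.6], [0.4, 0.6]]  # eta11 = eta01 = 0.6, eta10 = eta00 = 0.4

nde1, nie1 = true_effects(eta_first, delta)
nde1_na, nie1_na = naive_effects(eta_first, delta, pi, sn, sp)
nde2, nie2 = true_effects(eta_second, delta)
nde2_na, nie2_na = naive_effects(eta_second, delta, pi, sn, sp)
print(nie1, nie1_na, nie2, nie2_na)
```

Under the assumed prevalence, the first setting gives a naive indirect effect away from the null and the second gives one towards the null, while the naive direct effect in the first setting is attenuated.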

In practice, if we have some prior knowledge, we can still obtain the direction of bias of the natural indirect effect. For example, if we know that $SN_c$ is very close to 1, then the second term in (10) is close to 0. Thus, we have $|\mathrm{NIE}_c^{na}| \approx |(SN_c + SP_c - 1)\,\mathrm{NIE}_c| \le |\mathrm{NIE}_c|$, i.e. the naive estimator estimates the true natural indirect effect towards the null.

To better understand the relationship between the true effects and the naive estimands, we give two artificial examples under two scenarios without covariates:

  1. high specificity (SP = 0.9): $\eta_{11} = 0.8$, $\eta_{10} = 0.6$, $\eta_{01} = 0.4$ and $\eta_{00} = 0.2$,

  2. low specificity (SP = 0.6): $\eta_{11} = 0.6$, $\eta_{10} = 0.6$, $\eta_{01} = 0.7$ and $\eta_{00} = 0.3$.

In these two scenarios, we choose $P(A = 1) = 0.6$, $\delta_1 = 0.8$, $\delta_0 = 0.2$ and vary SN from 0.5 to 1. We calculate NDE and NIE according to (5) and (6), and calculate $\mathrm{NDE}^{na}$ and $\mathrm{NIE}^{na}$ according to (7) and (10). Figure 1 shows the plots of $\mathrm{NDE}^{na}$ and $\mathrm{NIE}^{na}$. From the plots, we can see that both $\mathrm{NDE}^{na}$ and $\mathrm{NIE}^{na}$ increase as SN increases. In addition, we have $|\mathrm{NDE}^{na}| \le |\mathrm{NDE}|$, which is consistent with our theoretical results. In contrast, both $|\mathrm{NIE}^{na}| \le |\mathrm{NIE}|$ and $|\mathrm{NIE}^{na}| \ge |\mathrm{NIE}|$ are possible in the plots.

Figure 1: Plots of NDEna and NIEna when SN varies from 0.5 to 1. The solid line is NDEna and the dashed line is NIEna.


The relationship between the true effects and the naive estimators allows for drawing qualitative conclusions on the risk difference scale. Moreover, Jiang et al. (2015) showed that a positive causal effect on the log odds ratio scale implies that the effect is positive on the risk ratio scale, which further implies that the effect is positive on the risk difference scale. Therefore, our result is also useful for drawing qualitative conclusions on the risk ratio and odds ratio scales. For example, if the naive estimator of the natural direct effect on the odds ratio scale is larger than 1, then we know that the naive estimator on the risk difference scale is positive. Therefore, if there is no interaction on the additive scale, we then have $|\mathrm{NDE}_c^{na}| \le |\mathrm{NDE}_c|$, and thus we can obtain that the true natural direct effect on the risk difference scale is positive. As a result, we can conclude that the true natural direct effect on the odds ratio scale is larger than 1. In the application section, we will present the results on both the odds ratio and risk difference scales.

4 Correction approach for direct and indirect effect estimators

Misclassification of the true exposure $A$ can be seen as a type of missing data problem. In this section, we propose an approach based on the Expectation-Maximization (EM) algorithm (Dempster et al. 1977) to correct the bias resulting from the misclassified exposure. We focus on the cases with binary mediators and comment on the cases with continuous mediators at the end of this section. The approach uses the EM algorithm to obtain the maximum likelihood estimates (MLEs) of the true direct and indirect effects. It requires $SN_c$ and $SP_c$ to be known. It also requires us to specify the distributions of $A$, $M$ and $Y$. This is different from mediation analysis without exposure misclassification, where the distribution of $A$ is not needed and only the expectation of $Y$, rather than its full distribution, is needed. In some cases, we can avoid imposing parametric assumptions by stratifying on the covariates. However, when there are too many covariates, stratification is not possible and we need to assume parametric models for $A$, $M$ and $Y$. For the binary exposure $A$, we assume the following logistic regression model:

$\mathrm{logit}\{P(A=1 \mid C=c)\} = \gamma^{\top} c,$

where γ is estimated from the observed data. For binary mediators, we assume a logistic regression model. For binary outcomes, we assume a logistic regression model, and for continuous outcomes, we assume a linear regression model. After implementing the approach, we can obtain the MLEs of the parameters in the regression models. By plugging the estimated parameters into (1) and (2) or (3) and (4), we then obtain the maximum likelihood estimators of the direct and indirect effects on the risk difference or odds ratio scale.

Before introducing the correction approach, we need to show the validity of the EM algorithm. That is, the joint distribution of (A,M,Y,C) is identifiable from the observed data if the sensitivity and specificity parameters are known. From the law of total probability, we have

(11) $P(A^*=1, M, Y, C) = P(A=1, A^*=1, M, Y, C) + P(A=0, A^*=1, M, Y, C) = P(A=1, M, Y, C)\,SN_c + P(A=0, M, Y, C)(1 - SP_c),$

where the last equality follows from $P(A^* \mid A, M, Y, C) = P(A^* \mid A, C)$. Similarly, we can obtain

(12) $P(A^*=0, M, Y, C) = P(A=1, M, Y, C)(1 - SN_c) + P(A=0, M, Y, C)\,SP_c.$

The probabilities on the left hand side of (11) and (12) can be identified from the observed data. When $SN_c + SP_c > 1$, which is reasonable in most real problems, we can uniquely solve for $P(A, M, Y, C)$ from (11) and (12). Therefore, the joint distribution of $(A, M, Y, C)$ is identifiable. As a result, we can use the EM algorithm to obtain the MLEs of the parameters. Note that the identification results do not require any parametric models and thus the EM algorithm is valid for both continuous and binary mediators and outcomes.
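To make the identification argument concrete, the following minimal sketch solves the two linear equations (11) and (12) for a single cell $(M, Y, C)$. The function name and the numerical values are illustrative assumptions; the determinant of the system is $SN_c + SP_c - 1$, which is why that quantity must be non-zero (and is taken to be positive here).

```python
def recover_true_cell(q1, q0, sn, sp):
    """Solve (11)-(12) for one cell (m, y, c).

    q1 = P(A*=1, M, Y, C), q0 = P(A*=0, M, Y, C) are observed-data quantities.
    Returns (p1, p0) = (P(A=1, M, Y, C), P(A=0, M, Y, C)).
    """
    det = sn + sp - 1.0
    if det <= 0:
        raise ValueError("requires SN + SP > 1")
    p1 = (sp * q1 - (1.0 - sp) * q0) / det
    p0 = (sn * q0 - (1.0 - sn) * q1) / det
    return p1, p0

# Round-trip check with assumed values: start from true cell probabilities,
# apply the misclassification as in (11)-(12), then invert.
sn, sp = 0.8, 0.9
p1_true, p0_true = 0.12, 0.08
q1 = p1_true * sn + p0_true * (1.0 - sp)
q0 = p1_true * (1.0 - sn) + p0_true * sp
p1_rec, p0_rec = recover_true_cell(q1, q0, sn, sp)
print(p1_rec, p0_rec)
```

With estimated rather than exact cell probabilities, the solution can fall outside $[0, 1]$; the likelihood-based EM approach avoids this by construction.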

Let γ, β and θ denote the parameters of the exposure model, the mediator model and the outcome model, respectively. To implement the EM algorithm, we can write the complete-data log-likelihood as

$\ell(\gamma, \beta, \theta) = \sum_{i=1}^{n} \{\ell_{A \mid C}(\gamma \mid A_i, C_i) + \ell_{A^* \mid A, C}(A_i^*, A_i, C_i) + \ell_{M \mid A, C}(\beta \mid M_i, A_i, C_i) + \ell_{Y \mid A, M, C}(\theta \mid A_i, M_i, Y_i, C_i)\},$

where $\ell_{A \mid C}$, $\ell_{A^* \mid A, C}$, $\ell_{M \mid A, C}$ and $\ell_{Y \mid A, M, C}$ are log-likelihoods of the exposure model, the misclassification model, the mediator model and the outcome model, respectively. Because $SN_c$ and $SP_c$ are pre-specified, we can ignore the term $\ell_{A^* \mid A, C}$ in the EM algorithm. As a remark, for a binary outcome, $\theta$ includes only the regression coefficients in the models, but for a continuous outcome, it consists of both the regression coefficients and the variance of the error term.

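The EM algorithm consists of two steps, the E-step and the M-step. At the E-step, we calculate the expectation of the log-likelihood conditional on the current parameters and the observed data $(A^*, M, Y, C)$, denoted by $Q(\gamma, \beta, \theta \mid \gamma^{(k)}, \beta^{(k)}, \theta^{(k)})$, where $\gamma^{(k)}, \beta^{(k)}, \theta^{(k)}$ are the estimates of the parameters after the $k$-th iteration of the M-step. We then have

(13) $Q(\gamma, \beta, \theta \mid \gamma^{(k)}, \beta^{(k)}, \theta^{(k)}) = \sum_{i=1}^{n} \sum_{a=0,1} h_{ia}(\gamma^{(k)}, \beta^{(k)}, \theta^{(k)})\, \ell_{A \mid C}(\gamma \mid A_i = a, C_i) + \sum_{i=1}^{n} \sum_{a=0,1} h_{ia}(\gamma^{(k)}, \beta^{(k)}, \theta^{(k)})\, \ell_{M \mid A, C}(\beta \mid M_i, A_i = a, C_i) + \sum_{i=1}^{n} \sum_{a=0,1} h_{ia}(\gamma^{(k)}, \beta^{(k)}, \theta^{(k)})\, \ell_{Y \mid A, M, C}(\theta \mid A_i = a, M_i, Y_i, C_i),$

where, for $a = 0, 1$,

$h_{ia}(\gamma^{(k)}, \beta^{(k)}, \theta^{(k)}) = P(A_i = a \mid A_i^*, M_i, Y_i, C_i) = \dfrac{P(A_i = a \mid C_i)\, P(A_i^* \mid A_i = a, C_i)\, P(M_i \mid A_i = a, C_i)\, P(Y_i \mid A_i = a, M_i, C_i)}{\sum_{a'=0,1} P(A_i = a' \mid C_i)\, P(A_i^* \mid A_i = a', C_i)\, P(M_i \mid A_i = a', C_i)\, P(Y_i \mid A_i = a', M_i, C_i)},$

and $P(A_i = a \mid C_i)$, $P(M_i \mid A_i = a, C_i)$ and $P(Y_i \mid A_i = a, M_i, C_i)$ are the likelihood functions for the linear model or the logistic regression model, evaluated at the current parameter values. Note that $P(A_i^* \mid A_i, C_i)$ depends on the pre-specified $SN_c$ and $SP_c$. In practice, we can set a plausible range of values for them to conduct sensitivity analysis. At the M-step, we update the parameters by maximizing $Q(\gamma, \beta, \theta \mid \gamma^{(k)}, \beta^{(k)}, \theta^{(k)})$. From (13), we can maximize the log-likelihoods for the three regression models separately using weighted linear or logistic regression. The corrected coefficients from the mediator and outcome models immediately give the corrected natural direct and indirect effects.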
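The E-step and M-step above can be sketched in a few lines for the simplest case: binary mediator, binary outcome and no covariates. This is a minimal illustration under stated assumptions, not the authors' implementation. Without covariates the three logistic models are saturated, so the weighted logistic regressions of the M-step reduce to weighted proportions, which is what the code uses. The function name `em_binary`, the starting values and the demo parameters are all hypothetical; the demo feeds the exact population cell probabilities as weights so that the result is free of sampling noise.

```python
import numpy as np

def em_binary(a_star, m, y, sn, sp, weights=None, n_iter=500):
    """EM for a misclassified binary exposure with known SN and SP.

    Returns pi = P(A=1), delta[a] = P(M=1|A=a), eta[a, m] = P(Y=1|A=a, M=m).
    """
    a_star = np.asarray(a_star, dtype=int)
    m = np.asarray(m, dtype=int)
    y = np.asarray(y, dtype=int)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    pi, delta, eta = 0.5, np.array([0.4, 0.6]), np.full((2, 2), 0.5)  # arbitrary start
    for _ in range(n_iter):
        # E-step: h[a, i] = P(A_i = a | A*_i, M_i, Y_i) under current parameters
        lik = np.empty((2, len(y)))
        for a in (0, 1):
            p_a = pi if a == 1 else 1.0 - pi
            p_report1 = sn if a == 1 else 1.0 - sp       # P(A*=1 | A=a)
            p_astar = np.where(a_star == 1, p_report1, 1.0 - p_report1)
            p_m = np.where(m == 1, delta[a], 1.0 - delta[a])
            p_y = np.where(y == 1, eta[a, m], 1.0 - eta[a, m])
            lik[a] = p_a * p_astar * p_m * p_y
        h = lik / lik.sum(axis=0)
        # M-step: weighted proportions (saturated analogue of weighted logistic fits)
        pi = np.sum(w * h[1]) / np.sum(w)
        for a in (0, 1):
            wa = w * h[a]
            delta[a] = np.sum(wa * m) / np.sum(wa)
            for mm in (0, 1):
                sel = m == mm
                eta[a, mm] = np.sum(wa[sel] * y[sel]) / np.sum(wa[sel])
    return pi, delta, eta

# Demo with assumed true parameters; one row per observed cell (a*, m, y).
true_pi, true_delta = 0.6, np.array([0.2, 0.8])
true_eta = np.array([[0.2, 0.4], [0.6, 0.8]])            # eta[a, m]
sn, sp = 0.8, 0.9
cells = [(s, mm, yy) for s in (0, 1) for mm in (0, 1) for yy in (0, 1)]
cell_probs = []
for s, mm, yy in cells:
    prob = 0.0
    for a in (0, 1):
        p_a = true_pi if a else 1 - true_pi
        p_s1 = sn if a else 1 - sp
        p_s = p_s1 if s else 1 - p_s1
        p_m = true_delta[a] if mm else 1 - true_delta[a]
        p_y = true_eta[a, mm] if yy else 1 - true_eta[a, mm]
        prob += p_a * p_s * p_m * p_y
    cell_probs.append(prob)
a_star, m, y = map(np.array, zip(*cells))
pi_hat, delta_hat, eta_hat = em_binary(a_star, m, y, sn, sp, cell_probs, n_iter=2000)

# Plug the corrected parameters into (5) and (6).
nde = (eta_hat[1, 1] - eta_hat[0, 1]) * delta_hat[0] + (eta_hat[1, 0] - eta_hat[0, 0]) * (1 - delta_hat[0])
nie = (eta_hat[1, 1] - eta_hat[1, 0]) * (delta_hat[1] - delta_hat[0])
print(pi_hat, delta_hat, eta_hat, nde, nie)
```

With covariates, the weighted proportions would be replaced by weighted logistic (or linear) regressions in which each subject contributes two rows, one for each possible value of the true exposure, with weights $h_{i0}$ and $h_{i1}$.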

For a continuous mediator, our correction approach is also applicable. We can replace $\ell_{M \mid A, C}(\beta \mid M_i, A_i, C_i)$ with the log-likelihood function for the linear model and follow the same procedure to obtain the corrected natural direct and indirect effects.

5 A simulation study

We conducted a simulation study to evaluate the finite sample performance of our correction approach for binary mediators. We also conducted a simulation study for continuous mediators in Appendix S2.1. Denote $\mathrm{expit}(x) = 1/\{1 + \exp(-x)\}$. We did not consider covariates in the simulation study. First, we generated $A$ from a Bernoulli distribution with probability $\mathrm{expit}(\gamma)$, and generated $A^*$ from Bernoulli distributions with conditional probabilities $P(A^*=1 \mid A=1) = SN$ and $P(A^*=1 \mid A=0) = 1 - SP$. We then generated the mediator and outcome according to the following two scenarios.

  1. Binary M and continuous Y: $M \sim \mathrm{Bernoulli}(\mathrm{expit}(\beta_0 + \beta_1 A))$ and $Y \sim N(\theta_0 + \theta_1 A + \theta_2 M + \theta_3 AM, 1)$;

  2. Binary M and Y: $M \sim \mathrm{Bernoulli}(\mathrm{expit}(\beta_0 + \beta_1 A))$ and $Y \sim \mathrm{Bernoulli}(\mathrm{expit}(\theta_0 + \theta_1 A + \theta_2 M + \theta_3 AM))$.

We chose two sets of true values for γ, β and θ to include cases with and without exposure-mediator interaction in the outcome model.

  1. Without interaction: $\gamma = 0.1$, $\beta = (0.1, 0.5)$, $\theta = (0.1, 0.5, 0.5, 0)$;

  2. With interaction: $\gamma = 0.1$, $\beta = (0.1, 0.5)$, $\theta = (0.1, 1, 1, 0.5)$.
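The data-generating process described above can be sketched as follows. This is an illustrative reading of the design, assuming the inverse-logit link for both the mediator and the binary outcome; the function name `simulate`, the seed and the sample size in the demo are hypothetical choices.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n, gamma, beta, theta, sn, sp, binary_y=True, seed=0):
    """Draw (A, A*, M, Y) for one simulated data set without covariates."""
    rng = np.random.default_rng(seed)
    a = rng.binomial(1, expit(gamma), size=n)                 # true exposure
    a_star = rng.binomial(1, np.where(a == 1, sn, 1.0 - sp))  # misclassified exposure
    m = rng.binomial(1, expit(beta[0] + beta[1] * a))         # binary mediator
    lin = theta[0] + theta[1] * a + theta[2] * m + theta[3] * a * m
    if binary_y:
        y = rng.binomial(1, expit(lin))                       # binary outcome
    else:
        y = rng.normal(lin, 1.0)                              # continuous outcome, unit variance
    return a, a_star, m, y

# "Without interaction" parameter set, binary outcome, SN = SP = 0.80.
a, a_star, m, y = simulate(200_000, 0.1, (0.1, 0.5), (0.1, 0.5, 0.5, 0.0), 0.8, 0.8)
emp_sn = a_star[a == 1].mean()          # empirical sensitivity
emp_sp = 1.0 - a_star[a == 0].mean()    # empirical specificity
print(emp_sn, emp_sp)
```

Only `a_star`, `m` and `y` would be passed to an estimator; the true exposure `a` is kept here solely to check the realized sensitivity and specificity.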

We calculated the true values of the NDE and NIE from these values of the parameters. For the scenarios with binary outcomes, we focused on the NDE and NIE on the odds ratio scale instead of the risk difference scale. We chose four levels of sample sizes, N = 200, 1000, 5000, 10000, and three levels of sensitivity parameters: (1) SN=SP=0.80; (2) SN=SP=0.85; (3) SN=SP=0.90. For each case, we calculated the biases and root mean square errors (RMSEs) of the NDE and NIE using our correction approach as well as the naive NDE and NIE.

Table 1 presents the simulation results without exposure-mediator interaction. When the outcome is continuous, the approach can eliminate the bias for a sample size of N = 200. However, when the outcome is binary, a sample size of N = 200 is not enough for the approach to eliminate the bias, and we might need sample sizes as large as N = 5000 to get more accurate estimates. In general, the approach estimates the natural indirect effect with greater accuracy than the natural direct effect. The RMSEs tend to increase as the probability of misclassification increases.

Table 1:

Biases and RMSEs (in parentheses) of the EM algorithm correction approach for binary mediators when sample sizes are N = 200, 1000, 5000, 10000, with three levels of sensitivity parameters in the absence of exposure-mediator interaction. For scenarios with binary outcomes, the NDE and NIE are on the odds ratio scale, and we also give the estimates for the total effect (TE). In the table, “con” stands for continuous and “bin” stands for binary. The user time for 100 replications of all the scenarios is 192.8s on a laptop with a 2.3 GHz Intel Core i7.

Bin M & con Y (true values: NDE = 0.5, NIE = 0.06)

| Setting | Effect | N = 200 | N = 1000 | N = 5000 | N = 10000 |
|---|---|---|---|---|---|
| SP=SN=0.90 | NDE | -0.004 (0.174) | 0.000 (0.081) | 0.002 (0.037) | 0.000 (0.025) |
| | NIE | 0.001 (0.054) | 0.000 (0.024) | 0.000 (0.010) | 0.001 (0.007) |
| SP=SN=0.85 | NDE | 0.000 (0.199) | 0.003 (0.093) | 0.002 (0.040) | 0.002 (0.030) |
| | NIE | -0.004 (0.062) | -0.001 (0.025) | 0.001 (0.012) | 0.001 (0.008) |
| SP=SN=0.80 | NDE | -0.026 (0.231) | -0.007 (0.100) | 0.002 (0.045) | 0.002 (0.033) |
| | NIE | 0.000 (0.077) | 0.001 (0.029) | 0.001 (0.013) | 0.001 (0.009) |
| Naive estimator when SP=SN=0.80 | NDE | -0.198 (0.248) | -0.201 (0.211) | -0.203 (0.205) | -0.203 (0.204) |
| | NIE | -0.019 (0.044) | -0.021 (0.027) | -0.021 (0.023) | -0.021 (0.022) |

Bin M & bin Y (true values on the OR scale: NDE = 1.65, NIE = 1.06, TE = 1.75)

| Setting | Effect | N = 200 | N = 1000 | N = 5000 | N = 10000 |
|---|---|---|---|---|---|
| SP=SN=0.90 | NDE | 0.156 (0.867) | 0.037 (0.313) | 0.015 (0.132) | 0.003 (0.080) |
| | NIE | -0.004 (0.088) | -0.002 (0.031) | 0.000 (0.014) | 0.001 (0.010) |
| | TE | 0.152 (0.925) | 0.037 (0.337) | 0.017 (0.143) | 0.005 (0.098) |
| SP=SN=0.85 | NDE | 0.350 (1.219) | 0.028 (0.342) | 0.010 (0.143) | 0.009 (0.102) |
| | NIE | -0.007 (0.096) | 0.000 (0.035) | 0.000 (0.015) | 0.000 (0.011) |
| | TE | 0.343 (1.213) | 0.030 (0.374) | 0.010 (0.156) | 0.010 (0.111) |
| SP=SN=0.80 | NDE | 0.317 (1.617) | 0.061 (0.402) | 0.012 (0.174) | 0.014 (0.122) |
| | NIE | -0.012 (0.128) | -0.001 (0.042) | 0.002 (0.018) | 0.001 (0.013) |
| | TE | 0.293 (0.889) | 0.063 (0.430) | 0.005 (0.189) | 0.008 (0.111) |
| Naive estimator when SP=SN=0.80 | NDE | -0.233 (0.514) | -0.289 (0.347) | -0.300 (0.310) | -0.302 (0.307) |
| | NIE | -0.022 (0.052) | -0.022 (0.029) | -0.022 (0.023) | -0.021 (0.022) |
| | TE | -0.278 (0.551) | -0.336 (0.390) | -0.346 (0.356) | -0.348 (0.353) |

Table 2 presents the simulation results in the presence of exposure-mediator interaction. The results are similar to those in Table 1 except for two aspects. First, the RMSEs are larger than in the case without exposure-mediator interaction under the same setting. Second, when the mediator and outcome are both binary, the approach may not be able to obtain an accurate estimate for the direct effect even for a sample size of N = 5000. The RMSEs in this case are still very large even when the sample size goes to N = 10000. However, when the sample size is above 100,000, our correction approach performs well even with binary mediator and outcome (results shown in Appendix S2.2).

Table 2:

Biases and RMSEs (in parentheses) of the EM algorithm correction approach for binary mediators when sample sizes are N = 200, 1000, 5000, 10000, with three levels of sensitivity parameters in the presence of exposure-mediator interaction. For binary outcomes, the NDE and NIE are on the odds ratio scale, and we also give the results for the total effect (TE). In the table, “con” stands for continuous and “bin” stands for binary. The user time for 100 replications of all the scenarios is 103.4s on a laptop with a 2.3 GHz Intel Core i7.

Bin M & con Y (true values: NDE = 1.26, NIE = 0.18)

| Setting | Effect | N = 200 | N = 1000 | N = 5000 | N = 10000 |
|---|---|---|---|---|---|
| SP=SN=0.90 | NDE | -0.001 (0.177) | 0.001 (0.075) | 0.001 (0.034) | 0.001 (0.025) |
| | NIE | -0.005 (0.131) | -0.005 (0.055) | -0.001 (0.025) | -0.001 (0.018) |
| SP=SN=0.85 | NDE | -0.013 (0.192) | 0.003 (0.085) | 0.000 (0.036) | 0.001 (0.026) |
| | NIE | -0.001 (0.142) | -0.004 (0.066) | -0.004 (0.028) | -0.004 (0.021) |
| SP=SN=0.80 | NDE | -0.016 (0.210) | -0.004 (0.089) | 0.001 (0.040) | 0.002 (0.028) |
| | NIE | -0.022 (0.160) | -0.009 (0.071) | -0.008 (0.032) | -0.006 (0.023) |
| Naive estimator when SP=SN=0.80 | NDE | -0.500 (0.526) | -0.509 (0.514) | -0.506 (0.507) | -0.506 (0.507) |
| | NIE | -0.073 (0.126) | -0.073 (0.086) | -0.072 (0.075) | -0.072 (0.073) |

Bin M & bin Y (true values on the OR scale: NDE = 4.04, NIE = 1.15, TE = 4.64)

| Setting | Effect | N = 200 | N = 1000 | N = 5000 | N = 10000 |
|---|---|---|---|---|---|
| SP=SN=0.90 | NDE | 0.281 (4.485) | 0.331 (1.430) | -0.002 (0.491) | -0.015 (0.346) |
| | NIE | -0.008 (0.127) | 0.002 (0.056) | 0.002 (0.025) | 0.000 (0.018) |
| | TE | 0.280 (6.406) | 0.402 (1.706) | 0.008 (0.590) | -0.015 (0.417) |
| SP=SN=0.85 | NDE | 0.117 (4.721) | 0.411 (1.846) | 0.003 (0.579) | -0.002 (0.411) |
| | NIE | 0.004 (0.170) | 0.001 (0.064) | 0.002 (0.028) | 0.001 (0.019) |
| | TE | 0.173 (7.303) | 0.497 (2.205) | 0.015 (0.703) | 0.002 (0.494) |
| SP=SN=0.80 | NDE | -0.155 (5.206) | 0.515 (2.408) | 0.047 (0.706) | -0.015 (0.462) |
| | NIE | -0.007 (0.198) | 0.000 (0.074) | 0.002 (0.033) | 0.001 (0.023) |
| | TE | -0.185 (7.883) | 0.612 (2.879) | 0.066 (0.850) | -0.011 (0.555) |
| Naive estimator when SP=SN=0.80 | NDE | -1.392 (3.737) | -1.730 (1.796) | -1.771 (1.783) | -1.786 (1.792) |
| | NIE | -0.069 (0.114) | -0.068 (0.078) | -0.067 (0.069) | -0.067 (0.068) |
| | TE | -1.772 (5.385) | -2.142 (2.210) | -2.186 (2.198) | -2.202 (2.208) |

We then compared our methods with two widely used approaches, method of moments and regression calibration (Carroll et al. 2012; Fuller 2009). The method of moments is based on the asymptotic biases of the estimated coefficients from the regression models using the observed exposure. By solving the system of equations that arises from the asymptotic biases, we can obtain the method of moments estimators of β and θ. To obtain the regression calibration estimators, we substituted the observed exposure in the naive mediator and outcome regression with the best linear predictor of the latent true exposure given A*, M and C. After obtaining the estimators of β and θ, we then calculated the NDE and the NIE. Table 3 presents the simulation results of our approach and regression calibration for cases with binary mediators and outcomes. Because we cannot obtain the asymptotic biases for the coefficients in logistic regression models under misclassification, we do not give the result for the method of moments in Table 3. In Table S3 of the appendix, we show the comparison of our approach with the method of moments for cases with continuous mediators and outcomes. From these results, we can see that our approach is better than the other two approaches in all the cases.

Table 3:

Biases and RMSEs (in parentheses) of EM algorithm and regression calibration for binary (bin) mediators and outcomes with and without interaction (int) when SN=SP=0.8.

No int & bin M & bin Y (true values on the OR scale: NDE = 1.65, NIE = 1.06)

| Method | Effect | N = 200 | N = 1000 | N = 5000 | N = 10000 |
|---|---|---|---|---|---|
| EM correction | NDE | 0.317 (1.617) | 0.061 (0.402) | 0.012 (0.174) | 0.014 (0.122) |
| | NIE | -0.012 (0.128) | -0.001 (0.042) | 0.002 (0.018) | 0.001 (0.013) |
| Naive estimator | NDE | -0.233 (0.514) | -0.289 (0.347) | -0.300 (0.310) | -0.302 (0.307) |
| | NIE | -0.022 (0.052) | -0.022 (0.029) | -0.022 (0.023) | -0.021 (0.022) |
| Regression calibration | NDE | 0.175 (1.040) | -0.003 (0.387) | -0.055 (0.177) | -0.062 (0.130) |
| | NIE | 0.006 (0.159) | 0.021 (0.058) | 0.029 (0.036) | 0.028 (0.032) |

Int & bin M & bin Y (true values on the OR scale: NDE = 4.04, NIE = 1.15)

| Method | Effect | N = 200 | N = 1000 | N = 5000 | N = 10000 |
|---|---|---|---|---|---|
| EM correction | NDE | -0.155 (5.206) | 0.515 (2.408) | 0.047 (0.706) | -0.015 (0.462) |
| | NIE | -0.007 (0.198) | 0.000 (0.074) | 0.002 (0.033) | 0.001 (0.023) |
| Naive estimator | NDE | -1.392 (3.737) | -1.730 (1.796) | -1.771 (1.783) | -1.786 (1.792) |
| | NIE | -0.069 (0.114) | -0.068 (0.078) | -0.067 (0.069) | -0.067 (0.068) |
| Regression calibration | NDE | 0.120 (5.166) | 0.503 (3.451) | 0.617 (1.763) | 0.425 (0.984) |
| | NIE | -0.378 (0.585) | -0.247 (0.382) | -0.173 (0.220) | -0.163 (0.192) |

6 Example

We apply our correction approach to a perinatal epidemiological study using National Center for Health Statistics (NCHS) birth certificate data of all US births in 2003. The causal mechanisms by which the smoking status is related to preterm birth are not entirely known and a potential intermediate variable is pre-eclampsia. However, because the smoking status of a person is self-reported, it is typically subject to misclassification.

We carry out mediation analysis to calculate the natural indirect effect of smoking status on preterm birth, mediated by pre-eclampsia, and also the natural direct effect through other pathways independent of pre-eclampsia. The total sample size is 3,918,542. Because calculating the weights in the E-step is computationally intensive, we use a random sample of 500,000 individuals and adjust for the variables which may confound the smoking and preterm birth relationship, the smoking and pre-eclampsia relationship, and the pre-eclampsia and preterm birth relationship. We assume that there are no confounders of the pre-eclampsia and preterm birth relationship that are affected by smoking. As potential confounders $C$, we include mother’s ethnicity, age, marital status, drinking status and education level. In addition to the smoking status, misclassification of other variables such as drinking status and pre-eclampsia is also possible in our data. Some methods in the previous literature can be applied to deal with the misclassification of other variables (VanderWeele et al. 2012; Carroll et al. 2012). However, because our theory concerns exposure misclassification, we focus only on the misclassification of smoking status in our example and do not consider misclassification of other variables. The data analysis is intended only for illustrative purposes to give an example of the exposure misclassification results.

The traditional non-differential misclassification assumption is not plausible in the data because the misclassification probability may depend on the covariates. For example, the education level of a mother is probably related to whether she would misreport her smoking status. However, after controlling for the covariates, it is plausible that the misclassification probability is independent of the mediator and the outcome.

We use the following models to estimate the NDE and NIE,

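$\mathrm{logit}\{P(A=1 \mid C=c)\} = \gamma^{\top} c,$
$\mathrm{logit}\{P(M=1 \mid A=a, C=c)\} = \beta_0 + \beta_1 a + \beta_2^{\top} c,$
$\mathrm{logit}\{P(Y=1 \mid A=a, M=m, C=c)\} = \theta_0 + \theta_1 a + \theta_2 m + \theta_3 am + \theta_4^{\top} c.$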

The model of Y allows for the interaction of A and M. We calculate the NDE and NIE by first conditioning on c and then averaging over c. If we ignore the misclassification and treat the observed exposure as the true exposure, the estimated natural direct effect on the odds ratio scale is 1.206 (95 % CI: (1.176,1.251)), and the estimated natural indirect effect is 0.988 (95 % CI: (0.985,0.992)). The naive estimators suggest a small protective effect through pre-eclampsia but a much larger detrimental direct effect through other pathways.

Next, we use our correction approach to estimating the natural direct and indirect effects under different values of sensitivity and specificity parameters. Generally, mothers are more likely to misreport their smoking status when they are smoking and are less likely to misreport when not smoking (Boyd et al. 1998) (i. e. high specificity but low sensitivity). In addition, it is reasonable to assume SNc1SPc, which means that the probability of reporting smoking when a mother is actually smoking is no less than that when a mother is not smoking. In our analysis, for simplicity, we set the values of SNc and SPc to be independent of c, i. e. SNc=SN and SPc=SP. We can also choose the values of SNc and SPc to depend on c, but for the purposes of this analysis, since we do not have prior knowledge about the relations of the missing probabilities in different subpopulations, we thus do not consider this in our analysis. From the conditional non-differential misclassification assumption and the law of total probability, we have

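$$
\begin{aligned}
P(A^*=1\mid M,Y,C=c) &= P(A^*=1\mid A=1,C=c)\,P(A=1\mid M,Y,C=c) + P(A^*=1\mid A=0,C=c)\,P(A=0\mid M,Y,C=c)\\
&= SN\cdot P(A=1\mid M,Y,C=c) + (1-SP)\{1-P(A=1\mid M,Y,C=c)\}\\
&= (SN+SP-1)\,P(A=1\mid M,Y,C=c) + 1 - SP\\
&\ge 1-SP,
\end{aligned}
$$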

which leads to SP ≥ 1 − P(A*=1 | M, Y, C=c), where A* denotes the observed (possibly misclassified) exposure. From the observed data, we find that P(A*=1 | M=m, Y=y, C=c) ≤ 0.03 in some subpopulations. Thus, the specificity should be at least 0.97, and we choose two levels for SP, 0.97 and 0.99, and vary SN from 0.7 to 1. Note that we may have different lower bounds on the sensitivity parameter if we allow for different specificity parameters for different subpopulations.
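The bound and the linear relation it comes from can be sketched in a few lines. The first function below returns the tightest lower bound on SP implied by a set of observed cell probabilities of reported exposure, and the second inverts the relation between observed and true exposure probabilities for given SN and SP. The function names are hypothetical, and any cell probabilities supplied to them are illustrative inputs rather than values from the NCHS data.

```python
def specificity_lower_bound(observed_probs):
    """Tightest bound implied by SP >= 1 - P(A*=1 | M, Y, C=c) over all cells.

    observed_probs: iterable of observed P(A*=1 | M=m, Y=y, C=c), one per cell.
    """
    return 1.0 - min(observed_probs)


def true_exposure_prob(p_obs, sn, sp):
    """Solve p_obs = (SN + SP - 1) * p_true + (1 - SP) for p_true."""
    denom = sn + sp - 1.0
    if denom <= 0.0:
        raise ValueError("requires SN + SP > 1")
    return (p_obs - (1.0 - sp)) / denom
```

For example, if the smallest observed cell probability is 0.03, `specificity_lower_bound` returns 0.97 up to floating-point rounding, which matches the lower value of SP used in the sensitivity analysis.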

Figure 2 shows the natural direct and indirect effects on the odds ratio scale when SP=0.97 and SP=0.99. As SN increases from 0.7 to 1, the estimate of the natural direct effect decreases and the estimate of the natural indirect effect increases. However, in both plots, the corrected estimates do not change substantially. Because of the large sample size of 500,000, a full-sample bootstrap would be very time consuming. We therefore use the subsample bootstrap (Dimitris and Joseph 1994) to obtain confidence intervals for the following four cases: (1) SN=0.7, SP=0.97; (2) SN=1, SP=0.97; (3) SN=0.7, SP=0.99; (4) SN=1, SP=0.99. Table 4 displays the results of the subsample bootstrap. The results depend on the values of the sensitivity and specificity parameters considered. However, in all four cases, the natural direct effect is detrimental and the natural indirect effect on the odds ratio scale is protective but small. The effect of smoking status on preterm birth is primarily through pathways other than pre-eclampsia status. The conclusion seems relatively robust to misclassification.
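As an illustration of the general idea behind subsampling-based intervals, the sketch below draws subsamples of size b < n without replacement, recomputes the estimator on each, and rescales the centered subsample estimates to the full-sample rate. It assumes a root-n convergence rate and a scalar estimator; the function names, the default number of replications, and the choice of b are hypothetical and are not taken from the analysis reported here.

```python
import math
import random


def _quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def subsample_ci(data, estimator, b, n_rep=500, alpha=0.05, seed=0):
    """Subsampling confidence interval for a scalar estimator.

    data      : list of observations
    estimator : function mapping a list of observations to a float
    b         : subsample size, with 0 < b < len(data)
    Assumes the estimator converges at rate sqrt(n).
    """
    n = len(data)
    if not 0 < b < n:
        raise ValueError("subsample size b must satisfy 0 < b < n")
    rng = random.Random(seed)
    theta_n = estimator(data)
    roots = []
    for _ in range(n_rep):
        sub = rng.sample(data, b)  # draw without replacement
        roots.append(math.sqrt(b) * (estimator(sub) - theta_n))
    roots.sort()
    q_lo = _quantile(roots, alpha / 2.0)
    q_hi = _quantile(roots, 1.0 - alpha / 2.0)
    # Rescale the subsample roots to the full-sample rate.
    return theta_n - q_hi / math.sqrt(n), theta_n - q_lo / math.sqrt(n)
```

In an application like the one above, `estimator` would wrap the whole correction procedure and return a single effect estimate, so each replication costs one fit on b rather than n observations.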

Figure 2: Sensitivity analysis for the NCHS birth certificate data. The effects are on the odds ratio scale. The solid lines are the effects when SP = 0.97 and the dotted lines are the effects when SP = 0.99.

Table 4: Estimates and confidence intervals (in parentheses) of the natural direct and indirect effects (on the odds ratio scale) for the NCHS birth certificate data under different values of sensitivity and specificity parameters.

| | NDE | NIE |
|---|---|---|
| SN=0.7, SP=0.97 | 1.252 (1.202, 1.334) | 0.981 (0.975, 0.986) |
| SN=1.0, SP=0.97 | 1.237 (1.197, 1.303) | 0.982 (0.976, 0.986) |
| SN=0.7, SP=0.99 | 1.230 (1.191, 1.278) | 0.996 (0.989, 0.997) |
| SN=1.0, SP=0.99 | 1.218 (1.183, 1.272) | 0.986 (0.981, 0.990) |

7 Discussion

In this paper, we study the problem of misclassification of a binary exposure in the context of causal mediation analysis with a binary mediator, allowing for the presence of exposure-mediator interaction. We show that when conditional non-differential misclassification of a binary exposure is ignored, in the absence of additive exposure-mediator interaction, the naive estimator of the natural direct effect is biased towards the null. We derive formulas for the biases of the naive estimators of the natural direct and indirect effects when the mediator is binary.

Our results require that there are no post-exposure mediator-outcome confounders. However, this assumption is often violated in applied mediation settings. Without this assumption, the randomized interventional direct and indirect effects can be identified by (1) and (2) under the confounding assumptions (i)–(iii) in Section 2 (Didelez et al. 2006; VanderWeele 2015). Therefore, our approach can be generalized to these randomized interventional effects. However, if there are mediator-outcome confounders affected by the exposure, then identification formulas for the interventional effects are different and thus this would require additional technical development when assessing the consequences of misclassification.

We propose a correction approach based on the EM algorithm as a possible strategy for correcting misclassification. The correction strategy works well for moderately large sample sizes except for the direct effect when both the mediator and outcome are binary and there is exposure-mediator interaction. In this setting, much larger sample sizes are needed. Applying the approach to the NCHS birth certificate data, we find that smoking status affects preterm birth principally through pathways other than pre-eclampsia, but there may be a small protective effect mediated by pre-eclampsia. Our conclusion seems fairly robust to the presence of misclassification in the exposure.

In our bias correction approach, we treat the true binary exposure as missing data and work with the mixture model likelihood. A general technique for finding maximum likelihood estimators in these models is the EM algorithm. Because we are dealing with a low-dimensional problem, in the sense that we have only two latent clusters, the convergence of the EM algorithm is not very slow. In the simulation study, we compared our correction approach with two alternative approaches, the method of moments and regression calibration, which do not require iteration. Our approach can yield more accurate and precise estimates but is more computationally intensive than the other two approaches. In addition, the method of moments estimator does not require the model for P(A=1 | C=c), which makes it more robust.
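To illustrate what treating the true exposure as missing data involves, here is a minimal sketch of an E-step for a single observation: the posterior probability that the true exposure is 1 given the reported exposure, mediator, outcome, and covariate, under the three logistic models and conditional non-differential misclassification. It assumes a scalar covariate c and an exposure model with an explicit intercept, gamma = (gamma0, gamma1); the function names are hypothetical, and this is not necessarily how the authors implemented the algorithm. The M-step, which would refit the three models by weighted logistic regression using these posterior weights, is omitted.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_pmf(p, x):
    """P(X = x) for a Bernoulli(p) variable, x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def e_step_weight(a_obs, m, y, c, gamma, beta, theta, sn, sp):
    """Posterior P(A = 1 | A* = a_obs, M = m, Y = y, C = c).

    gamma : (gamma0, gamma1) for the exposure model (intercept is an assumption)
    beta  : (beta0, beta1, beta2) for the mediator model
    theta : (theta0, theta1, theta2, theta3, theta4) for the outcome model
    sn, sp: assumed sensitivity and specificity of the reported exposure
    """
    g0, g1 = gamma
    b0, b1, b2 = beta
    t0, t1, t2, t3, t4 = theta
    joint = {}
    for a in (0, 1):
        p_a = bernoulli_pmf(expit(g0 + g1 * c), a)
        # P(A* = 1 | A = a) is SN when a = 1 and 1 - SP when a = 0.
        p_report = bernoulli_pmf(sn if a == 1 else 1.0 - sp, a_obs)
        p_m = bernoulli_pmf(expit(b0 + b1 * a + b2 * c), m)
        p_y = bernoulli_pmf(expit(t0 + t1 * a + t2 * m + t3 * a * m + t4 * c), y)
        joint[a] = p_a * p_report * p_m * p_y
    return joint[1] / (joint[0] + joint[1])
```

When SN = SP = 1 the weight reduces to the reported exposure itself, so the procedure collapses to the naive analysis, which is a useful check on an implementation.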

Although our theory allows the misclassification probabilities to depend on the covariates, some practical issues may arise when applying our correction approach. First, we will often not have subject-matter knowledge about how the misclassification probabilities depend on the covariates. Second, allowing the misclassification probabilities to depend on the covariates may produce too many sensitivity and specificity parameters to interpret easily. Therefore, when there is no subject-matter knowledge about the misclassification, it may be easier to use the same value for the misclassification probabilities to gain at least some understanding of the potential role of misclassification.

Funding statement: This work was supported by the National Institute of Environmental Health Sciences (Funder Id: http://dx.doi.org/10.13039/100000066, Grant Number: ES017876).

Acknowledgment

We thank the Associate Editor and two reviewers for helpful comments.

References

Aigner, D. J. (1973). Regression with a binary independent variable subject to errors of observation. Journal of Econometrics, 1:49–59.

Baron, R. M., and Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51:1173–1182.

Boyd, N. R., Windsor, R. A., Perkins, L. L., and Lowe, J. B. (1998). Quality of measurement of smoking status by self-report and saliva cotinine among pregnant women. Maternal and Child Health Journal, 2:77–83.

Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2012). Measurement Error in Nonlinear Models: A Modern Perspective. Boca Raton, Florida: Chapman and Hall/CRC Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39:1–38.

Didelez, V., Dawid, P., and Geneletti, S. (2006). Direct and indirect effects of sequential decisions. In: Proceedings of the 22nd Annual Conference for Uncertainty in Artificial Intelligence, 138–146. Arlington, VA: AUAI Press.

Dimitris, N. P., and Joseph, P. R. (1994). Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22:2031–2050.

Freeman, R. B. (1983). Longitudinal analyses of the effects of trade unions. Journal of Labor Economics, 2:1–26.

Fuller, W. A. (2009). Measurement Error Models, volume 305. New York: John Wiley & Sons.

Hoyle, R. H., and Kenny, D. A. A. (1999). Sample size, reliability, and tests of statistical mediation. Statistical Strategies for Small Sample Research, 1:195–222.

Imai, K., Keele, L., and Tingley, D. (2010). A general approach to causal mediation analysis. Psychological Methods, 15:309–334.

Jiang, Z., Ding, P., and Geng, Z. (2015). Qualitative evaluation of associations by the transitivity of the association signs. Statistica Sinica, 25:1065–1079.

le Cessie, S., Debeij, J., Rosendaal, F. R., Cannegieter, S. C., and Vandenbroucke, J. P. (2012). Quantification of bias in direct effects estimates due to different types of measurement error in the mediator. Epidemiology, 23:551–560.

Pearl, J. (2001). Direct and indirect effects. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 411–420. San Francisco, CA: Morgan Kaufmann.

Robins, J. M., and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3:143–155.

Savoca, E. (2000). Measurement errors in binary regressors: An application to measuring the effects of specific psychiatric diseases on earnings. Health Services and Outcomes Research Methodology, 1:149–164.

Valeri, L., Lin, X., and VanderWeele, T. J. (2014). Mediation analysis when a continuous mediator is measured with error and the outcome follows a generalized linear model. Statistics in Medicine, 33:4875–4890.

Valeri, L., and VanderWeele, T. J. (2014). The estimation of direct and indirect causal effects in the presence of misclassified binary mediator. Biostatistics, 15:498–512.

VanderWeele, T. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. New York: Oxford University Press.

VanderWeele, T. J., Valeri, L., and Ogburn, E. L. (2012). The role of measurement error and misclassification in mediation analysis. Epidemiology, 23:561–564.

VanderWeele, T. J., and Vansteelandt, S. (2009). Conceptual issues concerning mediation, interventions and composition. Statistics and Its Interface, 2:457–468.

VanderWeele, T. J., and Vansteelandt, S. (2010). Odds ratios for mediation analysis for a dichotomous outcome. American Journal of Epidemiology, 172:1339–1348.

Supplementary Material

The online version of this article offers supplementary material (DOI:https://doi.org/10.1515/em-2016-0006).

Received: 2016-03-02
Revised: 2019-08-10
Accepted: 2019-10-31
Published Online: 2019-11-29

© 2019 Walter de Gruyter GmbH, Berlin/Boston