On the bias of adjusting for a non-differentially mismeasured discrete confounder

Abstract: Biological and epidemiological phenomena are often measured with error or imperfectly captured in data. When the true state of this imperfect measure is a confounder of an outcome-exposure relationship of interest, it was previously widely believed that adjustment for the mismeasured observed variables provides a less biased estimate of the true average causal effect than not adjusting. However, this is not always the case and depends on both the nature of the measurement and confounding. We describe two sets of conditions under which adjusting for a non-differentially mismeasured proxy comes closer to the unidentifiable true average causal effect than the unadjusted or crude estimate. The first set of conditions applies when the exposure is discrete or continuous and the confounder is ordinal, and the expectation of the outcome is monotonic in the confounder for both treatment levels contrasted. The second set of conditions applies when the exposure and the confounder are categorical (nominal). In all settings, the mismeasurement must be non-differential, as differential mismeasurement, particularly with an unknown pattern, can cause unpredictable results.


Introduction
In observational studies, it is often of interest to estimate the average causal effect of an exposure on an outcome when this relationship is confounded. The confounder may be non-differentially mismeasured, resulting in the observation of a proxy. Non-differential mismeasurement means that the proxy is conditionally independent of the exposure and the outcome given the true confounder.
For binary true confounder and proxy, Greenland [1] argued that adjusting for the proxy produces a partially adjusted measure of the average causal effect of the exposure on the outcome that is between the crude (i.e., unadjusted) and the true (i.e., adjusted for the true confounder) measures. Thus, the partially adjusted measure comes closer to the unidentifiable true measure than the crude one. Ogburn and VanderWeele [2] showed that this result does not always hold. For binary exposure, true confounder, and proxy, they also showed that the result holds if the conditional expectation of the outcome is monotonic in the true confounder, i.e., it is either non-decreasing or non-increasing in the confounder for both exposure levels contrasted. Unfortunately, the condition cannot be tested explicitly because the true confounder is unobserved. Ogburn and VanderWeele [3] extended these results to binary exposure and ordinal true confounder and proxy, still under the monotonicity assumption. For binary exposure, true confounder, and proxy, Peña [4] showed that monotonicity in the true confounder holds if monotonicity in the proxy exists, which can be tested empirically.
Although monotonicity may seem like a strong condition, Ogburn and VanderWeele [2] argued that it is likely to hold in most applications in epidemiology. However, it is not difficult to come up with examples that are not monotonic. Ogburn and VanderWeele [2] gave an example of the causal effect of type 2 diabetes on hypertension when confounded by the use of thiazide, a drug to treat hypertension. The drug is known to lower the risk of hypertension in the general population, but it is also known to increase blood glucose levels, which increases the risk of hypertension for patients with diabetes. Thus, the drug may have a non-monotonic effect by causing harm in patients with diabetes, while being protective for all other patients. Monotonicity is not a necessary condition for improved partial adjustment. Peña [4] characterized non-monotonic cases for binary exposure, true confounder, and proxy, where the partially adjusted average causal effect is still between the crude and the true ones.
We provide new results for both monotonic and non-monotonic settings. For the monotonic settings, we show that the proof of the result in ref. [3] also applies when the exposure is non-binary under additional assumptions.

Preliminaries
Let A be the exposure of interest, and Y the outcome, which are confounded by the unobserved C. Let D be the non-differentially mismeasured proxy of C. The causal diagram depicting this setting is in the leftmost panel of Figure 1. We use upper-case letters to denote random variables, and the same letters in lower-case to denote fixed (e.g., by observation or by intervention) values. For simplicity, we use p() to denote both probability distributions and density functions.
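We can now more clearly define monotonicity. For binary C, $E[Y \mid a, C]$ is monotonic in C if it is either non-decreasing or non-increasing in C for both exposure levels contrasted.
This is easily extended to ordinal C. Given two exposure levels a and a', if for all levels i < j of C either

$$E[Y \mid a, C = i] \le E[Y \mid a, C = j] \quad \text{and} \quad E[Y \mid a', C = i] \le E[Y \mid a', C = j],$$

or both inequalities hold with $\ge$ instead, then $E[Y \mid a, C]$ is monotonic in C, and likewise for $p(a \mid C)$. For ordinal C and D, Ogburn and VanderWeele [3] introduced the concept of tapered misclassification probabilities, under which $p(D \mid C)$ is said to be a tapered distribution. Intuitively, this means that, for any level i of C or D, the probability of correct classification into level i is equal to or greater than the probability of misclassification into any other level and, moreover, the latter is non-increasing in each direction away from i.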
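The two conditions above can be checked numerically when a candidate model is written down as arrays. The following is a minimal sketch rather than the authors' code: the function names and the array layout (`m[x, c]` for $E[Y \mid x, C = c]$, `p_d_c[c, d]` for $p(D = d \mid C = c)$) are assumptions made for illustration, and `is_tapered` encodes only one reading of the intuitive description of tapering (entries never increase when moving away from the diagonal, by row and by column), not the formal definition in ref. [3].

```python
import numpy as np

def is_monotonic_in_c(m, a, a2, tol=1e-12):
    """m[x, c] = E[Y | x, C=c]. True if rows a and a2 are both non-decreasing
    in c, or both non-increasing in c."""
    steps = np.diff(m[[a, a2]], axis=1)
    return bool(np.all(steps >= -tol) or np.all(steps <= tol))

def tapers_from_diagonal(M, tol=1e-12):
    """True if, in every row i, entries never increase when moving away from column i."""
    for i in range(M.shape[0]):
        right = M[i, i:]      # columns i, i+1, ..., K-1
        left = M[i, i::-1]    # columns i, i-1, ..., 0
        if np.any(np.diff(right) > tol) or np.any(np.diff(left) > tol):
            return False
    return True

def is_tapered(p_d_c):
    """p_d_c[c, d] = p(D=d | C=c). Assumed reading: taper by row (fixed C)
    and by column (fixed D)."""
    return tapers_from_diagonal(p_d_c) and tapers_from_diagonal(p_d_c.T)

# Hypothetical example values.
m = np.array([[0.1, 0.2, 0.4],
              [0.3, 0.3, 0.9]])
p_d_c = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7]])
monotone = is_monotonic_in_c(m, 0, 1)
tapered = is_tapered(p_d_c)
```

Since C is unobserved, such a check applies only to hypothesized models and cannot be run on the observed data themselves.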
Similar to a tapered misclassification, for categorical C and D, $p(D \mid C)$ is a preferential distribution if there is a permutation $\pi_D$ of the levels of D such that

$$p(d \mid c) = \begin{cases} p_D & \text{if } d = \pi_D(c), \\ q_D = (1 - p_D)/(K - 1) & \text{otherwise,} \end{cases}$$

where K is the number of levels of D. Note that $p_D$ and $q_D$ do not depend on either c or d. If $p(d \mid c) = p_D$, then we say that the level c prefers the level d or, equivalently, that the level d is preferred by the level c. We denote such levels of C and D by $c_d$ and $d_c$, respectively. That $p(A \mid C)$ is preferential is defined likewise, with $\pi_A$, $p_A$, $q_A$, $c_a$, and $a_c$ defined accordingly. Preferential distributions resemble tapered distributions, but they differ in that they apply to categorical variables, and the misclassification probabilities are all equal.

For example, say that we are interested in the average causal effect of neighborhood (A) on math achievements (Y) when confounded by parental education (C). Although we do not know the parents' education, we do know their occupation (D). If individuals whose parents have education c prefer to live in the neighborhood $a_c$ and have no preferences for the other neighborhoods, and parents with education c prefer the occupation $d_c$ and have no preferences for the other occupations, then $p(A \mid C)$ and $p(D \mid C)$ are preferential distributions.

When C is sufficient for confounding control (as in the leftmost panel of Figure 1), the true causal average treatment effect, or risk difference, between two exposure levels a and a' is identifiable if C is observed. Let $Y_a$ denote the potential outcome for a given subject, had that subject received exposure level $A = a$. The true causal risk difference is then defined as:

$$RD_{true} = E[Y_a] - E[Y_{a'}] = \sum_c E[Y \mid a, c]\, p(c) - \sum_c E[Y \mid a', c]\, p(c).$$

When C is unobserved, $RD_{true}$ is not identifiable. However, it can be approximated by the unadjusted or crude risk difference:

$$RD_{crude} = E[Y \mid a] - E[Y \mid a'],$$

and by the partially adjusted or observed risk difference:

$$RD_{obs} = \sum_d E[Y \mid a, d]\, p(d) - \sum_d E[Y \mid a', d]\, p(d).$$

We may be interested in standardizing other C- and D-conditional association measures between A and Y than the expectation. For instance, we may be interested in the probability that the potential outcome is greater than a given value. That is, we may be interested in the cumulative distribution function […]
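To make the three risk differences concrete, the sketch below computes them for a fully specified discrete model under the non-differential assumptions of the leftmost graph in Figure 1 (D independent of A and Y given C). It is an illustration under stated assumptions, not the authors' code: the function names, the array layout, and the numerical values are hypothetical. The true quantity standardizes $E[Y \mid a, c]$ over $p(c)$, the crude one averages over $p(c \mid a)$, and the observed one standardizes $E[Y \mid a, d]$ over $p(d)$.

```python
import numpy as np

def standardized_means(p_c, p_a_c, p_d_c, m, a):
    """Return (E[Y_a], E[Y | a], sum_d E[Y | a, d] p(d)) for exposure level a.

    p_c[c] = p(c), p_a_c[c, a] = p(a | c), p_d_c[c, d] = p(d | c),
    m[a, c] = E[Y | a, c]. Assumes D is independent of (A, Y) given C.
    """
    joint = p_c[:, None, None] * p_a_c[:, :, None] * p_d_c[:, None, :]  # p(c, a, d)
    j = joint[:, a, :]                       # p(c, a, d) at the fixed level a
    true = m[a] @ p_c                        # standardize over p(c)
    crude = (m[a] @ j.sum(axis=1)) / j.sum() # average over p(c | a)
    e_y_ad = (m[a] @ j) / j.sum(axis=0)      # E[Y | a, d] for every d
    obs = e_y_ad @ joint.sum(axis=(0, 1))    # standardize over p(d)
    return true, crude, obs

def risk_differences(p_c, p_a_c, p_d_c, m, a, a2):
    """Return (RD_true, RD_crude, RD_obs) contrasting levels a and a2."""
    first = np.array(standardized_means(p_c, p_a_c, p_d_c, m, a))
    second = np.array(standardized_means(p_c, p_a_c, p_d_c, m, a2))
    return tuple(first - second)

# Hypothetical binary example; rows of m are exposure levels, columns are levels of C.
p_c = np.array([0.5, 0.5])
p_a_c = np.array([[0.8, 0.2], [0.3, 0.7]])
p_d_c = np.array([[0.9, 0.1], [0.2, 0.8]])
m = np.array([[0.1, 0.4], [0.3, 0.7]])
rd_true, rd_crude, rd_obs = risk_differences(p_c, p_a_c, p_d_c, m, 1, 0)
```

In this particular example the observed risk difference falls between the true and the crude ones, which is the ordering studied in the remainder of the paper.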

Monotonic settings
We wish to know which of the two approximations comes closer to the unidentifiable true causal quantity. This section answers the question under some assumptions. […] $F_{Y \mid a, C}(y)$ is monotonic in C, because the proof does not depend on the association measure being standardized. That $F_{Y \mid a, C}(y)$ is monotonic in C means that $F_{Y \mid a, C = i}(y) \le F_{Y \mid a, C = j}(y)$ for all i < j, or that the reverse inequality holds for all i < j. When the former inequality applies to any fixed value y, it is known as first-order stochastic dominance of $F_{Y \mid a, C = i}$ over $F_{Y \mid a, C = j}$, and it is then a sufficient (but not necessary) condition for $E[Y \mid a, C = i] \ge E[Y \mid a, C = j]$.

The following theorem answers our original question by showing that $RD_{obs}$ is a better approximation to $RD_{true}$ than $RD_{crude}$. The theorem extends a similar result in ref. [3]. […]

Note that the theorem also allows us to conclude whether $RD_{obs}$ is an upper or a lower bound of the unidentifiable $RD_{true}$. That is, $RD_{obs} \ge RD_{true}$ if and only if $RD_{crude} \ge RD_{obs}$, and the latter inequality is empirically testable. Note the assumption about $p(a' \mid C)$ in the theorem. This is unnecessary in the result in ref. [3] because, when A is binary, the assumption about $p(a \mid C)$ implies the assumption about $p(a' \mid C)$. Although the proofs of both results are similar, we include the proof of our result in the Appendix because the proof in ref. [3] is rather brief and lacks important details, such as the requirement that $p(A, D)$ is strictly positive.
Like Theorem 1, Theorem 2 carries over to differences between standardized conditional CDFs if $F_{Y \mid a, C}(y)$ is monotonic in C, which allows us to sort the true, crude, and observed CDF differences between two exposure levels. Finally, Theorem 2 holds in fact for any effect measure that can be written as […], where g() and h() are monotonic functions. Therefore, the theorem holds for, for instance, the causal risk ratio and causal odds ratio.

Non-monotonic settings
In this section, we do not assume that $E[Y \mid a, C]$ is monotonic in C for the two exposure levels contrasted. Instead, we identify alternative assumptions that still allow us to conclude that $RD_{obs}$ lies between $RD_{true}$ and $RD_{crude}$ and, thus, that $RD_{obs}$ comes closer to the unidentifiable $RD_{true}$ than $RD_{crude}$. Specifically, consider again the causal graph to the left in Figure 1.
Theorem 3. Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels $1, 2, \ldots, K$. Let $p(C)$ be a discrete uniform distribution, and let $p(A \mid C)$ and $p(D \mid C)$ be preferential distributions. Then, for any two exposure levels a and a', $RD_{obs}$ lies between $RD_{true}$ and $RD_{crude}$.
Note that the theorem also allows us to conclude whether $RD_{obs}$ is an upper or a lower bound of the unidentifiable $RD_{true}$. That is, $RD_{obs} \ge RD_{true}$ if and only if $RD_{crude} \ge RD_{obs}$, and the latter inequality is empirically testable.
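The statement of Theorem 3 can be probed numerically. The sketch below is a minimal illustration and not the authors' code: it assumes that a preferential distribution places probability p on the preferred level and (1 − p)/(K − 1) on each of the others, draws the preference probabilities from (0.4, 0.95) (an arbitrary range chosen here to stay away from deterministic relations), draws $E[Y \mid a, c]$ at random without imposing monotonicity, and counts how often $RD_{obs}$ falls outside the interval spanned by $RD_{true}$ and $RD_{crude}$. All function names are hypothetical.

```python
import numpy as np

def preferential(K, p, perm):
    """K x K matrix M[c, x] = p(x | c): p at x = perm[c], (1 - p)/(K - 1) elsewhere."""
    M = np.full((K, K), (1.0 - p) / (K - 1))
    M[np.arange(K), perm] = p
    return M

def risk_differences(p_c, p_a_c, p_d_c, m, a, a2):
    """(RD_true, RD_crude, RD_obs), assuming D is independent of (A, Y) given C."""
    joint = np.einsum("c,ca,cd->cad", p_c, p_a_c, p_d_c)   # p(c, a, d)
    p_d = joint.sum(axis=(0, 1))
    def means(x):
        j = joint[:, x, :]
        return np.array([m[x] @ p_c,                       # E[Y_x]
                         (m[x] @ j.sum(axis=1)) / j.sum(), # E[Y | x]
                         ((m[x] @ j) / j.sum(axis=0)) @ p_d])
    return means(a) - means(a2)

rng = np.random.default_rng(0)
K = 3
violations = 0
for _ in range(1000):
    p_a_c = preferential(K, rng.uniform(0.4, 0.95), rng.permutation(K))
    p_d_c = preferential(K, rng.uniform(0.4, 0.95), rng.permutation(K))
    m = rng.uniform(size=(K, K))           # m[a, c] = E[Y | a, c], not monotonic in general
    rd_true, rd_crude, rd_obs = risk_differences(np.full(K, 1.0 / K), p_a_c, p_d_c, m, 0, 1)
    lo, hi = min(rd_true, rd_crude), max(rd_true, rd_crude)
    violations += not (lo - 1e-12 <= rd_obs <= hi + 1e-12)
```

A small tolerance is used in the comparison because the three quantities can coincide up to floating-point error.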
The following corollary shows that, as expected, the bias of $RD_{obs}$ decreases with increasing quality of D as a proxy of C (except when $p_A = 1$ because, in that case, Y is conditionally independent of D given A and, thus, $RD_{obs}$ reduces to $RD_{crude}$). […]

Counterexamples to Theorem 3 can be constructed when the conditions are violated, via the code provided at https://www.dropbox.com/s/pv2y84f4w38be8i/necessityBinary.R?dl=0 for binary variables, and https://www.dropbox.com/s/5psjqujfqj6v5d9/necessityTernary.R?dl=0 for ternary variables. Note that one must rely on knowledge external to the observed data to verify the conditions in the theorem, since C is unobserved. The following corollary partially alleviates this, since the uniformity of $p(D)$ is empirically testable.
Corollary 5. Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels $1, 2, \ldots, K$. […]

As in the monotonic case, an analogous result to that in Theorem 3 holds for $E[Y \mid a]$, $S_a$, and $E[Y_a]$.
As in the monotonic case, the results above carry over to standardizing other association measures than the conditional expectation, e.g., the conditional CDF. The only difference is that the previous requirement of the standardized measure being monotonic in C is now replaced by the requirements of Theorem 3.

Although we have restricted consideration to the causal graph to the left in Figure 1, these results apply in a greater space of diagrams and settings. For instance, consider the causal graph in the center of Figure 1. Let A, B, and Z be categorical random variables with equal cardinality. Let U be unobserved. The average causal effect of the exposure level a on Y can be computed as

$$E[Y_a] = \sum_z p(z \mid a) \sum_{a'} E[Y \mid a', z]\, p(a') \tag{1}$$

by the front-door criterion [6, Section 3.3.2]. Now, suppose that A is not observed but B is observed. Then, […]. Let $S_a^{crude}$ and $S_a^{obs}$ denote the results of plugging these two approximations in equation (1). Under some conditions, we can conclude that $S_a^{obs}$ comes closer to $E[Y_a]$ than $S_a^{crude}$, and whether $S_a^{obs}$ is an upper or a lower bound of $E[Y_a]$. Specifically, suppose that $p(Z \mid A)$ is known, e.g., from substantive knowledge or from a previous study. Suppose also that $p(A)$ is a discrete uniform distribution, and $p(Z \mid A)$ […] for all z by Corollary 6 and, thus, […]
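For reference, the standard front-door adjustment of ref. [6], $E[Y_a] = \sum_z p(z \mid a) \sum_{a'} E[Y \mid a', z]\, p(a')$, can be evaluated as follows when A, Z, and the conditional means of Y are fully specified. This is a hedged sketch of the textbook formula only; it does not implement the proxy-based approximations $S_a^{crude}$ and $S_a^{obs}$, and the function name, array layout, and numbers are hypothetical.

```python
import numpy as np

def front_door_mean(p_a, p_z_a, e_y_az, a):
    """E[Y_a] by front-door adjustment.

    p_a[x] = p(A=x), p_z_a[x, z] = p(Z=z | A=x), e_y_az[x, z] = E[Y | A=x, Z=z].
    """
    inner = p_a @ e_y_az        # sum over a' of E[Y | a', z] p(a'), one value per z
    return p_z_a[a] @ inner     # sum over z of p(z | a) * inner[z]

# Hypothetical binary example.
p_a = np.array([0.5, 0.5])
p_z_a = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
e_y_az = np.array([[0.1, 0.5],
                   [0.3, 0.9]])
effect = front_door_mean(p_a, p_z_a, e_y_az, 1) - front_door_mean(p_a, p_z_a, e_y_az, 0)
```

When A is unobserved, the inner sum is the part that has to be approximated, which is where a proxy such as B enters.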

Binary exposure and confounder
In this section, we consider binary A, C, and D and replace the previous assumptions that $p(A \mid C)$ and $p(D \mid C)$ are preferential distributions. The new assumptions yield weaker but still useful results. For instance, the following two theorems do not determine the order between $RD_{crude}$ and $RD_{obs}$, but they may determine whether $RD_{true}$ is positive or negative. More specifically, the first result below allows us to conclude that $RD_{true} < 0$ whenever $RD_{crude} < 0$ or $RD_{obs} < 0$.
Theorem 7. Let Y be a discrete or continuous random variable. Let A, C, and D be binary random variables.

Sensitivity to assumption violations
In this section, we study how sensitive the results of Theorem 3 are to small departures from the assumptions. Specifically, we study how the ordering of $RD_{true}$, $RD_{crude}$, and $RD_{obs}$ changes when $p(C)$, $p(A \mid C)$, and $p(D \mid C)$ are replaced by distributions that slightly deviate from the assumptions in the theorem. We use the superscript * to denote the deviating probability distributions and their corresponding causal risk differences.

The following definition formalizes what we mean by probabilities and risk differences deviating from the ones in Theorem 3. We say that a quantity r is an ε-approximation to a quantity s if […]. Note that r is an ε-approximation to s if and only if s is an ε-approximation to r. Note also that r and s must be of the same sign. Therefore, in order for the ε-approximation definition to be well-defined for […] $RD^*_{crude}$ and $RD^*_{obs}$ is the same as the ordering between $RD_{crude}$ and $RD_{obs}$, then the ordering among $RD^*_{true}$, $RD^*_{crude}$, and $RD^*_{obs}$ is the same as the ordering among $RD_{true}$, $RD_{crude}$, and $RD_{obs}$.
We performed simulations to investigate the conclusion above that Theorem 3 is not too sensitive to small departures from the assumptions. Specifically, we let A, C, and D be ternary random variables, and Y be binary. Therefore, these simulations somewhat complement the sensitivity analysis above, in the sense that we do not assume that […]. We consider $p(C) = (1/3, 1/3, 1/3)$, and $p(Y = 1 \mid a, c) \sim \text{Uniform}(0, 1)$ for all a and c. Moreover, we consider […] to avoid almost deterministic relations, and likewise for $p(D \mid C)$. We consider four values for ε in the interval [0.03, 0.15]. For each value of ε, we sample 10,000 sets of distributions […]. Figure 2 shows the percentage of violations. Although the percentage of violations increases with increasing ε, it does not do so abruptly. Thus, the theorem is likely to hold when the conditions are only approximately met. In further simulations, not presented here, we found that the percentage of violations grows linearly up to ε = 1.1, where it plateaus at 22%.
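A simulation of this kind can be sketched as follows. This is not the authors' simulation code, and several ingredients are assumptions made only for illustration: the perturbation scheme (each probability is multiplied by a factor drawn uniformly from [1/(1+ε), 1+ε] and the distribution is then renormalized, so the result is only roughly an ε-sized deviation), the range (0.4, 0.9) for the preference probabilities, the grid of ε values, and the reduced number of 2,000 draws per ε. The sketch reports the fraction of draws in which the perturbed $RD_{obs}$ fails to lie between the perturbed $RD_{true}$ and $RD_{crude}$.

```python
import numpy as np

def preferential(K, p, perm):
    """K x K matrix M[c, x] = p(x | c): p at x = perm[c], (1 - p)/(K - 1) elsewhere."""
    M = np.full((K, K), (1.0 - p) / (K - 1))
    M[np.arange(K), perm] = p
    return M

def perturb(P, eps, rng):
    """Assumed scheme: scale entries by factors in [1/(1+eps), 1+eps], renormalize."""
    Q = P * rng.uniform(1.0 / (1.0 + eps), 1.0 + eps, size=P.shape)
    return Q / Q.sum(axis=-1, keepdims=True)

def risk_differences(p_c, p_a_c, p_d_c, m, a, a2):
    """(RD_true, RD_crude, RD_obs), assuming D is independent of (A, Y) given C."""
    joint = np.einsum("c,ca,cd->cad", p_c, p_a_c, p_d_c)
    p_d = joint.sum(axis=(0, 1))
    def means(x):
        j = joint[:, x, :]
        return np.array([m[x] @ p_c,
                         (m[x] @ j.sum(axis=1)) / j.sum(),
                         ((m[x] @ j) / j.sum(axis=0)) @ p_d])
    return means(a) - means(a2)

def violation_rate(eps, K=3, trials=2000, seed=0):
    """Fraction of random draws in which RD_obs is not between RD_true and RD_crude."""
    rng = np.random.default_rng(seed)
    bad = 0
    for _ in range(trials):
        p_c = perturb(np.full(K, 1.0 / K), eps, rng)
        p_a_c = perturb(preferential(K, rng.uniform(0.4, 0.9), rng.permutation(K)), eps, rng)
        p_d_c = perturb(preferential(K, rng.uniform(0.4, 0.9), rng.permutation(K)), eps, rng)
        m = rng.uniform(size=(K, K))     # p(Y=1 | a, c) for binary Y
        rd_true, rd_crude, rd_obs = risk_differences(p_c, p_a_c, p_d_c, m, 0, 1)
        lo, hi = min(rd_true, rd_crude), max(rd_true, rd_crude)
        bad += not (lo - 1e-12 <= rd_obs <= hi + 1e-12)
    return bad / trials

# Hypothetical grid; eps = 0 reproduces the exact assumptions of Theorem 3.
rates = {eps: violation_rate(eps) for eps in (0.0, 0.03, 0.07, 0.11, 0.15)}
```

Because the perturbation scheme is an assumption, the resulting rates are not expected to reproduce the percentages in Figure 2.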

Discussion
It may seem that adjusting for a proxy of a latent confounder is always superior to not adjusting. Unfortunately, this is true only in some cases. In this article, we have characterized two such cases: one for monotonic settings and the other for non-monotonic settings. For each case, we have described sufficient conditions under which adjusting for a proxy of a latent confounder comes closer to the unidentifiable true average causal effect than not adjusting at all. We have also shown that the result for non-monotonic settings may continue to hold under small violations of the assumptions. Thus, it is likely that our suggested set of assumptions is not the weakest possible set of assumptions that guarantees the result. However, in this same non-monotonic setting, we have argued that the assumptions are not excessive, as we easily found counterexamples where the result did not hold when the assumptions were violated. The assumptions in this work are not empirically testable in most settings, and thus, substantive knowledge is needed to confirm that they hold. It is of note that when A, C, D, and Y are all continuous and follow the linear structural equation model represented by the path diagram to the right in Figure 1, it is possible to construct testable hypotheses for the conditions, as we demonstrate in Appendix E. Whether the assumptions can be replaced by empirically testable assumptions that are still realistic in other settings is an area for future research.
Actually, it suffices to show that […] for all c. Then, equation (2) can be rewritten as […]. Now, note that the second, third, and fourth terms of the left side coincide, respectively, with the third, second, and fourth terms of the right side. Finally, the first term of the left side is equal to or greater than the first term of the right side by the assumption that $p(a \mid C)$ is non-decreasing in C. □

Moreover, we prove below that $\sum_d p(d \mid C) / p(a \mid d)$ is non-increasing in C. Then, because $1 / p(a)$ is constant in C, one of the following cases must hold: equation (5) is positive for all c, or it is negative for all c, or it is negative for all c below some cutoff and positive for all c above the cutoff. In all three cases, equation (4) is minimized by setting $E[Y \mid a, C]$ […] is non-decreasing in C by assumption. This implies that the minimum of equation […] is non-increasing in C. We now argue that $p(a \mid D)$ is non-decreasing in D or, equivalently, that […]. We actually prove that […]

B Proofs for Section 4
Lemma 15. Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels $1, 2, \ldots, K$. Let $p(C)$ be a discrete uniform distribution, and let $p(A \mid C)$ and $p(D \mid C)$ be preferential distributions. Then, $p(A)$ and $p(D)$ are discrete uniform distributions.
Proof. We prove that $p(A)$ is uniform by considering any two levels of A, say a and a', and noting that […] which is a contradiction.
[…] where the sixth equality follows from the assumption that $p(C)$ is uniform.
[…] for any $c \ne c_a$ and $c \ne c_{a'}$, and […] where the third equality follows from the fact that Y and D are conditionally independent given A and C due to the causal graph under consideration. Then, […]