
On the bias of adjusting for a non-differentially mismeasured discrete confounder

Jose M. Peña, Sourabh Balgi, Arvid Sjölander and Erin E. Gabriel

Abstract

Biological and epidemiological phenomena are often measured with error or imperfectly captured in data. When the true state of this imperfect measure is a confounder of an exposure-outcome relationship of interest, it was previously widely believed that adjusting for the mismeasured observed variable provides a less biased estimate of the true average causal effect than not adjusting. However, this is not always the case, and the answer depends on both the nature of the measurement error and the nature of the confounding. We describe two sets of conditions under which adjusting for a non-differentially mismeasured proxy comes closer to the unidentifiable true average causal effect than the unadjusted or crude estimate. The first set of conditions applies when the exposure is discrete or continuous and the confounder is ordinal, and the expectation of the outcome is monotonic in the confounder for both treatment levels contrasted. The second set of conditions applies when the exposure and the confounder are categorical (nominal). In all settings, the mismeasurement must be non-differential, as differential mismeasurement, particularly with an unknown pattern, can cause unpredictable results.

MSC 2010: 62D20

1 Introduction

In observational studies, it is often of interest to estimate the average causal effect of an exposure on an outcome when this relationship is confounded. The confounder may be non-differentially mismeasured, resulting in the observation of a proxy. Non-differential mismeasurement means that the proxy is conditionally independent of the exposure and the outcome given the true confounder.

For binary true confounder and proxy, Greenland [1] argued that adjusting for the proxy produces a partially adjusted measure of the average causal effect of the exposure on the outcome that is between the crude (i.e., unadjusted) and the true (i.e., adjusted for the true confounder) measures. Thus, the partially adjusted measure comes closer to the unidentifiable true measure than the crude one. Ogburn and VanderWeele [2] showed that this result does not always hold. For binary exposure, true confounder, and proxy, they also showed that the result holds if the conditional expectation of the outcome is monotonic in the true confounder, i.e., it is either non-decreasing or non-increasing in the confounder for both exposure levels contrasted. Unfortunately, the condition cannot be tested explicitly because the true confounder is unobserved. Ogburn and VanderWeele [3] extended these results to binary exposure and ordinal true confounder and proxy, still under the monotonicity assumption. For binary exposure, true confounder, and proxy, Peña [4] showed that monotonicity in the true confounder holds if monotonicity in the proxy exists, which can be tested empirically.

Although monotonicity may seem like a strong condition, Ogburn and VanderWeele [2] argued that it is likely to hold in most applications in epidemiology. However, it is not difficult to come up with examples that are not monotonic. Ogburn and VanderWeele [2] gave an example of the causal effect of type 2 diabetes on hypertension when confounded by the use of thiazide, a drug to treat hypertension. The drug is known to lower the risk of hypertension in the general population, but it is also known to increase blood glucose levels, thereby increasing the risk of hypertension for patients with diabetes. Thus, the drug may have a non-monotone effect by causing harm in patients with diabetes, while being protective for all other patients. Monotonicity is not a necessary condition for improved partial adjustment. Peña [4] characterized non-monotonic cases for binary exposure, true confounder, and proxy, where the partially adjusted average causal effect is still between the crude and the true ones.

We provide new results for both monotonic and non-monotonic settings. For the monotonic settings, we show that the proof of the result in ref. [3] also applies when the exposure is non-binary under additional assumptions.

2 Preliminaries

Let A be the exposure of interest and Y the outcome, which are confounded by the unobserved C. Let D be the non-differentially mismeasured proxy of C, so that (A, Y) ⊥ D | C. The causal diagram depicting this setting is in the leftmost panel of Figure 1. We use upper-case letters to denote random variables, and the same letters in lower-case to denote fixed (e.g., by observation or by intervention) values. For simplicity, we use p(·) to denote both probability distributions and density functions.


Figure 1

Causal graphs where C and U are unobserved.

We can now more clearly define monotonicity. E[Y | A, C] is monotone in a binary confounder C for the exposure levels a and ā if

  • E[Y | A = a, C = 1] ≥ E[Y | A = a, C = 0] for a ∈ {a, ā}, or

  • E[Y | A = a, C = 1] ≤ E[Y | A = a, C = 0] for a ∈ {a, ā}.

This is easily extended to ordinal C. Given exposure levels a and ā, and for all levels i < j of C, if:

  • E[Y | A = a, C = i] ≤ E[Y | A = a, C = j] for a ∈ {a, ā}, or

  • E[Y | A = a, C = i] ≥ E[Y | A = a, C = j] for a ∈ {a, ā},

then E[Y | A, C] is monotonic in C, and likewise for p(a | C).

For ordinal C and D, Ogburn and VanderWeele [3] introduced the concept of tapered misclassification probabilities: If p(D = i | C = j) ≤ p(D = i | C = k) and p(D = j | C = i) ≤ p(D = k | C = i) for all j < k ≤ i, and p(D = i | C = j) ≥ p(D = i | C = k) and p(D = j | C = i) ≥ p(D = k | C = i) for all i ≤ j < k, then p(D | C) is said to be a tapered distribution. Intuitively, this means that, for any level i of C or D, the probability of correct classification into level i is equal to or greater than the probability of misclassification into any other level and, moreover, the latter is non-increasing in each direction away from i.
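For concreteness, the following minimal Python sketch (illustrative only, not part of the article's supplementary code) checks the two tapered conditions for a matrix whose rows are p(D | C = c):

```python
import numpy as np

def is_tapered(P):
    """Check whether P[c, d] = p(D = d | C = c) is tapered:
    p(D=i|C=j) <= p(D=i|C=k) and p(D=j|C=i) <= p(D=k|C=i) for all j < k <= i, and
    p(D=i|C=j) >= p(D=i|C=k) and p(D=j|C=i) >= p(D=k|C=i) for all i <= j < k."""
    K = P.shape[0]
    for i in range(K):
        for j in range(K):
            for k in range(j + 1, K):
                if k <= i and (P[j, i] > P[k, i] or P[i, j] > P[i, k]):
                    return False
                if i <= j and (P[j, i] < P[k, i] or P[i, j] < P[i, k]):
                    return False
    return True

# Correct classification dominates, and misclassification tapers off away from the truth.
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
print(is_tapered(P))                       # True
print(is_tapered(np.full((3, 3), 1 / 3)))  # True: a completely uninformative proxy is also tapered
```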

Similar to a tapered misclassification, for categorical C and D, p(D | C) is a preferential distribution if there is a permutation π_D of the levels of D such that

p(d | c) = p_D if d = π_D(c), and p(d | c) = q_D otherwise,

with p_D > 1/K and q_D = (1 − p_D)/(K − 1). Note that p_D and q_D do not depend on either c or d. If p(d | c) = p_D, then we say that the level c prefers the level d or, equivalently, that the level d is preferred by the level c. We denote such levels of C and D by c_d and d_c, respectively. That p(A | C) is preferential is defined likewise, with π_A, p_A, q_A, c_a, and a_c defined accordingly. Preferential distributions resemble tapered distributions, but they differ in that they apply to categorical variables and the misclassification probabilities are all equal. For example, say that we are interested in the average causal effect of neighborhood (A) on math achievement (Y) when confounded by parental education (C). Although we do not know the parents' education, we do know their occupation (D). If individuals whose parents have education c prefer to live in the neighborhood a_c and have no preference among the other neighborhoods, and parents with education c prefer the occupation d_c and have no preference among the other occupations, then p(A | C) and p(D | C) are preferential distributions.
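As an illustration (again a minimal sketch with made-up numbers, not part of the article's supplementary code), a preferential distribution is fully determined by K, p_D, and the permutation π_D, so it can be constructed directly:

```python
import numpy as np

def preferential(K, p, perm=None):
    """Rows are p(D | C = c): probability p on the preferred level perm[c],
    and q = (1 - p) / (K - 1) on every other level. Requires p > 1/K."""
    assert p > 1.0 / K
    perm = np.arange(K) if perm is None else np.asarray(perm)
    q = (1.0 - p) / (K - 1)
    P = np.full((K, K), q)
    P[np.arange(K), perm] = p
    return P

# Parental education C prefers occupation D = pi_D(C); p_D = 0.7 is an illustrative value.
print(preferential(3, 0.7))
print(preferential(3, 0.7, perm=[2, 0, 1]))  # a non-identity preference pattern
```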

When C is sufficient for confounding control (as in the leftmost panel of Figure 1), the true causal average treatment effect, or risk difference, between two exposure levels a and ā is identifiable if C is observed. Let Y_a denote the potential outcome for a given subject, had that subject received exposure level A = a. The true causal risk difference is then defined as:

RD_true = E[Y_a] − E[Y_ā] = Σ_c E[Y | a, c] p(c) − Σ_c E[Y | ā, c] p(c).

Thus, E[Y_a] is a potential outcome measure that is obtained by standardizing E[Y | a, C] to p(C). When C is unobserved, RD_true is not identifiable. However, it can be approximated by the unadjusted or crude risk difference:

RD_crude = E[Y | a] − E[Y | ā]

and by the partially adjusted or observed risk difference:

RD_obs = S_a − S_ā = Σ_d E[Y | a, d] p(d) − Σ_d E[Y | ā, d] p(d).

Note that, just like E[Y_a], the quantities S_a and E[Y | a] are obtained by standardization: S_a standardizes E[Y | a, D] to p(D), whereas E[Y | a] standardizes E[Y | a, D] to p(D | a).
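To fix ideas, the following small sketch (with an arbitrary, made-up binary joint distribution; not part of the article's supplementary code) computes the three standardized quantities E[Y_a], S_a, and E[Y | a] exactly as just described:

```python
import numpy as np

# Made-up binary example: p(C=1), p(A=1 | C), p(D=1 | C), and E[Y | A=1, C].
pC1 = 0.4
pA1_C = np.array([0.3, 0.7])      # indexed by C = 0, 1
pD1_C = np.array([0.2, 0.8])
EY_a_C = np.array([1.0, 2.0])     # E[Y | A=1, C=c]

pC = np.array([1 - pC1, pC1])
pD_C = np.column_stack([1 - pD1_C, pD1_C])        # pD_C[c, d] = p(D=d | C=c)
joint_acd = pA1_C[:, None] * pD_C * pC[:, None]   # p(A=1, C=c, D=d)

# True standardized mean: E[Y_a] = sum_c E[Y | a, c] p(c)
EYa_true = EY_a_C @ pC
# Crude: E[Y | a] = sum_c E[Y | a, c] p(c | a)
EYa_crude = (EY_a_C * pA1_C * pC).sum() / (pA1_C * pC).sum()
# Partially adjusted: S_a = sum_d E[Y | a, d] p(d), with E[Y | a, d] = sum_c E[Y | a, c] p(c | a, d)
EY_ad = (EY_a_C[:, None] * joint_acd).sum(0) / joint_acd.sum(0)
pD = (pD_C * pC[:, None]).sum(0)
S_a = EY_ad @ pD

print(EYa_true, S_a, EYa_crude)   # here S_a lies between the true and crude values
```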

We may be interested in standardizing other C- and D-conditional association measures between A and Y than the expectation. For instance, we may be interested in the probability that the potential outcome is at most a given value. That is, we may be interested in the cumulative distribution function (CDF) F_{Y_a}(y) = p(Y_a ≤ y) at a fixed value y, which is obtained by standardizing F_{Y|a,C}(y) = p(Y ≤ y | a, C) to p(C). When C is unobserved, it can be approximated by standardizing F_{Y|a,D}(y) to p(D) or p(D | a).

3 Monotonic settings

We wish to know which of the two approximations comes closer to the unidentifiable true causal quantity. This section answers the question under some assumptions. The quality of RD_crude and RD_obs as approximations of RD_true is related to the quality of E[Y | a] and S_a as approximations of E[Y_a]. Thus, we study the latter question first. Specifically, the following theorem shows that S_a comes closer to E[Y_a] than E[Y | a] does. Moreover, the theorem allows us to conclude whether S_a is an upper or a lower bound of E[Y_a].

Theorem 1

Let A and Y be discrete or continuous random variables. Let C and D be ordinal random variables with levels 1, 2, …, K. Let p(A, D) be strictly positive. Let A = a be any exposure level such that p(a | C) and E[Y | a, C] are both monotonic in C. Let p(D | C) be a tapered distribution. Then, S_a ≤ E[Y_a] if and only if E[Y | a] ≤ S_a.

The previous theorem carries over to standardized conditional CDFs if F_{Y|a,C}(y) is monotonic in C, because the proof does not depend on which association measure is being standardized. That F_{Y|a,C}(y) is monotonic in C means that either F_{Y|a,C=i}(y) ≤ F_{Y|a,C=j}(y) for all i < j or F_{Y|a,C=i}(y) ≥ F_{Y|a,C=j}(y) for all i < j. When the former inequality holds for every value y, it is known as first-order stochastic dominance of F_{Y|a,C=i} over F_{Y|a,C=j}, and it is then a sufficient (but not necessary) condition for E[Y | a, C = i] ≥ E[Y | a, C = j] [5, Section 3.2].

The following theorem answers our original question by showing that RD obs is a better approximation to RD true than RD crude . The theorem extends a similar result in ref. [3, Theorem 1] for binary A .

Theorem 2

Let A and Y be discrete or continuous random variables. Let C and D be ordinal random variables with levels 1, 2, …, K. Let p(A, D) be strictly positive. Let A = a and A = ā be two exposure levels such that one of p(a | C) and p(ā | C) is non-decreasing and the other is non-increasing in C. Let E[Y | a, C] and E[Y | ā, C] both be monotonic in C. Let p(D | C) be a tapered distribution. Then, RD_obs lies between RD_true and RD_crude.

Note that the theorem allows us to conclude whether RD_obs is an upper or a lower bound of the unidentifiable RD_true. That is, RD_obs ≤ RD_true if and only if RD_crude ≤ RD_obs, and the latter inequality is empirically testable. Note the assumption about p(ā | C) in the theorem. It is unnecessary in the result in ref. [3] because, when A is binary, the assumption about p(a | C) implies the assumption about p(ā | C). Although the proofs of both results are similar, we include the proof of our result in the Appendix because the proof in ref. [3] is rather brief and lacks important details, such as the requirement that p(A, D) is strictly positive.
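The following sketch (illustrative numbers chosen to satisfy the conditions of Theorem 2 — monotone exposure and outcome relations in C and a tapered proxy — and not part of the article's supplementary code) computes RD_true, RD_crude, and RD_obs and checks that RD_obs falls between the other two:

```python
import numpy as np

def risk_diffs(pC, pA_C, pD_C, EY_AC, a=1, abar=0):
    """pC: p(C); pA_C[i, c] = p(A=i | C=c); pD_C[c, d] = p(D=d | C=c);
    EY_AC[i, c] = E[Y | A=i, C=c]. Returns (RD_true, RD_crude, RD_obs) for a vs abar."""
    joint = pA_C[:, :, None] * pD_C[None, :, :] * pC[None, :, None]   # p(A=i, C=c, D=d)
    rd_true = (EY_AC[a] - EY_AC[abar]) @ pC
    crude = [(EY_AC[i] * pA_C[i] * pC).sum() / (pA_C[i] * pC).sum() for i in (a, abar)]
    EY_AD = (EY_AC[:, :, None] * joint).sum(1) / joint.sum(1)          # E[Y | A=i, D=d]
    pD = pC @ pD_C
    S = EY_AD @ pD
    return rd_true, crude[0] - crude[1], S[a] - S[abar]

pC = np.array([0.3, 0.4, 0.3])
pA_C = np.array([[0.8, 0.5, 0.2],      # p(A=0 | C), non-increasing
                 [0.2, 0.5, 0.8]])     # p(A=1 | C), non-decreasing
pD_C = np.array([[0.7, 0.2, 0.1],      # tapered misclassification
                 [0.2, 0.6, 0.2],
                 [0.1, 0.2, 0.7]])
EY_AC = np.array([[0.5, 1.0, 2.5],     # E[Y | A=0, C], non-decreasing
                  [1.0, 2.0, 3.0]])    # E[Y | A=1, C], non-decreasing

rd_true, rd_crude, rd_obs = risk_diffs(pC, pA_C, pD_C, EY_AC)
print(rd_true, rd_obs, rd_crude)                       # roughly 0.70, 1.23, 1.42
assert min(rd_true, rd_crude) <= rd_obs <= max(rd_true, rd_crude)
# Here rd_crude >= rd_obs, so the theorem further implies rd_obs >= rd_true (an upper bound).
```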

Like Theorem 1, Theorem 2 carries over to differences between standardized conditional CDFs if F_{Y|a,C}(y) is monotonic in C, which allows us to sort the true, crude, and observed CDF differences between two exposure levels. Finally, Theorem 2 in fact holds for any effect measure that can be written as h(g(E[Y_a]) − g(E[Y_ā])), where g(·) and h(·) are monotonic functions. Therefore, the theorem holds, for instance, for the causal risk ratio and the causal odds ratio.

4 Non-monotonic settings

In this section, we do not assume that E[Y | A, C] is monotonic in C for the two exposure levels contrasted. Instead, we identify alternative assumptions that still allow us to conclude that RD_obs lies between RD_true and RD_crude and, thus, that RD_obs comes closer to the unidentifiable RD_true than RD_crude does. Specifically, consider again the causal graph to the left in Figure 1.

Theorem 3

Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels 1, 2, …, K. Let p(C) be a discrete uniform distribution, and let p(A | C) and p(D | C) be preferential distributions. Then, for any two exposure levels a and ā, RD_obs lies between RD_true and RD_crude.

Note that the theorem also allows us to conclude whether RD_obs is an upper or a lower bound of the unidentifiable RD_true. That is, RD_obs ≤ RD_true if and only if RD_crude ≤ RD_obs, and the latter inequality is empirically testable.

The following corollary shows that, as expected, the bias of RD_obs decreases with increasing quality of D as a proxy of C (except when p_A = 1 because, in that case, Y is conditionally independent of D given A and, thus, RD_obs reduces to RD_crude).

Corollary 4

Under the conditions of Theorem 3, if p_A < 1, then the difference between RD_obs and RD_true decreases as p_D increases.
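A numerical illustration of Theorem 3 and Corollary 4 (the outcome means are deliberately non-monotonic and otherwise arbitrary, and the small risk_diffs helper from the sketch in Section 3 is repeated so that the block runs on its own; this is not part of the article's supplementary code):

```python
import numpy as np

def preferential(K, p):                          # identity permutation, p > 1/K
    q = (1 - p) / (K - 1)
    return np.full((K, K), q) + (p - q) * np.eye(K)

def risk_diffs(pC, pA_C, pD_C, EY_AC, a=0, abar=1):
    joint = pA_C[:, :, None] * pD_C[None, :, :] * pC[None, :, None]   # p(A, C, D)
    rd_true = (EY_AC[a] - EY_AC[abar]) @ pC
    crude = [(EY_AC[i] * pA_C[i] * pC).sum() / (pA_C[i] * pC).sum() for i in (a, abar)]
    S = ((EY_AC[:, :, None] * joint).sum(1) / joint.sum(1)) @ (pC @ pD_C)
    return rd_true, crude[0] - crude[1], S[a] - S[abar]

K = 3
pC = np.full(K, 1 / K)                           # uniform p(C)
pA_C = preferential(K, 0.6).T                    # pA_C[i, c] = p(A=i | C=c), preferential
EY_AC = np.array([[0.9, 0.1, 0.6],               # non-monotonic in C on purpose
                  [0.2, 0.8, 0.3],
                  [0.5, 0.5, 0.5]])

for p_D in (0.4, 0.6, 0.8, 0.95):
    pD_C = preferential(K, p_D)
    rd_true, rd_crude, rd_obs = risk_diffs(pC, pA_C, pD_C, EY_AC)
    assert min(rd_true, rd_crude) - 1e-12 <= rd_obs <= max(rd_true, rd_crude) + 1e-12  # Theorem 3
    print(p_D, abs(rd_obs - rd_true))            # shrinks as p_D grows (Corollary 4)
```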

Counterexamples to Theorem 3 can be constructed when the conditions are violated, via the code provided at: https://www.dropbox.com/s/pv2y84f4w38be8i/necessityBinary.R?dl=0 for binary variables, and https://www.dropbox.com/s/5psjqujfqj6v5d9/necessityTernary.R?dl=0  for ternary variables. Note that one must rely on knowledge external to the observed data to verify the conditions in the theorem, since C is unobserved. The following corollary partially alleviates this, since the uniformity of p ( D ) is empirically testable.

Corollary 5

Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels 1, 2, …, K. Then, p(C) is a discrete uniform distribution and p(D | C) is a preferential distribution if and only if p(D) is a discrete uniform distribution and p(C | D) is a preferential distribution. Likewise replacing D with A.

As in the monotonic case, an analogous result to that in Theorem 3 holds for E[Y_a], S_a, and E[Y | a].

Corollary 6

Under the conditions of Theorem 3, S_a ≤ E[Y_a] if and only if E[Y | a] ≤ S_a.

As in the monotonic case, the results above carry over to standardizing other association measures than the conditional expectation, e.g., the conditional CDF. The only difference is that the previous requirement of F_{Y|A,C}(y) being monotonic in C is now replaced by the requirements of Theorem 3.

Although we have restricted consideration to the causal graph to the left in Figure 1, these results apply to a wider class of diagrams and settings. For instance, consider the causal graph in the center of Figure 1. Let A, B, and Z be categorical random variables with equal cardinality. Let U be unobserved. The average causal effect of the exposure level a on Y can be computed as

(1) E[Y_a] = Σ_z p(z | a) E[Y_z] = Σ_z p(z | a) Σ_ā E[Y | z, ā] p(ā)

by the front-door criterion [6, Section 3.3.2]. Now, suppose that A is not observed but B is observed. Then, E[Y_z] is not computable, but it can be approximated by E[Y | z] or by S_z = Σ_b E[Y | z, b] p(b). Let S_a^crude and S_a^obs denote the results of plugging these two approximations into equation (1). Under some conditions, we can conclude that S_a^obs comes closer to E[Y_a] than S_a^crude, and whether S_a^obs is an upper or a lower bound of E[Y_a]. Specifically, suppose that p(Z | A) is known, e.g., from substantive knowledge or from a previous study. Suppose also that p(A) is a discrete uniform distribution, and that p(Z | A) and p(B | A) are preferential distributions. Now, note that if E[Y | z] ≤ S_z for all z, then S_z ≤ E[Y_z] for all z by Corollary 6 and, thus, S_a^crude ≤ S_a^obs ≤ E[Y_a]. Likewise, if E[Y | z] ≥ S_z for all z, then S_a^crude ≥ S_a^obs ≥ E[Y_a].
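The following sketch makes the front-door variant concrete (the latent U, all distributions, and the helper are made-up choices satisfying the conditions above; it is not part of the article's supplementary code): it verifies equation (1) against the structural truth when A is observed, and then compares S_a^crude and S_a^obs when only the proxy B is available.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
def preferential(K, p):                       # identity permutation
    q = (1 - p) / (K - 1)
    return np.full((K, K), q) + (p - q) * np.eye(K)

# Latent binary U confounds A and Y; rows of pA_U are chosen so that p(A) is uniform.
pU = np.array([0.5, 0.5])
pA_U = np.array([[0.5, 1/3, 1/6],
                 [1/6, 1/3, 0.5]])            # pA_U[u, a] = p(A=a | U=u)
pZ_A = preferential(K, 0.6)                   # pZ_A[a, z] = p(Z=z | A=a), preferential
pB_A = preferential(K, 0.7)                   # pB_A[a, b] = p(B=b | A=a), preferential
EY_ZU = rng.uniform(0, 1, size=(K, 2))        # E[Y | Z=z, U=u], arbitrary

pA = pU @ pA_U                                # uniform by construction
truth = np.einsum('u,az,zu->a', pU, pZ_A, EY_ZU)            # E[Y_a] from the structural model

pU_A = (pU[:, None] * pA_U) / pA                            # p(u | a)
EY_ZA = np.einsum('zu,ua->za', EY_ZU, pU_A)                 # E[Y | z, a] (Z indep. of U given A)
frontdoor = pZ_A @ (EY_ZA @ pA)                             # equation (1) with A observed
assert np.allclose(frontdoor, truth)

# A unobserved: inner standardization replaced by the crude E[Y | z] or by S_z via the proxy B.
pZ = pA @ pZ_A
pA_Z = (pA[:, None] * pZ_A).T / pZ[:, None]                 # p(a | z)
EY_Z = (EY_ZA * pA_Z).sum(1)                                # E[Y | z]
pAZB = pA[:, None, None] * pZ_A[:, :, None] * pB_A[:, None, :]   # p(a, z, b)
pA_ZB = pAZB / pAZB.sum(0)                                  # p(a | z, b)
EY_ZB = np.einsum('za,azb->zb', EY_ZA, pA_ZB)               # E[Y | z, b]
S_Z = EY_ZB @ (pA @ pB_A)                                   # S_z = sum_b E[Y | z, b] p(b)

S_crude, S_obs = pZ_A @ EY_Z, pZ_A @ S_Z
if np.all(EY_Z <= S_Z) or np.all(EY_Z >= S_Z):              # consistent direction across z (see text)
    assert np.all((S_obs - S_crude) * (truth - S_obs) >= -1e-12)
print(truth, S_obs, S_crude)
```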

4.1 Binary exposure and confounder

In this section, we consider binary A, C, and D and replace the previous assumptions that p(A | C) and p(D | C) are preferential distributions. The new assumptions yield weaker but still useful results. For instance, the following two theorems do not determine the order between RD_crude and RD_obs, but they may determine whether RD_true is positive or negative. More specifically, the first result below allows us to conclude that RD_true < 0 whenever RD_crude < 0 or RD_obs < 0.

Theorem 7

Let Y be a discrete or continuous random variable. Let A, C, and D be binary random variables. Let p(c) = 0.5, p(ā | c̄) ≥ p(a | c) ≥ 0.5, and p(d̄ | c̄) ≥ p(d | c) ≥ 0.5. If E[Y | a, c] − E[Y | a, c̄] ≥ E[Y | ā, c̄] − E[Y | ā, c] ≥ 0, then RD_crude ≥ RD_true and RD_obs ≥ RD_true. If E[Y | a, c] − E[Y | a, c̄] ≤ E[Y | ā, c̄] − E[Y | ā, c] ≤ 0, then RD_crude ≤ RD_true and RD_obs ≤ RD_true.

Theorem 8

Let Y be a discrete or continuous random variable. Let A, C, and D be binary random variables. Let p(c) = 0.5, p(a | c) ≥ p(ā | c̄) ≥ 0.5, and p(d | c) ≥ p(d̄ | c̄) ≥ 0.5. If 0 ≤ E[Y | a, c] − E[Y | a, c̄] ≤ E[Y | ā, c̄] − E[Y | ā, c], then RD_crude ≤ RD_true and RD_obs ≤ RD_true. If 0 ≥ E[Y | a, c] − E[Y | a, c̄] ≥ E[Y | ā, c̄] − E[Y | ā, c], then RD_crude ≥ RD_true and RD_obs ≥ RD_true.

Note that p(A | C) is not necessarily a preferential distribution in the theorems above because, possibly, p(a | c) > p(ā | c̄) > 0.5, i.e., the probabilities of the preferred levels need not be equal. Likewise for p(D | C).

For example, let A, D, and Y represent three diseases, and C a gene variant that affects all three of them. Moreover, suppose that suffering A affects the risk of suffering Y. The first result in Theorem 7 holds if (i) half of the population carry the gene variant C, i.e., p(c) = 0.5, (ii) not carrying C protects against A and D more than carrying it predisposes to suffering the diseases, i.e., p(ā | c̄) ≥ p(a | c) ≥ 0.5 and p(d̄ | c̄) ≥ p(d | c) ≥ 0.5, and (iii) carrying C increases the average severity of Y for the individuals suffering A more than it decreases the average severity for the rest, i.e., E[Y | a, c] − E[Y | a, c̄] ≥ E[Y | ā, c̄] − E[Y | ā, c] ≥ 0.
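A quick numerical check of the first case of Theorem 7 (the probabilities and outcome means are invented but satisfy conditions (i)-(iii) above, and the risk-difference computation from the earlier sketches is inlined; not part of the article's supplementary code):

```python
import numpy as np

# Binary C, A, D; index 1 corresponds to c, a, d and index 0 to their complements.
pC = np.array([0.5, 0.5])                      # (i)  p(c) = 0.5
pA_C = np.array([[0.8, 0.4],                   # p(A=0 | C): p(abar | cbar) = 0.8
                 [0.2, 0.6]])                  # p(A=1 | C): p(a | c) = 0.6; 0.8 >= 0.6 >= 0.5 (ii)
pD_C = np.array([[0.9, 0.1],                   # p(D | C=0): p(dbar | cbar) = 0.9
                 [0.3, 0.7]])                  # p(D | C=1): p(d | c) = 0.7; 0.9 >= 0.7 >= 0.5 (ii)
EY_AC = np.array([[2.0, 1.5],                  # E[Y | abar, C]: decreasing in C
                  [1.0, 3.0]])                 # E[Y | a, C]: increasing in C (non-monotone setting)
# (iii): (3.0 - 1.0) >= (2.0 - 1.5) >= 0

joint = pA_C[:, :, None] * pD_C[None, :, :] * pC[None, :, None]     # p(A, C, D)
rd_true = (EY_AC[1] - EY_AC[0]) @ pC
crude = [(EY_AC[i] * pA_C[i] * pC).sum() / (pA_C[i] * pC).sum() for i in (1, 0)]
S = ((EY_AC[:, :, None] * joint).sum(1) / joint.sum(1)) @ (pC @ pD_C)
rd_crude, rd_obs = crude[0] - crude[1], S[1] - S[0]

print(rd_true, rd_crude, rd_obs)     # roughly 0.25, 0.67, 0.56
assert rd_crude >= rd_true and rd_obs >= rd_true   # conclusion of Theorem 7, first case
```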

The previous two theorems can be strengthened for RD_crude by relaxing the assumption that p(c) = 0.5. However, analogous results do not hold for RD_obs.

Theorem 9

Let Y be a discrete or continuous random variable. Let A and C be binary random variables. Let p(c) ≤ 0.5 and p(ā | c̄) ≥ p(a | c) ≥ 0.5. If E[Y | a, c] − E[Y | a, c̄] ≥ E[Y | ā, c̄] − E[Y | ā, c] ≥ 0, then RD_crude ≥ RD_true. If E[Y | a, c] − E[Y | a, c̄] ≤ E[Y | ā, c̄] − E[Y | ā, c] ≤ 0, then RD_crude ≤ RD_true.

Theorem 10

Let Y be a discrete or continuous random variable. Let A and C be binary random variables. Let p(c) ≥ 0.5 and p(a | c) ≥ p(ā | c̄) ≥ 0.5. If 0 ≤ E[Y | a, c] − E[Y | a, c̄] ≤ E[Y | ā, c̄] − E[Y | ā, c], then RD_crude ≤ RD_true. If 0 ≥ E[Y | a, c] − E[Y | a, c̄] ≥ E[Y | ā, c̄] − E[Y | ā, c], then RD_crude ≥ RD_true.

4.2 Sensitivity to assumption violations

In this section, we study how sensitive the results of Theorem 3 are to small departures from the assumptions. Specifically, we study how the ordering of RD_true, RD_crude, and RD_obs changes when p(C), p(A | C), and p(D | C) are replaced by distributions that deviate slightly from the assumptions in the theorem. We use the superscript ∗ to denote the deviating probability distributions and their corresponding causal risk differences. The following definition formalizes what we mean by probabilities and risk differences deviating from the ones in Theorem 3. We say that a quantity r is an ε-approximation to a quantity s if

−ε ≤ log(r/s) ≤ ε

or, in other words, exp(−ε) s ≤ r ≤ exp(ε) s. Note that r is an ε-approximation to s if and only if s is an ε-approximation to r. Note also that r and s must be of the same sign. Therefore, in order for the ε-approximation definition to be well-defined for r = RD*_true, RD*_obs, RD*_crude and s = RD_true, RD_obs, RD_crude, respectively, we limit our analysis to those cases where E[Y | a, c] ≥ 0 ≥ E[Y | ā, c] for all c and the given exposure levels a and ā. This implies that RD_true, RD_obs, RD_crude, RD*_true, RD*_obs, and RD*_crude are all non-negative. Specifically, note that RD_true, RD*_true ≥ 0 follows from E[Y | a, c] ≥ 0 ≥ E[Y | ā, c] for all c. Likewise, RD_obs, RD*_obs ≥ 0 follows from E[Y | a, d] ≥ 0 ≥ E[Y | ā, d] for all d, which follows by noting that E[Y | a, d] = Σ_c E[Y | a, c] p(c | a, d) ≥ 0 and similarly E[Y | ā, d] ≤ 0. Finally, RD_crude, RD*_crude ≥ 0 follows from E[Y | a] ≥ 0 ≥ E[Y | ā], which follows by noting that E[Y | a] = Σ_c E[Y | a, c] p(c | a) ≥ 0 and similarly E[Y | ā] ≤ 0.
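A tiny helper (illustrative only) making the ε-approximation relation concrete:

```python
import numpy as np

def is_eps_approx(r, s, eps):
    """r is an eps-approximation of s iff they share a sign and |log(r/s)| <= eps (entry-wise)."""
    r, s = np.asarray(r, float), np.asarray(s, float)
    return bool(np.all(r * s > 0) and np.all(np.abs(np.log(r / s)) <= eps))

print(is_eps_approx(1.05, 1.0, 0.1))            # True:  |log 1.05| ~ 0.049
print(is_eps_approx(0.9, 1.0, 0.05))            # False: |log 0.9| ~ 0.105
print(is_eps_approx([0.35, 0.32, 0.33], [1/3, 1/3, 1/3], 0.05))   # True, entry-wise
```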

The following results indicate that Theorem 3 is not too sensitive to small violations of the assumptions.

Theorem 11

Assume that the conditions of Theorem 3 hold. Moreover, let p*(c), p*(a | c), and p*(d | c) be ε-approximations to p(c), p(a | c), and p(d | c) for all a, c, and d. Given two exposure levels a and ā, if E[Y | a, c] ≥ 0 ≥ E[Y | ā, c] for all c, |log(RD_true / RD_obs)| ≥ 3ε, and |log(RD_obs / RD_crude)| ≥ 6ε, then the ordering among RD*_true, RD*_crude, and RD*_obs is the same as the ordering among RD_true, RD_crude, and RD_obs.

Corollary 12

Assume that the conditions of Theorem 3 hold. Moreover, let p*(c), p*(a | c), and p*(d | c) be ε-approximations to p(c), p(a | c), and p(d | c) for all a, c, and d. Given two exposure levels a and ā, if E[Y | a, c] ≥ 0 ≥ E[Y | ā, c] for all c, |log(RD_true / RD_obs)| ≥ 3ε, and the ordering between RD*_crude and RD*_obs is the same as the ordering between RD_crude and RD_obs, then the ordering among RD*_true, RD*_crude, and RD*_obs is the same as the ordering among RD_true, RD_crude, and RD_obs.

We performed simulations to investigate the conclusion above that Theorem 3 is not too sensitive to small departures from the assumptions. Specifically, we let A, C, and D be ternary random variables and Y be binary. These simulations therefore complement the sensitivity analysis above, in the sense that we do not assume that E[Y | a, c] ≥ 0 ≥ E[Y | ā, c]. We consider p(C) = (1/3, 1/3, 1/3) and p(Y = 1 | a, c) ∼ Uniform(0, 1) for all a and c. Moreover, we consider

  • p(A | C = 1) = (p_A, (1 − p_A)/2, (1 − p_A)/2),

  • p(A | C = 2) = ((1 − p_A)/2, p_A, (1 − p_A)/2), and

  • p(A | C = 3) = ((1 − p_A)/2, (1 − p_A)/2, p_A),

with p_A ∼ Uniform(1/3, 3/4) to avoid almost deterministic relations, and likewise for p(D | C). We consider four values of ε in the interval [0.03, 0.15]. For each value of ε, we sample 10,000 sets of distributions {p*(C), p*(A | C), p*(D | C), p(Y | A, C)} such that p*(c), p*(a | c), and p*(d | c) are ε-approximations to p(c), p(a | c), and p(d | c) for all a, c, and d. The code to generate the simulations is available at https://www.dropbox.com/s/oh9pazehqkp8ty3/necessityCategorical_experiments_random_pref_distWeb.zip?dl=0.
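The following condensed sketch mirrors the simulation just described (it is not the linked R code; it inlines the preferential and risk-difference helpers from the earlier sketches and uses rejection sampling for the ε-perturbation, which the original scripts may implement differently):

```python
import numpy as np

rng = np.random.default_rng(0)
K, EPS, N = 3, 0.15, 10_000

def preferential_rows(p):                      # rows are p(. | C=c), identity permutation
    q = (1 - p) / (K - 1)
    return np.full((K, K), q) + (p - q) * np.eye(K)

def perturb(P, eps):
    """Multiply entries by random factors in [exp(-eps), exp(eps)], renormalize,
    and retry until the result is still an entry-wise eps-approximation of P."""
    while True:
        Q = P * np.exp(rng.uniform(-eps, eps, P.shape))
        Q = Q / Q.sum(axis=-1, keepdims=True)
        if np.all(np.abs(np.log(Q / P)) <= eps):
            return Q

def risk_diffs(pC, pA_C, pD_C, EY_AC, a=0, abar=1):
    joint = pA_C[:, :, None] * pD_C[None, :, :] * pC[None, :, None]
    rd_true = (EY_AC[a] - EY_AC[abar]) @ pC
    crude = [(EY_AC[i] * pA_C[i] * pC).sum() / (pA_C[i] * pC).sum() for i in (a, abar)]
    S = ((EY_AC[:, :, None] * joint).sum(1) / joint.sum(1)) @ (pC @ pD_C)
    return rd_true, crude[0] - crude[1], S[a] - S[abar]

violations = 0
for _ in range(N):
    pA_C = perturb(preferential_rows(rng.uniform(1/3, 3/4)), EPS).T   # pA_C[i, c] = p(A=i | C=c)
    pD_C = perturb(preferential_rows(rng.uniform(1/3, 3/4)), EPS)     # pD_C[c, d] = p(D=d | C=c)
    pC = perturb(np.full(K, 1 / K), EPS)
    EY_AC = rng.uniform(0, 1, (K, K))                                 # p(Y=1 | A=i, C=c)
    rd_true, rd_crude, rd_obs = risk_diffs(pC, pA_C, pD_C, EY_AC)
    if not (min(rd_true, rd_crude) <= rd_obs <= max(rd_true, rd_crude)):
        violations += 1
print(violations / N)   # fraction of sampled sets violating the conclusion of Theorem 3
```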

The top plot in Figure 2 summarizes the distributions p*(C) sampled, demonstrating how much they deviate from p(C). Similar deviations are observed for p*(A | c) and p*(D | c) with respect to p(A | c) and p(D | c) for all c. For each set of distributions sampled, we checked whether the conclusion of Theorem 3 holds or not. The bottom plot in Figure 2 shows the percentage of violations. Although the percentage of violations increases with increasing ε, it does not do so abruptly. Thus, the theorem is likely to hold when the conditions are only approximately met. In further simulations, not presented here, we found that the percentage of violations grows linearly up to ε = 1.1, where it plateaus at 22%.


Figure 2

Top: Summary of the 10,000 distributions p*(C) sampled for different values of ε. Bottom: Percentage of the sampled distributions for which the conclusion of Theorem 3 does not hold.

5 Discussion

It may seem that adjusting for a proxy of a latent confounder is always superior to not adjusting. Unfortunately, this is true only in some cases. In this article, we have characterized two such cases: one for monotonic settings and the other for non-monotonic settings. For each case, we have described sufficient conditions under which adjusting for a proxy of a latent confounder comes closer to the unidentifiable true average causal effect than not adjusting at all. We have also shown that the result for non-monotonic settings may continue to hold under small violations of the assumptions. Thus, it is likely our suggested set of assumptions is not the weakest possible set of assumptions that guarantees the result. However, in this same non-monotonic setting we have argued that the assumptions are not excessive, as we easily found counterexamples where the result did not hold when the assumptions were violated.

The assumptions in this work are not empirically testable in most settings, and thus substantive knowledge is needed to confirm that they hold. It is of note that, when A, C, D, and Y are all continuous and follow the linear structural equation model represented by the path diagram to the right in Figure 1, it is possible to construct testable hypotheses for the conditions, as we demonstrate in Appendix E. It is an area of future research to investigate whether the assumptions can be replaced by empirically testable assumptions that are still realistic in other settings.

Acknowledgements

The authors gratefully acknowledge financial support from the Swedish Research Council (VR SWE-REG grants 2019-00245 and 2019-00227).

Conflict of interest: Authors state no conflict of interest.

Appendix A Proofs for Section 3

Lemma 13

Let A and Y be discrete or continuous random variables. Let C and D be ordinal random variables with levels 1, 2, …, K. Let p(A, D) be strictly positive. Let A = a be any exposure level. If p(a | C) and E[Y | a, C] are both non-decreasing or both non-increasing in C, then S_a ≥ E[Y_a]. If one of p(a | C) and E[Y | a, C] is non-decreasing and the other non-increasing in C, then S_a ≤ E[Y_a].

Proof

We prove that S_a ≥ E[Y_a] when p(a | C) and E[Y | a, C] are both non-decreasing in C. The other cases can be proven analogously. Specifically, we want to show that

(2) S_a = Σ_d E[Y | a, d] p(d) = Σ_c Σ_d E[Y | a, c, d] p(c | a, d) p(d) = Σ_c E[Y | a, c] Σ_d p(c | a, d) p(d) ≥ Σ_c E[Y | a, c] p(c) = E[Y_a].

Actually, it suffices to show that

(3) Σ_{c ≥ k} Σ_d p(c | a, d) p(d) ≥ Σ_{c ≥ k} p(c)

for all k. To see it, let E_c = E[Y | a, c], α_c = Σ_d p(c | a, d) p(d), and β_c = p(c) for all c. Then, equation (2) can be rewritten as

S_a = E_1 α_1 + ⋯ + E_K α_K = E_1 (α_1 + ⋯ + α_K) + (E_2 − E_1)(α_2 + ⋯ + α_K) + ⋯ + (E_K − E_{K−1}) α_K ≥ E_1 (β_1 + ⋯ + β_K) + (E_2 − E_1)(β_2 + ⋯ + β_K) + ⋯ + (E_K − E_{K−1}) β_K = E_1 β_1 + ⋯ + E_K β_K = E[Y_a],

where E_k − E_{k−1} ≥ 0 for all k because E[Y | a, C] is non-decreasing in C by assumption, and α_1 + ⋯ + α_K = β_1 + ⋯ + β_K = 1. The latter is important because E_1 may be negative.

Proving equation (3) is equivalent to proving

Σ_d p(d) Σ_{c ≥ k} p(c | a, d) ≥ Σ_d p(d) Σ_{c ≥ k} p(c | d)

and, thus, it suffices to prove that

Σ_{c ≥ k} p(c | a, d) ≥ Σ_{c ≥ k} p(c | d)

holds for all d since p(A, D) is strictly positive by assumption or, equivalently, that

[Σ_{c ≥ k} p(a | c) p(d | c) p(c)] / [Σ_c p(a | c) p(d | c) p(c)] ≥ [Σ_{c ≥ k} p(d | c) p(c)] / [Σ_c p(d | c) p(c)]

holds for all d. We actually prove that

Σ_{c ≥ k} p(a | c) p(d, c) · Σ_c p(d, c) ≥ Σ_{c ≥ k} p(d, c) · Σ_c p(a | c) p(d, c).

To do so, we rewrite the previous inequality as

Σ_{c ≥ k} Σ_{c̄} p(a | c) p(d, c) p(d, c̄) ≥ Σ_{c ≥ k} Σ_{c̄} p(a | c̄) p(d, c) p(d, c̄)

or, equivalently, as

Σ_{c ≥ k, c̄ < k} p(a | c) p(d, c) p(d, c̄) + Σ_{c > c̄ ≥ k} p(a | c) p(d, c) p(d, c̄) + Σ_{c̄ > c ≥ k} p(a | c) p(d, c) p(d, c̄) + Σ_{c̄ = c ≥ k} p(a | c) p(d, c) p(d, c̄) ≥ Σ_{c ≥ k, c̄ < k} p(a | c̄) p(d, c) p(d, c̄) + Σ_{c > c̄ ≥ k} p(a | c̄) p(d, c) p(d, c̄) + Σ_{c̄ > c ≥ k} p(a | c̄) p(d, c) p(d, c̄) + Σ_{c̄ = c ≥ k} p(a | c̄) p(d, c) p(d, c̄).

Now, note that the second, third, and fourth terms of the left side coincide, respectively, with the third, second, and fourth terms of the right side. Finally, the first term of the left side is equal to or greater than the first term of the right side by the assumption that p(a | C) is non-decreasing in C.□

Lemma 14

Let A and Y be discrete or continuous random variables. Let C and D be ordinal random variables with levels 1, 2, …, K. Let p(D | C) be a tapered distribution. Let A = a be any exposure level. If p(a | C) and E[Y | a, C] are both non-decreasing or both non-increasing in C, then E[Y | a] ≥ S_a. If one of p(a | C) and E[Y | a, C] is non-decreasing and the other non-increasing in C, then E[Y | a] ≤ S_a.

Proof

We prove that E[Y | a] ≥ S_a when p(a | C) and E[Y | a, C] are both non-decreasing in C. The other cases can be proven analogously. Specifically, we show that the minimum of E[Y | a] − S_a with respect to E[Y | a, c] for all c is 0, regardless of p(a | C), p(C), and p(D | C). To this end, note that

(4) E[Y | a] − S_a = Σ_c E[Y | a, c] p(c | a) − Σ_d E[Y | a, d] p(d) = Σ_c E[Y | a, c] p(c | a) − Σ_d Σ_c E[Y | a, c, d] p(c | a, d) p(d) = Σ_c E[Y | a, c] p(a | c) p(c) / p(a) − Σ_d Σ_c E[Y | a, c] p(d | c) p(c | a) p(d) / p(d | a) = Σ_c E[Y | a, c] p(a | c) p(c) / p(a) − Σ_d Σ_c E[Y | a, c] p(d | c) [p(a | c) p(c) / p(a)] p(d) / [p(a | d) p(d) / p(a)] = Σ_c E[Y | a, c] p(a | c) p(c) / p(a) − Σ_d Σ_c E[Y | a, c] p(d | c) p(a | c) p(c) / p(a | d).

Therefore, the derivative of E[Y | a] − S_a with respect to E[Y | a, c] for any c is

(5) p(a | c) p(c) [1 / p(a) − Σ_d p(d | c) / p(a | d)].

Moreover, we prove below that Σ_d p(d | C) / p(a | d) is non-increasing in C. Then, because 1/p(a) is constant in C, one of the following cases must hold: equation (5) is positive for all c, or it is negative for all c, or it is negative for all c below some cutoff and positive for all c above the cutoff. In all three cases, equation (4) is minimized by setting E[Y | a, C] to be constant in C, because E[Y | a, C] is non-decreasing in C by assumption. This implies that the minimum of equation (4) is 0.

Finally, note that if p(a | D) is non-decreasing in D then, for any c and i < j, increasing p(D = i | c) by α while reducing p(D = j | c) by α increases the value of Σ_d p(d | c) / p(a | d) or leaves it unchanged. By the assumption that p(D | C) is tapered, we can transform p(D | c̄) into p(D | c) for any c < c̄ by moving probability mass as just explained and, moreover, Σ_d p(d | c) / p(a | d) ≥ Σ_d p(d | c̄) / p(a | d). Therefore, Σ_d p(d | C) / p(a | d) is non-increasing in C. We now argue that p(a | D) is non-decreasing in D or, equivalently, that

[Σ_i p(a | C = i) p(C = i) p(d | C = i)] / [Σ_j p(C = j) p(d | C = j)] ≤ [Σ_i p(a | C = i) p(C = i) p(d̄ | C = i)] / [Σ_j p(C = j) p(d̄ | C = j)]

with d < d̄. We actually prove that

0 ≤ Σ_i Σ_j p(a | C = i) p(C = i) p(d̄ | C = i) p(C = j) p(d | C = j) − Σ_i Σ_j p(a | C = i) p(C = i) p(d | C = i) p(C = j) p(d̄ | C = j) = Σ_i Σ_{j < i} [p(a | C = i) − p(a | C = j)] × [p(C = i) p(d̄ | C = i) p(C = j) p(d | C = j) − p(C = i) p(d | C = i) p(C = j) p(d̄ | C = j)].

Now, simply note that the differences in square brackets are non-negative, because p(a | C) is non-decreasing in C and p(D | C) is tapered by assumption.□

Proof of Theorem 1

It follows from Lemmas 13 and 14.□

Proof of Theorem 2

Assume that E[Y | a, C] and E[Y | ā, C] are both non-decreasing in C, and that p(a | C) and p(ā | C) are non-decreasing and non-increasing in C, respectively. Then, E[Y | a] ≥ S_a ≥ E[Y_a] and E[Y | ā] ≤ S_ā ≤ E[Y_ā] by Lemmas 13 and 14. The other cases follow analogously.□

B Proofs for Section 4

Lemma 15

Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels 1, 2, …, K. Let p(C) be a discrete uniform distribution, and let p(A | C) and p(D | C) be preferential distributions. Then, p(A) and p(D) are discrete uniform distributions.

Proof

We prove that p(A) is uniform by considering any two levels of A, say a and ā, and noting that

p(a) = Σ_c p(a | c) p(c) = p(a | c_a) p(c_a) + Σ_{c ≠ c_a} p(a | c) p(c) = p(ā | c_ā) p(c_ā) + Σ_{c ≠ c_ā} p(ā | c) p(c) = p(ā).

Likewise for p(D).□

Lemma 16

Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels 1, 2, …, K. Let p(C) be a discrete uniform distribution, and let p(A | C) and p(D | C) be preferential distributions. Then, for any exposure level a, we have that p(c | a, d) = p(c̄ | a, d̄) for all c, c̄, d, and d̄ such that (c = c_a if and only if c̄ = c_a) and (d = d_c if and only if d̄ = d_c̄). Moreover, for any two exposure levels a and ā, we have that p(c | a, d) = p(c̄ | ā, d̄) for all c, c̄, d, and d̄ such that (c = c_a if and only if c̄ = c_ā) and (d = d_c if and only if d̄ = d_c̄).

Proof

We prove first the result for the exposure level a. Note that p(C | D) = p(D | C) by Lemma 15. Then,

p(c | a, d) ∝ p(a | c, d) p(c | d) = p(a | c) p(d | c)

and likewise p(c̄ | a, d̄) ∝ p(a | c̄) p(d̄ | c̄). Then, p(a | c) = p(a | c̄) when (c = c_a if and only if c̄ = c_a), and p(d | c) = p(d̄ | c̄) when (d = d_c if and only if d̄ = d_c̄). We now prove the result for two exposure levels a and ā. Note that p(c̄ | ā, d̄) ∝ p(ā | c̄) p(d̄ | c̄). Then, p(a | c) = p(ā | c̄) when (c = c_a if and only if c̄ = c_ā), and p(d | c) = p(d̄ | c̄) when (d = d_c if and only if d̄ = d_c̄).□

Lemma 17

Let Y be a discrete or continuous random variable. Let A, C, and D be categorical random variables with levels 1, 2, …, K. Let p(C) be a discrete uniform distribution, and let p(A | C) and p(D | C) be preferential distributions. Then, for any exposure level a, we have that p(d_{c_a} | a) ≥ p(d_{c_a}) and p(c_a | a, d_{c_a}) ≥ p(c_a | a, d) for any d ≠ d_{c_a}.

Proof

First, note that

p(d_{c_a} | a) = Σ_c p(d_{c_a} | a, c) p(c | a) = Σ_c p(d_{c_a} | c) p(a | c) = p(d_{c_a} | c_a) p(a | c_a) + Σ_{c ≠ c_a} p(d_{c_a} | c) p(a | c) = p_D p_A + (K − 1) q_D q_A,

where the second equality uses p(c | a) = p(a | c), which holds because p(C) and p(A) are uniform (Lemma 15). Likewise, for any d ≠ d_{c_a}, note that

p(d | a) = Σ_c p(d | c) p(a | c) = p(d | c_a) p(a | c_a) + p(d | c_d) p(a | c_d) + Σ_{c ≠ c_a, c ≠ c_d} p(d | c) p(a | c) = q_D p_A + p_D q_A + (K − 2) q_D q_A.

Next, assume to the contrary that p(d_{c_a} | a) < p(d_{c_a}). Recall that p(d_{c_a}) = 1/K by Lemma 15. Then,

p_D p_A + (K − 1) q_D q_A < 1/K
⟺ p_D p_A + (K − 1) · (1 − p_D)/(K − 1) · (1 − p_A)/(K − 1) < 1/K
⟺ [K p_D p_A + 1 − p_D − p_A] / (K − 1) < 1/K
⟺ p_A (K p_D − 1) < (K − 1)/K − 1 + p_D = (K p_D − 1)/K
⟺ p_A < 1/K,

which is a contradiction.

Finally, assume to the contrary that p(c_a | a, d_{c_a}) < p(c_a | a, d) for some d ≠ d_{c_a}. Then,

p(a | c_a, d_{c_a}) p(c_a | d_{c_a}) / p(a | d_{c_a}) < p(a | c_a, d) p(c_a | d) / p(a | d)
⟺ p(a | c_a) p(c_a | d_{c_a}) p(a | d) < p(a | c_a) p(c_a | d) p(a | d_{c_a})
⟺ p(d_{c_a} | c_a) p(d | a) < p(d | c_a) p(d_{c_a} | a)
⟺ p_D (q_D p_A + p_D q_A + (K − 2) q_D q_A) < q_D (p_D p_A + (K − 1) q_D q_A)
⟺ p_D q_D p_A + p_D p_D q_A + (K − 2) p_D q_D q_A < q_D p_D p_A + (K − 1) q_D q_D q_A
⟺ p_D p_D q_A + (K − 2) p_D q_D q_A < (K − 1) q_D q_D q_A,

which is a contradiction. The third inequality follows from Bayes’ rule.□

Proof of Theorem 3

We start by establishing a relationship between RD_obs and RD_true. First, note that

p(a | d) = Σ_c p(a | d, c) p(c | d) = Σ_c p(a | c) p(c | d) ≤ Σ_c p(a | c_a) p(c | d) = p(a | c_a)

and, thus,

p(c_a | a, d) = p(a | c_a, d) p(c_a | d) / p(a | d) = p(a | c_a) p(c_a | d) / p(a | d) ≥ p(c_a | d)

and, thus,

Σ_d p(c_a | a, d) p(d) ≥ Σ_d p(c_a | d) p(d) = p(c_a).

Then,

Σ_d p(c_a | a, d) p(d) = p(c_a) + α

with α ≥ 0. Likewise, p(a | d) ≥ p(a | c) for any c ≠ c_a and, thus, p(c | a, d) ≤ p(c | d) and, thus, Σ_d p(c | a, d) p(d) ≤ p(c) = p(c_a). The last equality follows from the assumption that p(C) is uniform. Moreover, the left side of the last inequality is constant for all c ≠ c_a by Lemmas 15 and 16. Then,

Σ_d p(c | a, d) p(d) = p(c_a) − β

for any c ≠ c_a. Now, note that

p(c_a) + α + (K − 1)(p(c_a) − β) = Σ_d p(c_a | a, d) p(d) + Σ_{c ≠ c_a} Σ_d p(c | a, d) p(d) = Σ_d p(d) p(c_a | a, d) + Σ_d p(d) Σ_{c ≠ c_a} p(c | a, d) = Σ_d p(d) Σ_c p(c | a, d) = 1,

which implies that β = α/(K − 1), because p(c_a) = 1/K by the assumption that p(C) is uniform. Moreover, Lemmas 15 and 16 imply that

Σ_d p(c_ā | ā, d) p(d) = p(c_ā | ā, d_{c_ā}) p(d_{c_ā}) + Σ_{d ≠ d_{c_ā}} p(c_ā | ā, d) p(d) = p(c_a | a, d_{c_a}) p(d_{c_a}) + Σ_{d ≠ d_{c_a}} p(c_a | a, d) p(d) = Σ_d p(c_a | a, d) p(d) = p(c_a) + α.

Lemmas 15 and 16 also imply that

Σ_d p(c̄ | ā, d) p(d) = Σ_d p(c | a, d) p(d) = p(c_a) − β

for any c ≠ c_a and c̄ ≠ c_ā. Consequently,

(6) RD_obs = Σ_d E[Y | a, d] p(d) − Σ_d E[Y | ā, d] p(d) = Σ_d Σ_c E[Y | a, d, c] p(c | a, d) p(d) − Σ_d Σ_c E[Y | ā, d, c] p(c | ā, d) p(d) = Σ_d Σ_c E[Y | a, c] p(c | a, d) p(d) − Σ_d Σ_c E[Y | ā, c] p(c | ā, d) p(d)