In many observational data settings focused on the regression relationship between an exposure variable X and a disease outcome variable Y, the exposure variable is prone to measurement error. That is, the available data are realizations of (X*, Y) rather than (X, Y), where X* is a noisy surrogate for X. There is a substantial literature on measurement error problems, particularly involving nondifferential measurement error, whereby X* and Y are conditionally independent given X. Two typical hallmarks of nondifferential measurement error are (i) attenuation and (ii) power loss. Attenuation, or bias toward the null, refers to a regression coefficient describing the (X*, Y) association being smaller in magnitude than the corresponding (X, Y) coefficient. Consequently, reporting inference based on (X*, Y) data as targeting the (X, Y) relationship of interest is a biased procedure. Power loss refers to (X*, Y) data yielding less power than (X, Y) data to detect association.
The maxim of nondifferential exposure measurement error leading to attenuation is widely appealed to in applied work; Jurek et al. (2008) refer to it as “a well-known heuristic in epidemiology.” As documented in Jurek et al. (2006), it is not uncommon to see inference on (X*, Y) data reported without any quantitative adjustment for measurement error, but with a claim that this is a conservative procedure. The rationale for such a claim is that if an adjustment were undertaken, it would push the point estimate away from the null, in order to mitigate the attenuation. Against this, however, there are a number of cautionary papers in the literature. These point to exceptional circumstances where the maxim of attenuation can fail. For instance, see Dosemeci et al. (1990); Wacholder (1995); Carroll (1997); Jurek et al. (2005).
In specific contexts where power loss applies, there is recent work on quantifying the magnitude of the loss (Buonaccorsi et al. 2011; Vanderweele 2012). However, circumstances under which power loss is not guaranteed do not seem to have been investigated. This paper focuses on the situation where X is ordinal with more than two categories. We show that for a general test of association, nondifferential misclassification of X is indeed guaranteed to reduce power. However, another commonly used test when X is ordinal is a test of linear trend. This test is seen to constitute an exceptional circumstance, as nondifferential misclassification of X can in fact increase the power to detect a trend.
The exposure X is assumed ordinal with k levels, represented as x = 1, …, k. The discussion applies to disease variables Y of various numerical types, for example, binary, count, continuous. We take θ = (p, μ, σ²) to describe aspects of the joint distribution of (X, Y), according to p_i = Pr(X = i), μ_i = E(Y | X = i), and σ_i² = Var(Y | X = i), for i = 1, …, k. Exposure misclassification is manifested in terms of X* being observed, rather than X itself. Nondifferential misclassification is assumed throughout this paper, whereby X* and Y are conditionally independent given X. Thus, the misclassification is described by a single k × k classification matrix P, with entries P_{ij} = Pr(X* = j | X = i).
A given P induces a joint distribution of (X*, Y) from the joint distribution of (X, Y). The induced distribution can be described in similar terms as the original. For instance, p*_j = Pr(X* = j) is given by

p*_j = Σ_{i=1}^k p_i P_{ij}.

Similarly, μ*_j = E(Y | X* = j) is determined by

μ*_j = Σ_{i=1}^k Q_{ji} μ_i,

where Q_{ji} = Pr(X = i | X* = j), hence,

Q_{ji} = p_i P_{ij} / p*_j.

The variance of (Y | X* = j) is

σ*_j² = Σ_{i=1}^k Q_{ji} (σ_i² + μ_i²) − μ*_j²,

or, more succinctly,

σ*_j² = E{Var(Y | X) | X* = j} + Var{E(Y | X) | X* = j}.
Thus, θ = (p, μ, σ²), which describes the joint distribution of (X, Y), maps to θ* = (p*, μ*, σ*²), which describes the joint distribution of (X*, Y). As a minor comment, if we assume normality of (Y | X), then θ completely characterizes the joint distribution of (X, Y). However, it does not follow that θ* completely characterizes the joint distribution of (X*, Y), as normality of (Y | X) induces a mixture of normal distributions for (Y | X*).
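To make the mapping from (p, μ, σ²) for (X, Y) to (p*, μ*, σ*²) for (X*, Y) concrete, here is a minimal numerical sketch; the function and variable names are ours, not from the paper:

```python
import numpy as np

def misclassified_params(p, mu, sigma2, P):
    """Map (p, mu, sigma^2) for X to (p*, mu*, sigma*^2) for X*.

    p[i] = Pr(X=i), mu[i] = E(Y|X=i), sigma2[i] = Var(Y|X=i),
    P[i, j] = Pr(X* = j | X = i).
    """
    p_star = p @ P                           # Pr(X* = j)
    Q = (p[:, None] * P) / p_star[None, :]   # Q[i, j] = Pr(X = i | X* = j)
    mu_star = mu @ Q                         # E(Y | X* = j)
    # Var(Y | X* = j) = E[Y^2 | X* = j] - mu*_j^2
    sigma2_star = (sigma2 + mu ** 2) @ Q - mu_star ** 2
    return p_star, mu_star, sigma2_star
```

With P the identity matrix the parameters are returned unchanged; heavier mixing pulls the μ*_j toward one another while inflating the conditional variances.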
Generally, we expect less extreme exposure misclassification to result in less damage, when inferences based on the (X*, Y) data are interpreted directly as applying to the (X, Y) relationship. To quantify the extent to which the exposure classification is nicely behaved, P is termed monotone if, for each i, P_{i,i+j} and P_{i,i−j} are both decreasing in j for j ≥ 0. That is, for each i, the distribution of (X* | X = i) is unimodal with mode at X* = i. Thus, monotonicity corresponds to a less good classification being less likely. A non-monotone P clearly corresponds to rather pathological misclassification.
Following Ogburn and Vanderweele (2013), if both P and its transpose P' are monotone, then P is said to be tapered, i.e., the classification probabilities taper off as one moves away vertically or horizontally from the diagonal of P. Hence, tapered misclassification corresponds to better behavior than merely monotone misclassification. We can also consider whether P is a banded matrix. For instance, a tridiagonal P rules out classifications that are off by more than one category. In qualitative terms, a P which is both tapered and tridiagonal might be regarded as the least threatening form of exposure misclassification. Another nice property for P to possess is min_i P_{ii} > 1/2, i.e., for all exposure levels correct classification is more likely than incorrect classification.
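These matrix properties are easy to check numerically. The following sketch (helper names of our own choosing) tests row-wise monotonicity, and taper as monotonicity of both P and its transpose:

```python
import numpy as np

def is_monotone(P):
    """Each row of P is unimodal with mode at the diagonal: entries
    are non-increasing as one moves away from P[i, i] in either direction."""
    k = P.shape[0]
    for i in range(k):
        left = P[i, :i + 1]   # entries up to and including the diagonal
        right = P[i, i:]      # diagonal entry and beyond
        if np.any(np.diff(left) < 0) or np.any(np.diff(right) > 0):
            return False
    return True

def is_tapered(P):
    """Tapered: both P and its transpose are monotone."""
    return is_monotone(P) and is_monotone(P.T)
```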
3 Hypothesis testing
Say that interest is focused on testing a null hypothesis of the form H0: Cμ = 0, for an appropriate matrix C, against a general alternative. If correctly classified data are available, in the form of n independent and identically distributed realizations of (X, Y), then a test can proceed as follows. Let Ȳ = (Ȳ_1, …, Ȳ_k)' be the vector of response means for the k exposure groups, such that if n_i of the n subjects have X = i, then, at least approximately,

Ȳ ~ N(μ, V),

where V = diag(σ_1²/n_1, …, σ_k²/n_k). For the sake of clear exposition, we proceed on the basis that σ² = (σ_1², …, σ_k²), and hence V, is known. In practice σ² would be estimated, either directly on the basis of within-group variation in responses (e.g., if Y is continuous) or on the basis of a postulated mean–variance relationship.
Starting from the approximate normal distribution of Ȳ above, standard linear model theory gives the test statistic

T = (CȲ)' (C V C')^{-1} (CȲ)

as central chi-square distributed under the null, but noncentral chi-square distributed under the alternative, with noncentrality parameter μ' C' (C V C')^{-1} C μ and degrees of freedom equal to the rank of C. We gain insight by replacing the realized group sizes n_i with the expected group sizes n p_i, for i = 1, …, k. Then the noncentrality parameter becomes n λ(p, μ, σ²), where

λ(p, μ, σ²) = μ' C' {C D(σ²/p) C'}^{-1} C μ,

in which D(σ²/p) is the diagonal matrix with entries σ_1²/p_1, …, σ_k²/p_k. With the form of λ() in hand, the question of whether or not power is lost as a result of misclassification is simply abstracted as the question of whether or not λ(p*, μ*, σ*²) < λ(p, μ, σ²).
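A direct implementation of the per-observation noncentrality λ(p, μ, σ²) = μ' C' {C D(σ²/p) C'}^{-1} C μ is straightforward; the sketch below uses names of our own choosing:

```python
import numpy as np

def noncentrality(p, mu, sigma2, C):
    """Per-observation noncentrality: mu' C' {C D C'}^{-1} C mu,
    where D = diag(sigma2 / p)."""
    D = np.diag(np.asarray(sigma2, float) / np.asarray(p, float))
    Cmu = C @ mu
    return float(Cmu @ np.linalg.solve(C @ D @ C.T, Cmu))
```

For example, with C the matrix of adjacent-mean differences (so that Cμ = 0 is the equal-means null), a constant μ yields noncentrality zero, and any non-constant μ yields a positive value.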
4 General test for association
A general test for exposure–disease association would focus on the null hypothesis μ_1 = … = μ_k. This null, concerning the distribution of (X, Y), implies the corresponding null for (X*, Y), i.e., μ*_1 = … = μ*_k. In the case of binary exposure (k = 2), it has been known since at least Bross (1954) that nondifferential misclassification induces power loss. However, the literature on misclassification with k > 2 exposure categories has focused more on estimation than on testing [see, for instance, Dosemeci et al. (1990); Weinberg et al. (1994)]. In the present context, at least for the homoscedastic case that σ_i² = σ² is constant, we can indeed confirm that power is lost as a result of having misclassified (X*, Y) data rather than correctly classified (X, Y) data. The following result applies for any misclassification matrix P. In fact, the proof makes no use of the fact that X is ordered, so the result applies to categorical X more generally.
Theorem 1. Consider C such that the null hypothesis Cμ = 0 corresponds to equal mean outcomes for all exposure levels. In the homoscedastic situation of equal variances (σ_i² = σ² for i = 1, …, k),

λ(p*, μ*, σ*²) ≤ λ(p, μ, σ²).
That is, nondifferential misclassification cannot increase the power for detecting a relationship between mean outcome and exposure level.
Proof. Direct calculation gives

λ(p, μ, σ²) = σ^{-2} Σ_{i=1}^k p_i (μ_i − μ̄)²,

with μ̄ = Σ_{i=1}^k p_i μ_i. Consequently, since λ(p*, μ*, σ*²) is the minimum over c of Σ_j (p*_j / σ*_j²)(μ*_j − c)², and since σ*_j² ≥ σ² for every j,

λ(p*, μ*, σ*²) ≤ σ^{-2} Σ_{j=1}^k p*_j (μ*_j − μ̄)².

Now, by the Cauchy–Schwarz inequality, with Q_{ji} = Pr(X = i | X* = j),

(μ*_j − μ̄)² = {Σ_{i=1}^k Q_{ji} (μ_i − μ̄)}² ≤ Σ_{i=1}^k Q_{ji} (μ_i − μ̄)²,

and since p*_j Q_{ji} = p_i P_{ij},

σ^{-2} Σ_{j=1}^k p*_j (μ*_j − μ̄)² ≤ σ^{-2} Σ_{i=1}^k p_i (μ_i − μ̄)² = λ(p, μ, σ²). ∎
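Theorem 1 can also be spot-checked numerically. The sketch below (our own code, not the paper's) draws random (p, μ, P) scenarios with homoscedastic (X, Y) data, maps them through the misclassification, and confirms that the association-test noncentrality never increases:

```python
import numpy as np

rng = np.random.default_rng(1)

def assoc_ncp(p, mu, sigma2):
    """Per-observation noncentrality of the general association test:
    min over c of sum_i (p_i / sigma2_i) * (mu_i - c)^2."""
    w = p / sigma2
    c = (w @ mu) / w.sum()          # the weighted mean minimizes the quadratic
    return float(w @ (mu - c) ** 2)

k, sigma2 = 4, 1.0                  # homoscedastic (X, Y) data, as in Theorem 1
for _ in range(500):
    p = rng.dirichlet(np.ones(k))
    mu = rng.normal(size=k)
    P = rng.dirichlet(np.ones(k), size=k)           # arbitrary classification matrix
    p_star = p @ P
    Q = (p[:, None] * P) / p_star[None, :]          # Pr(X = i | X* = j)
    mu_star = mu @ Q
    s2_star = sigma2 + (mu ** 2) @ Q - mu_star ** 2  # sigma*_j^2 >= sigma^2
    assert assoc_ncp(p_star, mu_star, s2_star) <= assoc_ncp(p, mu, np.full(k, sigma2)) + 1e-9
```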
5 Test for trend across categories
As will be demonstrated momentarily, a version of Theorem 1 does not hold for the linear trend test. This test, versions of which were first proposed by Cochran (1954) and Armitage (1955), uses the linear contrast of the exposure scores 1, …, k to define the linear component of the relationship between μ_i and i. A direct interpretation of the test is that the null holds if and only if the least-squares slope when μ_i is regressed upon i is zero.
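The per-observation noncentrality of the trend test reduces to a weighted-least-squares slope calculation, sketched below with scores 1, …, k (helper name and weighting of our own devising, with weights w_i = p_i / σ_i²):

```python
import numpy as np

def trend_ncp(p, mu, sigma2, x=None):
    """Per-observation noncentrality of the linear trend test:
    {sum_i w_i (x_i - xbar) mu_i}^2 / sum_i w_i (x_i - xbar)^2,
    with weights w_i = p_i / sigma2_i and weighted mean score xbar."""
    k = len(p)
    x = np.arange(1, k + 1, dtype=float) if x is None else np.asarray(x, float)
    w = p / np.asarray(sigma2, float)
    xbar = (w @ x) / w.sum()
    num = w @ ((x - xbar) * mu)
    return float(num ** 2 / (w @ (x - xbar) ** 2))
```

A constant μ gives noncentrality zero, while a μ exactly linear in the scores recovers the slope-based value directly.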
To gain a general sense of how this test is affected by misclassification, we compare λ(p, μ, σ²) to λ(p*, μ*, σ*²) for a large ensemble of p, μ, and P values. Particularly, we fix k = 4 exposure categories and also fix σ_i² = σ² = 1. Then values of (p, μ) are drawn from 12 different probability distributions. The first three distributions arise from fixing p = (1/4, 1/4, 1/4, 1/4), thinking of a uniform distribution for X as being a simple but important special case. A distribution on μ is assigned by fixing μ_1 = 0 and μ_k = 1 without loss of generality, and then taking the increments (μ_2 − μ_1, …, μ_k − μ_{k−1}) to be distributed as Dirichlet(c, …, c), so that μ is increasing and μ_k − μ_1 is fixed. In our experiments we take c small enough to allow the possibility of far from linear patterns (as c increases the distribution puts more weight on relationships which are closer to linear). Then the three distributions are based on sampling P from a specified distribution over tridiagonal classification matrices, and conditioning on either (i) P being non-monotone, (ii) P being monotone but not tapered, or (iii) P being tapered. The distribution on P generates tridiagonal matrices by independently drawing (P_{11}, P_{12}) ~ Dirichlet(a, b), (P_{i,i−1}, P_{ii}, P_{i,i+1}) ~ Dirichlet(b/2, a, b/2) for i = 2, …, k − 1, and (P_{k,k−1}, P_{kk}) ~ Dirichlet(b, a). The values of a and b used give a mean of 0.667 and a standard deviation of 0.085 for each diagonal element of P, before conditioning. This ensures that a wide range of classification matrices is being generated, with the typical extent of misclassification being quite considerable.
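As a concrete sketch of a tridiagonal sampler along the lines just described, the following takes a = 20 and b = 10, which reproduce the stated mean 0.667 and standard deviation 0.085 for the diagonal entries; the exact parameter values and row construction are not displayed in the text, so treat these as assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 20.0, 10.0   # assumed: implies Beta(20, 10) diagonals, mean 0.667, sd ~0.085

def sample_tridiagonal_P(k, rng):
    """Draw a random tridiagonal classification matrix, row by row.
    Interior rows are Dirichlet(b/2, a, b/2) on categories (i-1, i, i+1);
    the first and last rows split their mass between two categories only."""
    P = np.zeros((k, k))
    P[0, :2] = rng.dirichlet([a, b])
    for i in range(1, k - 1):
        P[i, i - 1:i + 2] = rng.dirichlet([b / 2, a, b / 2])
    P[k - 1, -2:] = rng.dirichlet([b, a])
    return P

P = sample_tridiagonal_P(4, rng)
```

Draws can then be filtered into the non-monotone, monotone-but-not-tapered, and tapered groups by rejection.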
The next three distributions arise exactly as per the first three, except p is now also taken as random, with p ~ Dirichlet(d, d, d, d), so that each p_i has mean 1/4. This allows departures from X being uniformly distributed (which would correspond to p_i = 1/4 for all i). We wish to take d small enough to engender substantial departures from uniformity, but not so small as to yield distributions that place almost no mass on one or more values. Thus d is set so as to induce a standard deviation of 0.053 for each p_i.
Results arising from these six distributions appear in Figure 1. For both the general test and the trend test, results are given in terms of relative power, defined as the power achieved from misclassified data at the sample size for which ideal data yield 80% power. That is, if F_t denotes the distribution function for the relevant noncentral chi-square distribution with noncentrality parameter t, and q is the null critical value, then the relative power is 1 − F_{n λ(p*, μ*, σ*²)}(q), where n solves 1 − F_{n λ(p, μ, σ²)}(q) = 0.8. For the general test, Theorem 1 guarantees that the relative power cannot exceed 80%. Figure 1 shows contrary behavior for the trend test. At least when the distribution of X can depart from uniformity, there are values of P (of all three types) and (p, μ) for which misclassification induces power gain. Of course these results are based on first-order asymptotic theory; in Appendix A we present a small simulation indicating that the theory does indeed translate to the finite-sample case. Also, while our simulated scenarios never produced a power gain when X is uniformly distributed, in Appendix B we exhibit a μ and a tridiagonal, tapered P such that there is power gain in this case.
The remaining six distributions for (p, μ, P) are constructed as per the first six described above, but with the underlying distribution of P modified so as to generate non-banded matrices. This is achieved by taking the rows of P to be independently distributed as Dirichlet over all k categories. To mimic the earlier construction, for each row the parameters of the Dirichlet distribution are taken to be a for the correct category and to sum to b over the incorrect categories, implying the same distribution for diagonal elements as previously. For all but the first and last rows, sub-sums of b/2 are forced, both below and above the correct category. The specification is completed by fixing a value of r such that the Dirichlet parameters for a given row decay geometrically by a factor of r as we move further away from the correct category. A single fixed value of r is used in generating the results of Figure 2. Qualitatively, we see very similar behavior as before with tridiagonal misclassification.
To seek further insight into when misclassification increases power, we focus on the situation of non-uniform X distributions and tapered misclassification matrices, i.e., the settings underlying the lower right panels of both Figures 1 and 2. For these simulated values, in Figure 3 we first plot the relative power of the trend test against the absolute ratio of the (X*, Y) slope to the (X, Y) slope. Values of this ratio larger than one correspond to the misclassification inflating the slope which summarizes the linear component of the exposure–disease relationship. In situations involving linear regression of a continuous outcome on a continuous exposure, nondifferential exposure measurement error is quite widely guaranteed to attenuate the regression slope being estimated [see, for instance, Gustafson (2004); Carroll et al. (2006)]. Figure 3 shows that such a guarantee does not apply to the present situation, but it is perhaps not surprising that attenuation is much more common than inflation across the simulated scenarios. What is more surprising is that slope inflation is neither a necessary nor a sufficient condition for power gain. That is, in some scenarios the misclassification inflates the slope, yet power is lost for detecting that the slope is not zero. And in some situations, the misclassification attenuates the slope, yet power is gained for detecting that the slope is not zero. So the behavior of the slopes alone does not determine whether power is lost or gained.
Next in Figure 3 we plot the relative power against min_i P_{ii}, to see whether the worst probability of correct classification across exposure categories is a driving force behind the power of the test based on misclassified data. The plots provide a negative answer here, indicating very little association, if any, between this characteristic of the classification matrix and the relative power.
Finally, we consider the extent to which misclassification pushes the exposure distribution toward or away from uniformity. We define the log sum-of-squares ratio (LSSR) as

LSSR = log{ Σ_{j=1}^k (p*_j − 1/k)² / Σ_{i=1}^k (p_i − 1/k)² }.

So, for instance, a large positive value of LSSR describes a case where the misclassification induces a distribution of X* that is much further from uniform than the distribution of X. Figure 3 exhibits quite strong negative associations between relative power and LSSR, in both the tridiagonal and non-banded cases. Thus scenarios with power gain tend to have misclassification which makes X* much more uniformly distributed than X. But again this is only a stochastic tendency with respect to the chosen distribution of (p, μ, P). A low LSSR is neither a necessary nor a sufficient condition for power gain. In general, there seems to be considerable complexity in how the distribution of X, the distribution of (Y | X), and the misclassification matrix P combine to determine the performance of the trend test applied to misclassified data.
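Assuming LSSR is the log of the ratio of sums of squared deviations from uniformity (for X* relative to X), it is one line of code:

```python
import numpy as np

def lssr(p, p_star):
    """Log sum-of-squares ratio: log of the summed squared deviations
    from uniform for the X* distribution, relative to the X distribution."""
    u = 1.0 / len(p)
    return float(np.log(np.sum((p_star - u) ** 2) / np.sum((p - u) ** 2)))
```

Mixing toward uniformity yields negative values; for example, shrinking every deviation from 1/k by half gives LSSR = log(1/4).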
The idea that nondifferential exposure misclassification causes attenuation in estimating exposure–disease relationships is well established in the literature, to the point that concerns are raised about this maxim being quoted in situations beyond its domain of applicability. The related idea that nondifferential exposure misclassification causes power loss for detecting exposure–disease association is also present in the literature, albeit with less emphasis. Of course, the situation is more nuanced with hypothesis testing than with estimation, since typically nondifferential exposure misclassification does not invalidate testing the way it invalidates estimation. That is, treating (X*, Y) data “as-is” results in valid hypothesis testing for association, but biased estimation of the magnitude of this association (when non-null). Consequently, comparing the power of a test using (X*, Y) data to that of a test using (X, Y) data is a natural quantification of the impact of misclassification.
Just as exceptions to the attenuation maxim have been noted, here we have exhibited a situation where power loss is not guaranteed. In the ordinal exposure setting, nondifferential misclassification is guaranteed to reduce power for the general test of association, but not for the test of linear trend. We have not been able to give a deterministic characterization of the exposure distributions p, mean structures μ, and classification matrices P for which power is actually gained via misclassification. Indeed, our numerical results point to these three entities combining in a complicated manner to determine the extent to which power is lost, or perhaps gained. Looking for the parameter settings under which power gain occurs seems somewhat akin to looking for needles in a haystack. Thus, there is no “free lunch”: it would be foolish to deliberately add artificial misclassification to exposure data in the hope of increasing the power of a given study. Rather, we simply think it important to add the potential for power gain with this test to the extant list of caveats concerning heuristics for exposure measurement error.
Settings where the noisy exposure X* is much more uniformly distributed than the true exposure X are seen to be more prone to exhibiting power gain. Upon reflection, there may be some intuitive sense in this. Generally, a wider distribution of a regressor imbues more precise estimation of a regression slope, and this precision contributes to the power of the association test. However, in the linear regression setting, the benefit of any increased variance of X* compared to X is more than offset by the attenuation in the slope itself and by the increase in residual variance of (Y | X*) compared to (Y | X). In the present setting of the trend test for an ordinal exposure, we still lack a complete understanding of why these offsetting mechanisms are not always in full force.
In the setting of a tapered and tridiagonal classification matrix, the highest relative power obtained in the simulation study was 92.7%, arising from a particular simulated choice of (p, μ) and P. Using these settings, we repeatedly generate 5,000 independent datasets, each comprised of n observations on (X, X*, Y), under both the alternative (with the values of (p, μ) given above) and under the null (with μ_1 = … = μ_k). Note that while n itself is large, for this choice of p the expected number of subjects in the least probable exposure group is quite small. The simulation is carried out using a value of n which is found empirically to yield about 80% power for the trend test using (X, Y) data. For each simulated dataset, a likelihood-ratio test is implemented using the trend test statistic compared to the chi-square distribution with one degree of freedom, with the standard error for each group mean “plugged in” as if it were known. For simulation under the null, empirical type I error rates are 5.5% for (X, Y) data and also 5.5% for (X*, Y) data, with Monte Carlo standard error of 0.3% in each case, i.e., the rates are within simulation error of nominal. For simulation under the alternative, empirical power is 78.5% for (X, Y) data and 92.3% for (X*, Y) data, with a Monte Carlo standard error of 0.6% for the difference between the two powers. Thus, we have a finite-sample demonstration that the test is valid under the null for both X data and X* data, but the latter yields higher power under the alternative.
With the values of p, μ, and P given above, we obtain λ(p*, μ*, σ*²) > λ(p, μ, σ²). Thus with a uniform distribution over X and a tapered, tridiagonal classification matrix it is possible for power to increase upon misclassification.
Buonaccorsi, J. P., Laake, P., and Veierød, M. B. (2011). On the power of the Cochran–Armitage test for trend in the presence of misclassification. Statistical Methods in Medical Research.
Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd Edition. Boca Raton, FL: Chapman & Hall/CRC Press.
Dosemeci, M., Wacholder, S., and Lubin, J. H. (1990). Does nondifferential misclassification of exposure always bias a true effect toward the null value? American Journal of Epidemiology, 132:746–748.
Gustafson, P. (2004). Measurement Error and Misclassification in Statistics and Epidemiology: Impact and Bayesian Adjustments. Boca Raton, FL: Chapman & Hall/CRC Press.
Jurek, A. M., Greenland, S., and Maldonado, G. (2008). Brief report: How far from non-differential does exposure or disease misclassification have to be to bias measures of association away from the null? International Journal of Epidemiology, 37:382–385.
Jurek, A. M., Greenland, S., Maldonado, G., and Church, T. R. (2005). Proper interpretation of non-differential misclassification effects: expectations vs observations. International Journal of Epidemiology, 34:680–687.
Jurek, A. M., Maldonado, G., Greenland, S., and Church, T. R. (2006). Exposure-measurement error is frequently ignored when interpreting epidemiologic study results. European Journal of Epidemiology, 21:871–876.
Ogburn, E. L., and Vanderweele, T. J. (2013). Bias attenuation results for nondifferentially mismeasured ordinal and coarsened confounders. Biometrika, 100:241–248.
Weinberg, C. R., Umbach, D. M., and Greenland, S. (1994). When will nondifferential misclassification of an exposure preserve the direction of a trend? (with discussion). American Journal of Epidemiology, 140:565–571.