In this study, we reviewed six approaches for measuring agreement on repeated binary outcomes and applied them to our illustrative dataset. The methods are suited to different situations. In Table 8, we list the details and main characteristics of each method and indicate when we recommend using it. The six approaches can be classified by their model assumptions into three categories: (1) fully parametric models, which are based on linear mixed models and estimated by maximum likelihood, such as the ICC and the weighted CCC based on variance components; (2) semi-parametric models, which are based on GEE, such as logistic regression modeling of the agreement proportion and the extended kappa statistic based on two-stage logistic regression; and (3) nonparametric models, which are based on U-statistics, such as the extended kappa and the CCC based on U-statistics. Sample programs can be found at http://web1.sph.emory.edu/observeragreement/ and can be requested from the authors.

Table 8 Comparison of six methods on assessing agreement of longitudinal binary data.

In practice, estimation of the agreement proportion has been widely used in medical research [25–27]. Besides the unstructured correlation, other correlation structures such as AR(1) and CS can also be considered. Furthermore, important covariates that may affect agreement proportions can be included in the logistic model. However, this method does not overcome the known limitations of raw observed agreement, such as the tendency for percent agreement to be high whenever the frequency of a particular diagnostic category is very low or very high. Therefore, estimates of kappa, which take chance agreement into account, may be preferable to the agreement proportion method.
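The effect of extreme prevalence on percent agreement can be illustrated with a small calculation. The following is a minimal sketch that uses the standard Cohen's kappa for a single 2×2 table, not the longitudinal extensions discussed in this article. The function name and the cell counts are hypothetical and were chosen only to show a rare outcome.

```python
def agreement_and_kappa(a, b, c, d):
    """Observed agreement and Cohen's kappa for a 2x2 table.

    a: both sources positive      b: source 1 positive, source 2 negative
    c: source 1 negative, source 2 positive      d: both sources negative
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    p1 = (a + b) / n  # marginal proportion positive, source 1
    p2 = (a + c) / n  # marginal proportion positive, source 2
    # Agreement expected by chance if the two sources were independent.
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Hypothetical counts for a rare symptom (5% positive in each source).
p_observed, kappa = agreement_and_kappa(a=1, b=4, c=4, d=91)
print(f"percent agreement = {p_observed:.2f}, kappa = {kappa:.3f}")
```

With these illustrative counts, percent agreement is 0.92 while kappa is only about 0.16, because almost all of the observed agreement comes from the many jointly negative records.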

The two extended kappa-based methods serve as good alternatives to modeling only the agreement proportion. In Lipsitz’s logistic regression model for chance-corrected agreement, the offset term ensures that agreement due to chance is properly accounted for when attempting to identify covariates that are predictive of agreement. In this method, any data point observed in only one source (the in-clinic record or the diary) was excluded. Although results adjusted for covariates are not presented, covariates may also be included in the two-stage logistic model. Besides the GEE method, which accommodates the correlations among different time points for the same subject, random effects models can be applied. On the other hand, Ma et al. [12] developed an approach to address missing data when modeling multiobserver kappa within a longitudinal study setting. In particular, they extended the MMDP assumption for longitudinal data analysis involving a single response to a bivariate setting and integrated the inverse probability weighting approach with the theory of U-statistics. The missingness in “redness” follows MMDP and MCAR. Comparing the two extended kappa coefficients, the model-based estimate requires more assumptions, because the model must be correctly specified, whereas the U-statistics approach is nonparametric.

The CCC based on U-statistics can be applied as an extension of Lin’s CCC to responses measured repeatedly over time, or clustered by some other design. This method can handle any number of repeated measurements, and the variance can be estimated in a straightforward manner by U-statistics methodology. However, the CCC based on U-statistics is not applicable when any data are missing or when the design is unbalanced. This means that if one data point is missing for either method at any time point, the whole subject is removed from the dataset. On the other hand, the ICC and the CCC based on variance components were built on the random effects model described in eq. [3]. By eq. [4], the ICC can be considered a special case (the unweighted version) of the weighted CCC in which **D** is an identity matrix. In addition, we found that the standard error (SE) is substantially higher for both CCC_U methods than for the ICC and CCC_VC approaches. It is possible that the ICC and CCC_VC underestimate the SE, in which case a smaller SE would indicate worse rather than better performance. This finding is consistent with Carrasco et al. [28]. Thus the VC approach, or more properly the maximum likelihood approach, may give a smaller SE, but this does not always mean “better performance”. Finally, the ICC and CCC_VC can accommodate multiple raters and possible covariates in the model.
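Two points from this paragraph can be made concrete in a short sketch: the complete-case restriction, under which a subject with any missing value is dropped, and the form of Lin’s CCC for a single pair of measurement series. This is not the repeated-measures U-statistics estimator itself. The data layout (a dictionary mapping each subject to diary and clinic lists, with `None` marking a missing value), the function names, and the example values are all assumptions made for illustration.

```python
def complete_cases(records):
    """Keep only subjects with no missing value (None) in either method at any visit."""
    return {
        subject: (diary, clinic)
        for subject, (diary, clinic) in records.items()
        if all(v is not None for v in diary) and all(v is not None for v in clinic)
    }


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired series.

    Uses moment estimates with divisor n:
    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both series are constant and equal")
    return 2 * cov_xy / denom


# Hypothetical records: subject -> (diary values per visit, clinic values per visit).
records = {
    "s1": ([1, 0, 1], [1, 0, 1]),
    "s2": ([1, None, 0], [1, 0, 0]),  # one missing diary entry: whole subject dropped
    "s3": ([0, 0, 1], [0, 1, 1]),
}
kept = complete_cases(records)
print(sorted(kept))  # subjects retained for a complete-case analysis

# Hypothetical binary responses from six subjects at a single visit.
diary = [1, 1, 0, 0, 1, 0]
clinic = [1, 0, 0, 0, 1, 0]
print(round(lins_ccc(diary, clinic), 3))
```

In the first example a single missing entry removes subject `s2` entirely, which shows how quickly a complete-case requirement can shrink the analysis set when there are many visits.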

As explained in Section 3.4, one of the approaches that Ridout et al. [21] evaluated was the ICC based on the linear mixed model (ANOVA) with the binary data coded as 0 and 1. In a more recent article on agreement, Zou and Donner [29] estimated the ICC for binary data in the same way. Indeed, it is not uncommon to use a linear mixed model to estimate variance components for binary data. In estimating those variance components, the categorical or binary nature of the response is usually ignored, and the analysis is carried out using ANOVA or mixed models. The rule of thumb generally applied is that ANOVA is reasonably accurate as long as the proportions in each category of the response are not extreme. The variance components approach in Carrasco et al. [13] is similar to that of these articles, because the fixed effects and variance components are estimated from the linear mixed model by restricted maximum likelihood (REML), which gives the same estimates as ANOVA in the case of a balanced design.
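To illustrate what treating 0/1 responses as continuous involves, the following minimal sketch computes the classical one-way ANOVA estimator of the ICC for a balanced design. The one-way layout is a simplifying assumption made here for illustration; the model in eq. [3] contains additional terms. The function name and the example data are hypothetical.

```python
def anova_icc(ratings):
    """One-way ANOVA estimator of the ICC for a balanced design.

    ratings: list of per-subject lists, each holding the same number k of
    responses (here binary, coded 0/1 and treated as continuous).
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two subjects are required")
    k = len(ratings[0])
    if k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("design must be balanced with k >= 2 responses per subject")

    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]

    # Between-subject and within-subject sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((v - m) ** 2 for r, m in zip(ratings, subject_means) for v in r)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all responses are identical")
    return (ms_between - ms_within) / denom


# Hypothetical data: six subjects, each with a (diary, clinic) pair coded 0/1.
pairs = [[1, 1], [1, 0], [0, 0], [0, 0], [1, 1], [0, 0]]
print(round(anova_icc(pairs), 3))
```

Nothing in the computation uses the fact that the responses are binary, which is why the rule of thumb about non-extreme proportions matters when interpreting the result.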

In summary, all six statistical methods give comparable estimates, indicating good agreement on redness and moderate to good agreement on tenderness between participants’ diaries and their in-clinic records. However, each approach has its own pros and cons under different situations. This article provides several alternatives for assessing agreement with longitudinal binary data. We hope it can help researchers choose the most appropriate method to assess agreement in their own studies.
