In epidemiologic and medical science, many statistical approaches have been proposed for assessing agreement among different observers or measurement methods. For categorical measurements, Cohen’s kappa and weighted kappa are the most popular indices of agreement. For quantitative data, a widely used unscaled agreement index is the limits of agreement proposed by Bland and Altman. More recently, the most commonly used scaled agreement coefficients are the intraclass correlation coefficient (ICC) [4–7] and the concordance correlation coefficient (CCC).
In many modern-day applications, data are clustered, which makes inference difficult. In addition, longitudinal studies in which each observer records repeated observations on the same subject at different time points have become increasingly common. Our objective is to review and summarize the available methods for assessing agreement of repeated binary outcomes. This review can serve as a useful tool whenever agreement needs to be assessed from repeated binary measurements under different scenarios. To illustrate the methods for obtaining agreement coefficients for repeated binary outcomes, we use data from the CDC anthrax vaccine adsorbed (AVA) human clinical trial, which we describe in the following section. Details of each statistical method are presented in Section 3. Results of applying each method to data on redness and tenderness at the site of vaccination from the CDC AVA human clinical trial are shown in Section 4. A summary and discussion follow in Section 5.
2 Motivating example
The CDC AVA human clinical trial was conducted during 2002–2005 with participants enrolled and followed at five major U.S. vaccine research centers. This was a Phase 4, multicenter, randomized, placebo-controlled clinical trial to evaluate route change [subcutaneous (SQ) to intramuscular (IM)] and dose reduction (priming schedule of 0, 2, 4 weeks and 6, 12, 18 months and an annual booster vs reduced priming schedule of 0, 4 weeks and 6 months and a triennial booster). At each site, participants were randomly assigned to one of seven arms (TRT-8SQ, TRT-8IM, TRT-7IM, TRT-5IM, TRT-4IM, CNT-8IM, and CNT-8SQ) based on treatment (AVA vs saline placebo), route (SQ vs IM), and full or reduced AVA schedule (full: 0.5 mL doses at 0, 2, and 4 weeks and at 6, 12, 18, 30, and 42 months; reduced: one or more doses replaced by placebo). Details of the clinical trial and the results of the interim analysis of data collected through the first four AVA doses for the first 1,005 participants were previously published.
Participants (n = 1,563) received a total of eight injections of AVA or saline placebo during 43 months (Figure 1). Following each injection, participants were routinely monitored (a) using a self-reported adverse event diary for 14 days after each of the first two doses and for 28 days after all subsequent doses and (b) by a study nurse during an in-clinic interview and exam at 15–60 min (not shown in Figure 1) and 1–3 days after all injections, 14 days after the first two injections, and 28 days after injections 3–8. Eight solicited injection-site adverse events (warmth, tenderness, itching, pain, arm motion limitation, redness, swelling, and bruise) were recorded both in the clinic record and separately by the individual participant in their diary.
For our comparison of agreement methods, we restricted the dataset to participants from study site A (n = 299) and compared the agreement for two solicited adverse events, redness and tenderness at the injection site, as recorded in the in-clinic record and in the participant’s diary. We further limited our dataset to the record obtained at the participant’s in-clinic visit scheduled for 1–3 days following each vaccination, and we compared redness and tenderness indicated in the clinical record and in the participant’s diary on the exact same date. This time point was chosen to maximize the chance that redness and tenderness at the injection site were recorded by both observers. We did not evaluate data from the early (15–60 min) in-clinic assessment, since participants were instructed to use the diary only to record adverse events occurring later in the day after their in-clinic visit, nor did we evaluate data from the much later (14/28 days) in-clinic evaluations.
Table 1 presents an example of the data layout of redness. The redness and tenderness measurements of one subject were from both diary and in-clinic measurements after each of the 8 injections, consisting of 16 observations. In addition, important covariates such as treatment arm, age, and gender are also included.
The following agreement approaches were considered: (1) modeling the observed agreement with generalized estimating equations (GEE), by Coughlin et al.; (2) an extended kappa statistic with two-stage logistic regression, by Lipsitz et al.; (3) an extended kappa statistic based on U-statistics, by Ma et al.; (4) the ICC on repeated measurements, by Carrasco et al.; (5) the CCC on repeated measurements with U-statistics, by King et al.; and (6) the weighted CCC on repeated measurements with variance components, by Carrasco et al.
In the following discussion of methods, a common notation is applied. Let $Y_{ijt}$ denote the binary reading from the $i$th subject by the $j$th observer at time point $t$, where $i = 1, \dots, N$ is the subject index; $j = 1, 2$ stands for the two observers; and $t = 1, \dots, T$. A new dependent variable is defined for agreement at time point $t$: if $Y_{i1t} = Y_{i2t}$, then $Z_{it} = 1$; otherwise $Z_{it} = 0$. Define $p_{it} = \Pr(Z_{it} = 1)$, the probability that the two observers agree on subject $i$ at time point $t$. In addition, baseline covariates were included in some methods.
3.1 Logistic regression modeling of the agreement proportion
Coughlin et al. introduced logistic modeling of inter-observer agreement. The dependent variable is defined to be 1 if the two raters agree and 0 otherwise. With $Z_{it}$ denoting this agreement indicator for subject $i$ at time point $t$ and $x_{i1}, \dots, x_{iM}$ denoting covariates, the logistic regression used to estimate the percent agreement, adjusted for covariates to obtain adjusted or subgroup-specific estimates, can be expressed as follows:

$$\operatorname{logit}\{\Pr(Z_{it} = 1)\} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_M x_{iM}.$$
If each of $N$ subjects is assigned independently by two raters to one of $I$ categories, then the cell frequencies ($n_{kk}$, $k = 1, \dots, I$) along the main diagonal of the two-way contingency table represent the agreement between the raters at time point $t$. The crude agreement is estimated as follows:

$$\hat{p}_o = \frac{1}{N} \sum_{k=1}^{I} n_{kk}.$$
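By applying this logistic regression approach, the proportion of agreement for particular subgroups can be estimated. Suppose there are $M$ explanatory variables $x_1, \dots, x_M$ with estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_M$; then the model-based agreement is

$$\hat{p} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_M x_M)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_M x_M)}.$$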
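As a rough illustration of the crude and model-based agreement proportions in this section, the following Python sketch computes the proportion of agreement at each time point and converts a linear predictor into an agreement proportion through the inverse logit. The function names, the toy data, and the coefficient values are hypothetical and are not taken from the trial; missing readings are assumed to be coded as `None`.

```python
import math

def crude_agreement(clinic, diary):
    """Proportion of subjects on whom the two observers agree, per time point.

    clinic, diary: one list of 0/1 readings per subject (None = missing),
    with one entry per time point. Pairs with a missing reading are skipped.
    """
    n_times = len(clinic[0])
    result = []
    for t in range(n_times):
        pairs = [(c[t], d[t]) for c, d in zip(clinic, diary)
                 if c[t] is not None and d[t] is not None]
        agree = sum(1 for a, b in pairs if a == b)
        result.append(agree / len(pairs) if pairs else float("nan"))
    return result

def model_based_agreement(intercept, coefs, covariates):
    """Inverse logit of the linear predictor: exp(eta) / (1 + exp(eta))."""
    eta = intercept + sum(b * x for b, x in zip(coefs, covariates))
    return 1.0 / (1.0 + math.exp(-eta))

# Toy data: 4 subjects, 3 time points (hypothetical values).
clinic = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
diary = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, None]]
agreement_by_time = crude_agreement(clinic, diary)
print(agreement_by_time)

# Hypothetical fitted intercept and time coefficient, evaluated at time 3.
print(model_based_agreement(2.0, [-0.1], [3]))
```

In an actual analysis the coefficients would come from a GEE fit that accounts for the repeated measurements on each subject; the sketch only shows how the estimates translate into agreement proportions.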
3.2 Extended Kappa statistic based on two-stage logistic regression
Although the agreement proportion is straightforward to estimate and interpret, two observers can agree simply by chance. Lipsitz et al. proposed a two-stage logistic regression to estimate kappa, which can also be applied to repeated binary measurements. Kappa was introduced by Cohen to assess the agreement of two methods or observers with binary readings. It is defined as $\kappa = (p_o - p_e)/(1 - p_e)$ and estimated by $\hat{\kappa} = (\hat{p}_o - \hat{p}_e)/(1 - \hat{p}_e)$, where $p_o$ denotes the observed agreement and $p_e$ denotes the agreement expected by chance, which is also the agreement under independence.
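To make the definition of kappa concrete, here is a minimal Python sketch for two raters with binary readings at a single time point. The function name and the toy ratings are hypothetical; the chance agreement is computed under independence of the two raters' marginal proportions, as described above.

```python
def cohen_kappa_binary(rater1, rater2):
    """Cohen's kappa for two equal-length lists of 0/1 ratings."""
    n = len(rater1)
    # Observed agreement: share of subjects with identical ratings.
    p_o = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    # Marginal proportions of "1" for each rater.
    p1 = sum(rater1) / n
    p2 = sum(rater2) / n
    # Agreement expected by chance (independence of the two raters).
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)

# Toy example: 10 subjects, the raters disagree on two of them.
rater1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater2 = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
kappa_example = cohen_kappa_binary(rater1, rater2)
print(round(kappa_example, 4))
```

Here the observed agreement is 0.8 but the chance agreement is 0.58, so kappa is noticeably lower than the raw agreement proportion.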
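Let $p_{i1t} = \Pr(Y_{i1t} = 1)$ denote the probability that the first rater records the measurement as “1” for subject $i$ at time point $t$. Similarly, let $p_{i2t} = \Pr(Y_{i2t} = 1)$ for the second rater; $\hat{p}_{i1t}$ and $\hat{p}_{i2t}$ are the corresponding estimates based on the following logistic regressions:

$$\operatorname{logit}(p_{i1t}) = \alpha_{10} + \alpha_{11} t, \qquad \operatorname{logit}(p_{i2t}) = \alpha_{20} + \alpha_{21} t,$$

where $\alpha_{10}$ is the intercept and $\alpha_{11}$ is the coefficient of the time variable $t$ for the first rater, and similarly $\alpha_{20}$ and $\alpha_{21}$ for the second rater. In summary, two steps are used to assess the agreement between two observers:

(1) Use standard logistic regression to obtain $\hat{p}_{i1t}$ and $\hat{p}_{i2t}$.

(2) Form the estimated offset, $\widehat{\text{offset}}_{it} = \operatorname{logit}(\hat{p}_{e,it})$, where $\hat{p}_{e,it} = \hat{p}_{i1t}\hat{p}_{i2t} + (1 - \hat{p}_{i1t})(1 - \hat{p}_{i2t})$ is the estimated chance agreement.

Finally, the model is $\operatorname{logit}\{\Pr(Z_{it} = 1)\} = \text{offset}_{it} + \gamma_0 + \gamma_1 x_{i1} + \cdots + \gamma_M x_{iM}$, where $Z_{it}$ equals 1 if the two raters agree and 0 otherwise, $\text{offset}_{it}$ is the pre-specified offset, $\gamma_0$ is the intercept, $\gamma_1$ is the coefficient of $x_{i1}$, $\gamma_2$ is the coefficient of $x_{i2}$, and so on. A summary measure of how agreement differs from chance agreement, for any given covariate pattern, is provided by the estimated linear predictor, $\hat{\eta}_{it} = \hat{\gamma}_0 + \hat{\gamma}_1 x_{i1} + \cdots + \hat{\gamma}_M x_{iM}$. Lipsitz et al. showed that the estimate of the kappa coefficient is

$$\hat{\kappa}_{it} = \frac{\hat{p}_{o,it} - \hat{p}_{e,it}}{1 - \hat{p}_{e,it}}, \qquad \hat{p}_{o,it} = \frac{\exp(\widehat{\text{offset}}_{it} + \hat{\eta}_{it})}{1 + \exp(\widehat{\text{offset}}_{it} + \hat{\eta}_{it})}.$$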
Although the jackknife estimator was originally proposed for this method, we applied a bootstrap standard error in our study. GEE with an unstructured correlation was also used. To keep the two stages consistent, any data point with an observation in only one of the in-clinic record or the diary was excluded. We adjusted only for time in all models fitted under this approach.
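The two ingredients of this section, a chance-corrected agreement derived from a second-stage linear predictor and a bootstrap standard error, can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' program: it assumes the offset is the logit of the chance agreement, resamples whole subjects so that repeated measurements stay together, and uses hypothetical function names and toy data. The bootstrap is shown for the observed agreement proportion only, to keep the example short.

```python
import math
import random
import statistics

def chance_agreement(p1, p2):
    """Agreement expected under independence of two binary raters."""
    return p1 * p2 + (1.0 - p1) * (1.0 - p2)

def kappa_from_linear_predictor(p1, p2, eta):
    """Kappa implied by a second-stage linear predictor eta.

    ASSUMPTION: the offset is logit(chance agreement), so eta = 0 means
    agreement equals chance. Requires 0 < chance agreement < 1.
    """
    p_e = chance_agreement(p1, p2)
    offset = math.log(p_e / (1.0 - p_e))
    p_o = 1.0 / (1.0 + math.exp(-(offset + eta)))
    return (p_o - p_e) / (1.0 - p_e)

def observed_agreement(subjects):
    """Share of (clinic, diary) pairs that agree, pooled over subjects."""
    pairs = [pair for subject in subjects for pair in subject]
    return sum(1 for a, b in pairs if a == b) / len(pairs)

def bootstrap_se(subjects, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error, resampling whole subjects with replacement."""
    rng = random.Random(seed)
    estimates = [statistic(rng.choices(subjects, k=len(subjects)))
                 for _ in range(n_boot)]
    return statistics.stdev(estimates)

# Toy data: each subject is a list of (clinic, diary) pairs over two doses.
subjects = [[(1, 1), (0, 0)], [(1, 0), (0, 0)], [(0, 0), (0, 1)],
            [(1, 1), (1, 1)], [(0, 0), (0, 0)], [(0, 1), (1, 1)]]
se = bootstrap_se(subjects, observed_agreement, n_boot=500, seed=1)
print(kappa_from_linear_predictor(0.5, 0.5, math.log(3.0)), se)
```

Resampling at the subject level, rather than at the level of individual readings, is the design choice that preserves the within-subject correlation across doses.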
3.3 Extended Kappa statistic based on U-statistics
Ma et al. introduced a new class of kappa coefficients based on U-statistics to handle the complexities of missing data and related issues arising from multirater scenarios and repeated categorical measurements. For illustration purposes, we consider only a longitudinal study with $n$ subjects, $j$ raters, $t$ assessments, and a binary outcome indexed by $g$. The index $g$ is added to create a dummy variable $Y_{ijtg}$ from each $Y_{ijt}$. For example, if $Y_{ijt} = 1$, then $Y_{ijtg} = 1$ ($g = 1$) and $Y_{ijtg} = 0$ ($g = 0$). On the other hand, if $Y_{ijt} = 0$, then $Y_{ijtg} = 0$ ($g = 1$) and $Y_{ijtg} = 1$ ($g = 0$). This notation is motivated by the need to accommodate the missing data structure. The estimate of kappa at time $t$ is given by the ratio of two U-statistics:
where i and u are two subjects belonging to
By the theory of U-statistics and the delta method, the proposed estimate of kappa is proved to be consistent and asymptotically normal. Further, the U-statistics based estimate in eq.  can be modified to account for missing data:
where if the jth rater’s rating on the ith subject is observed at the tth assessment and if the rating is missing, and represents the probability of missing data. This estimate is designed for modeling kappa under two missing data patterns – missing completely at random (MCAR) and missing at random (MAR). MCAR means that the missing data are independent of both the observed and the unobserved variables. On the other hand, MAR means that given the observed data, missingness does not depend on the unobserved data.
By the theory of multivariate U-statistics, the joint distribution of the kappas assessed at multiple time points is readily derived. Inferences for longitudinal kappas can be further developed based on their joint distribution. For example, one research interest in practice is to identify the trend in agreement over time. In particular, a test of equal agreement over time (i.e. $H_0\colon \kappa_1 = \kappa_2 = \cdots = \kappa_T$, where $\kappa_t$ is the kappa at time point $t$) is proposed in this method.
When applying this method to our illustrative dataset, we assumed the following major missing data pattern: (1) if the measurement was missing in the in-clinic record, then it was also missing in the participant’s diary (missing simultaneously); (2) if the record of injection-site redness was missing at one vaccine dose, then it was also missing at all of the participant’s later doses, representing a monotone missing data pattern (MMDP). In fact, in our study all missingness occurred in the in-clinic and diary records simultaneously, and 90% of the missing observations followed the MMDP. Missing in-clinic records were due to a participant’s missed in-clinic visit, and missing diary observations were due to the participant’s negligence. Furthermore, the occurrence of missing data in our study was not related to either prior or future observations. Therefore, we treated the missing data in our study as MCAR, under which the missingness probability in the weighted estimate is constant and can be estimated by the sample proportion.
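The two checks described in this paragraph, whether a subject's missingness is monotone and what proportion of subjects is observed at each dose, can be sketched in Python as follows. The function names and toy records are hypothetical. The sketch reports the observed proportion per time point; one minus that value is the missing proportion, and which of the two enters the original estimator is not recoverable from the text here.

```python
def is_monotone_missing(observed):
    """True if, once a reading is missing, all later readings are missing too.

    observed: sequence of 1 (observed) / 0 (missing) flags, one per dose.
    """
    seen_missing = False
    for is_observed in observed:
        if not is_observed:
            seen_missing = True
        elif seen_missing:
            return False  # an observed reading after a missing one
    return True

def observed_proportions(records):
    """Sample proportion of subjects with an observed reading at each dose."""
    n_subjects = len(records)
    n_times = len(records[0])
    return [sum(1 for r in records if r[t]) / n_subjects
            for t in range(n_times)]

# Toy observation flags for 4 subjects over 4 doses (hypothetical).
records = [
    [1, 1, 1, 1],  # complete
    [1, 1, 1, 0],  # monotone dropout
    [1, 1, 0, 0],  # monotone dropout
    [1, 0, 1, 1],  # intermittent (non-monotone) missingness
]
share_monotone = sum(is_monotone_missing(r) for r in records) / len(records)
proportions = observed_proportions(records)
print(share_monotone, proportions)
```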
3.4 ICC for binary repeated measurements
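Historically, agreement between quantitative measurements has been evaluated via the ICC. Numerous versions of the ICC [4–7, 18–20] have been proposed in many areas of research by assuming different underlying analysis of variance (ANOVA) models for the situation where none of the observers is treated as the reference. The simplest ICC is defined through the following one-way ANOVA model:

$$Y_{ij} = \mu + \alpha_i + e_{ij},$$

with assumptions $\alpha_i \sim N(0, \sigma_\alpha^2)$; $e_{ij} \sim N(0, \sigma_e^2)$; and $\alpha_i$ is independent of $e_{ij}$, where $i = 1, \dots, N$ is the subject index and $j = 1, \dots, k$ is the observer index. Then $\mathrm{ICC} = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_e^2)$, and its estimate is

$$\widehat{\mathrm{ICC}} = \frac{\mathrm{MSB} - \mathrm{MSW}}{\mathrm{MSB} + (k - 1)\,\mathrm{MSW}},$$

where MSB and MSW are the mean sums of squares from the one-way ANOVA model between and within subjects, respectively. This method has been extended to binary data by simply coding the readings as 0 or 1; for binary measurements, MSB and MSW are computed from the coded values.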
Besides this ANOVA estimator, Ridout et al. also introduced other estimation methods of the ICC for binary observations, such as the moment estimators, the quasi-likelihood and pseudo-likelihood estimators, and the maximum likelihood estimator for beta-binomial data. Furthermore, Carrasco et al. extended ICCs to repeated measurements. Consider the following linear mixed model:

$$Y_{ijt} = \mu + \alpha_i + \beta_j + \gamma_t + (\alpha\beta)_{ij} + (\alpha\gamma)_{it} + (\beta\gamma)_{jt} + e_{ijt},$$

where $\mu$ is the overall mean, $\alpha_i$ is the random subject effect assumed to be distributed as $N(0, \sigma_\alpha^2)$, $\beta_j$ is the fixed observer effect ($j = 1, 2$), $\gamma_t$ is the fixed time effect ($t = 1, \dots, T$), $(\alpha\beta)_{ij}$ is the random subject–observer interaction effect assumed to be distributed as $N(0, \sigma_{\alpha\beta}^2)$, $(\alpha\gamma)_{it}$ is the random subject–time interaction effect assumed to be distributed as $N(0, \sigma_{\alpha\gamma}^2)$, $(\beta\gamma)_{jt}$ is the fixed observer–time interaction effect, and $e_{ijt}$ is the random error effect, with $\mathbf{e}_i \sim N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\mathbf{e}_i$ is the vector of residuals of each subject. All the effects of the model are assumed to be independent. This ANOVA model is a special case of linear mixed models. Thus, the appropriate ICC for measuring agreement between observers (ICC2) is expressed in terms of the variance components of this model.
ICC2 is estimated by replacing the variance components in eq.  by their corresponding estimates obtained using restricted maximum likelihood (REML) estimation.
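For the simplest case discussed at the start of this section, the one-way ANOVA estimator of the ICC applied to 0/1-coded readings can be sketched in Python as below. The function name and toy data are hypothetical, and the sketch covers only the single-time-point ANOVA estimator, not the repeated-measures ICC estimated by REML.

```python
def icc_oneway(data):
    """One-way ANOVA estimator of the ICC.

    data: one list of readings per subject, with the same number k of
    observers for every subject (binary readings coded as 0/1).
    """
    n = len(data)
    k = len(data[0])
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    # Between-subject and within-subject sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((y - m) ** 2
                    for row, m in zip(data, subject_means) for y in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all readings are identical")
    return (ms_between - ms_within) / denominator

# Toy data: 6 subjects, 2 observers; the last two subjects are discordant.
ratings = [[1, 1], [1, 1], [0, 0], [0, 0], [1, 0], [0, 1]]
icc_example = icc_oneway(ratings)
print(round(icc_example, 4))
```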
3.5 CCC based on U-statistics
The CCC is commonly used for assessing agreement for continuous outcomes. It was first published by Lin for the simplest case, in which each of two raters makes one reading per subject. Lin’s CCC is defined as follows: assume that the paired observations $(Y_1, Y_2)$ are drawn from a bivariate distribution with mean vector $(\mu_1, \mu_2)$ and a variance-covariance matrix with variances $\sigma_1^2$ and $\sigma_2^2$ and covariance $\sigma_{12}$. Lin’s CCC between the two observers $Y_1$ and $Y_2$ is then

$$\rho_c = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} = \frac{2\rho\,\sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2},$$

where $\rho$ is the Pearson correlation coefficient between the two observers. King et al. introduced a generalized CCC for both continuous and categorical data. Furthermore, King et al. [22, 23] reported the extension of the CCC to repeated measurements, mainly with repeated continuous data.
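Lin’s CCC for a single reading per subject can be computed directly from sample moments. The Python sketch below uses moments with divisor n and a hypothetical function name and toy data; it illustrates the single-time-point coefficient only, not the repeated-measures extension.

```python
def lin_ccc(x, y):
    """Lin's concordance correlation coefficient from sample moments.

    Uses divisor n for the variances and covariance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Twice the covariance over the expected squared difference terms.
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

# A constant shift between observers lowers the CCC even though the
# Pearson correlation is exactly 1.
ccc_shifted = lin_ccc([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(ccc_shifted)
```

The shifted example shows why the CCC is an agreement index rather than a correlation: a systematic difference between observers is penalized through the squared mean difference in the denominator.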
Let the elements of the vector $\mathbf{Y}_1$ ($Y_{i1t}$, $t = 1, \dots, T$) represent the $t$th repeated measure on the $i$th subject by the first observer. Let the elements of the vector $\mathbf{Y}_2$ ($Y_{i2t}$, $t = 1, \dots, T$) represent the $t$th repeated measure on the $i$th subject by the second observer. Assume that $[\mathbf{Y}_1, \mathbf{Y}_2]$ is drawn from a multivariate normal population with mean vector $(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ and covariance matrix $\boldsymbol{\Sigma}$, which consists of the following four matrices: $\boldsymbol{\Sigma}_{11}$, $\boldsymbol{\Sigma}_{12}$, $\boldsymbol{\Sigma}_{21}$, and $\boldsymbol{\Sigma}_{22}$.
Extending from the derivation of Lin’s CCC shown in eq. , we can then construct a repeated measures CCC as
Following the extension in King et al. , eq.  has been extended to repeated categorical outcomes as:
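Weight 1: the weight matrix $D = (d_{ts})$, where $d_{ts}$ is the non-missing proportion at time point $t$ when $t = s$ and $d_{ts} = 0$ when $t \neq s$.
Weight 2: the weight matrix $D = (d_{ts})$, where $d_{ts} = T - t + 1$ when $t = s$ and $d_{ts} = 0$ when $t \neq s$.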
In the first set of weights, we used the actual non-missing proportions so that time points with lower missing proportions received more weight; it is reasonable to assign more weight to time points that carry more information. The second set of weights assigns more weight to the earlier time points, since the data from the earlier measurements are possibly more reliable because more data are available. A basic consideration for statistical inference concerning the repeated measures CCC is that its estimator can be expressed as a ratio of functions of U-statistics. The CCC based on U-statistics is not applicable when missing data occur. Therefore, in our study, if any observation for a participant was missing, we excluded all of that participant’s data.
3.6 Weighted CCC based on variance components
Carrasco et al.  proposed a CCC for repeated binary measurements through the appropriate specification of the ICC from a variance components linear mixed model. Combining the notations in Sections 3.4 and 3.5, , where D is an identity matrix, the expression of CCC under model (1) can be reduced to . In particular, the case of D as an identity matrix where if t s and 0 otherwise, then and the CCC becomes as shown in eq. . Besides the two weights, compound symmetry (CS) and first-order auto correlation (AR(1)) structures were considered with the CCC variance components method.
Of the total trial’s enrollment, 299 participants were enrolled at study site A. The mean age of these participants was 42.2 years with standard deviation 10.1 years. Fifty-one percent (155/299) of participants were female. Participants by treatment arms were 48 (16.1%) in 8 IM; 50 (16.7%) in 7 IM; 51 (17.1%) in 5 IM; 52 (17.4%) in 4 IM; 24 (8.0%) in the 8 IM placebo arm, and the remaining 24 (8.0%) in the 8 SQ placebo arm.
Tables 2 and 3 present the results of modeling the agreement proportions for redness and tenderness among the 299 participants enrolled at study site A. A univariate model with only the time effect and a multivariable model with age, gender, treatment arm, and time effect were considered. In both models, all data at each time point were fitted using a single model. In the univariate model, only the time effect was included, using GEE with an unstructured correlation. The significant time effect indicated that agreement proportions changed significantly across visits (p = 0.003 for redness and p = 0.005 for tenderness). In the multivariable model for redness, age (p = 0.02), visit time (p = 0.002), and treatment arm (p < 0.0001) were significantly associated with agreement; for tenderness, only visit time (p = 0.005) and treatment arm (p < 0.0001) were significant. Both the univariate and multivariable models showed high agreement proportions. After adjusting for age, gender, and treatment arm, the overall agreement proportions were almost all above 90% for redness and above 85% for tenderness.

However, two observers can agree with each other simply by chance, so the observed agreement proportion does not tell the complete story. We therefore used the modified kappa statistic, which takes chance agreement into account. Extended kappa coefficients based on two-stage logistic regression models and U-statistics are presented in Tables 4 and 5, respectively. For redness, the kappa statistic based on two-stage logistic regression was 0.719 with 95% CI (0.597, 0.833) at AVA dose 1 and 0.630 with 95% CI (0.522, 0.738) at AVA dose 8. The kappa coefficient based on U-statistics gave similar results: starting from 0.720 with 95% CI (0.591, 0.849) at AVA dose 1, kappa decreased to 0.646 with 95% CI (0.521, 0.771) at AVA dose 8. Kappa statistics were not significantly different across time (p = 0.8) for redness.
A kappa value from 0.2 to 0.4 indicates fair agreement, 0.4 to 0.6 moderate agreement, 0.6 to 0.8 good agreement, and greater than 0.8 excellent agreement. Therefore, good agreement was achieved when comparing the records of injection-site redness from the patients’ diaries and their in-clinic records. On the other hand, the kappa coefficients for tenderness ranged from 0.523 to 0.667 based on the two-stage logistic regression (Table 5). Kappa based on U-statistics gave comparable results except at the seventh time point, where the U-statistics estimate was 0.627 and the kappa based on two-stage logistic regression was 0.584. Kappa statistics were not significantly different across time (p = 0.7) for tenderness.
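The interpretation scale quoted above can be written as a small lookup, sketched here in Python. The function name is hypothetical, and the handling of the boundary values (each lower bound treated as inclusive) and of values below 0.2, which the text does not label, are assumptions of this sketch.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the verbal scale quoted in the text.

    ASSUMPTIONS: lower bounds are inclusive; values below 0.2 receive no
    label in the text, so a neutral placeholder is returned for them.
    """
    if kappa > 0.8:
        return "excellent"
    if kappa >= 0.6:
        return "good"
    if kappa >= 0.4:
        return "moderate"
    if kappa >= 0.2:
        return "fair"
    return "below fair (not labeled in the text)"

# Kappa estimates reported in this section.
for value in (0.719, 0.630, 0.523, 0.667):
    print(value, interpret_kappa(value))
```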
Table 6 combines the results of the ICC with CS and AR(1) correlation structures, the CCC based on U-statistics with two weights, and the CCC based on variance components with CS and AR(1) correlations and the two weights specified in Section 3.5. Here, weight 1, based on non-missing proportions, is (0.98, 0.95, 0.95, 0.91, 0.90, 0.89, 0.82, 0.77), and weight 2 is simply (8, 7, 6, 5, 4, 3, 2, 1). Overall agreement across the eight doses is presented for each method; the CCC based on U-statistics with weight 2 gave the highest estimate for redness, 0.7090 with 95% CI (0.6614, 0.7510). More discrepancy across methods and weighting schemes was observed for tenderness. The ICC and the CCC based on variance components with weight 2 gave almost identical results. However, the CCC based on variance components with weight 1 gave much higher agreement for tenderness between the in-clinic interview and the patients’ diaries. For example, CCC_VC with AR(1) was 0.6861 with weight 1 and 0.5943 with weight 2. The CCC based on U-statistics gave comparable results regardless of which weight was chosen (Table 7).
In this study, six approaches for measuring agreement on repeated binary outcomes were reviewed and applied to our illustrative dataset. In general, these methods are suited to different situations. In Table 8, we list the details and main characteristics of each method and indicate when we recommend using each of them. The six approaches can be classified by their model assumptions into the following categories: (1) fully parametric models, which are based on linear mixed models and estimated under the maximum likelihood paradigm, such as the ICC and the weighted CCC based on variance components; (2) semi-parametric models, which are based on GEE, such as logistic regression modeling of the agreement proportion and the extended kappa statistic based on two-stage logistic regression; and (3) nonparametric models, which are based on U-statistics, such as the extended kappa and the CCC based on U-statistics. The sample programs can be found at http://web1.sph.emory.edu/observeragreement/ and can also be requested from the authors.
In practice, estimating the agreement proportion has been widely used in medical research [25–27]. Besides the unstructured correlation, other correlation structures such as AR(1) and CS can also be considered. Furthermore, important covariates that may affect agreement proportions can be included in the logistic modeling. However, this method does not overcome potential limitations of general observations of agreement such as the tendency for percent agreement to be high whenever the frequency of a particular diagnostic category is very low or very high. Therefore, estimates of kappa, which take chance agreement into account, may be preferable to the agreement proportion method.
The two extended kappa-based methods serve as good alternatives to modeling only the agreement proportion. In the logistic regression model of Lipsitz et al. for chance-corrected agreement, the offset term ensures that agreement due to chance is properly accounted for when attempting to identify covariates that are predictive of agreement. In this method, any data point with an observation in only one of the in-clinic record or the diary was excluded. Although results adjusted for covariates are not presented, covariates can also be included in the two-stage logistic model. Besides the GEE method, which accounts for the correlations among different time points for the same subject, random effects models can be applied. On the other hand, Ma et al. developed an approach to address missing data when modeling multiobserver kappa within a longitudinal study setting. In particular, they extended the MMDP assumption for longitudinal data analysis involving a single response to a bivariate setting and integrated the inverse probability weighting approach with the theory of U-statistics. The missingness in “redness” follows MMDP and MCAR. Comparing the two extended kappa coefficients, the model-based estimation requires more assumptions to specify the model correctly, while the U-statistics approach is nonparametric.
The CCC based on U-statistics can be applied as an extension of Lin’s CCC to responses measured repeatedly over time or clustered by some other design. This method can handle any number of repeated measurements, and the variance can be estimated in a straightforward manner by U-statistics methodology. However, the CCC based on U-statistics is not applicable when any data are missing or when the design is unbalanced: if one data point is missing for either method at any time point, the whole subject is removed from the dataset. On the other hand, the ICC and the CCC based on variance components were built on the random effects model described in Section 3.4. The ICC can be considered a special case (unweighted version) of the weighted CCC when D is an identity matrix. In addition, we found that the standard error is substantially higher for both CCC_U methods than for the ICC and CCC_VC approaches. It may be possible that the ICC and CCC_VC underestimate the SE, so that a smaller SE means worse performance in this case. This finding is consistent with Carrasco et al. Thus, the VC approach, or more properly the maximum likelihood approach, may give a smaller SE, but this does not always mean “better performance”. Furthermore, the ICC and CCC_VC can accommodate multiple raters and possible covariates in the model.
As explained in Section 3.4, one of the approaches that Ridout et al. evaluated was the ICC based on the linear mixed model (ANOVA) with the binary data coded as 0 and 1. In a more recent article on agreement, Zou and Donner estimated the ICC for binary data in the same way. In fact, it is not uncommon to use a linear mixed model to estimate variance components for binary data. In estimating those variance components, the categorical or binary nature of the response is usually ignored, and the analysis is carried out using ANOVA or mixed models. The rule of thumb generally applied is that the ANOVA is reasonably accurate as long as the proportions in each of the categories of the response are not extreme. The variance components approach in Carrasco et al. is similar to that of these articles, because the fixed effects and variance components are estimated from the linear mixed model by restricted maximum likelihood (REML), which gives the same estimates as ANOVA in the case of a balanced design.
In summary, all six statistical methods give us comparable estimates, indicating good agreement for assessing the record of redness and moderate to good agreement on tenderness between participants’ diaries and their in-clinic record. However, each approach has its own pros and cons under different situations. This article provides several alternatives for assessing the agreement with longitudinal binary data. We hope this article can inspire researchers to choose the most appropriate method to assess agreement for their own study.
The authors thank Dr. David G. Kleinbaum and other colleagues at CDC for their helpful review of the manuscript. The authors thank Dr. Lawrence Lin for his very insightful suggestions for reviewing this manuscript. The authors also thank Kamesha Smith from ISO, CDC for drawing Figure 1.
Conflict of interest
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention (CDC). Mention of a product or company name does not constitute endorsement by the CDC. The protocol for this study was approved by an Institutional Review Board of the CDC. Dr. Gregory Poland is the chair of a Safety Evaluation Committee for novel investigational vaccine trials being conducted by Merck Research Laboratories. Dr. Poland offers consultative advice on vaccine development to Merck & Co. Inc., CSL Biotherapies, Avianax, Sanofi Pasteur, Dynavax, Novartis Vaccines and Therapeutics, PAXVAX Inc, and Emergent Biosolutions. These activities have been reviewed by the Mayo Clinic Conflict of Interest Review Board and are conducted in compliance with Mayo Clinic Conflict of Interest policies. This research has been reviewed by the Mayo Clinic Conflict of Interest Review Board and was conducted in compliance with Mayo Clinic Conflict of Interest policies. Rest of the authors have conflicts of interest to declare.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46.
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
Ebel RL. Estimation of the reliability of ratings. Psychometrika 1951;16:407–24.
Haggard EA. Intraclass correlation and the analysis of variance. New York: Dryden Press, 1958.
Marano N, Plikaytis BD, Martin SW, et al. and Anthrax Vaccine Research Program Working Group. Effects of a reduced dose schedule and intramuscular administration of anthrax vaccine adsorbed on immunogenicity and safety at 7 months: a randomized trial. JAMA 2008;300:1532–43.
Lipsitz SR, Parzen M, Fitzmaurice GM, Klar N. A two-stage logistic regression model for analyzing inter-rater agreement. Psychometrika 2003;68:289–98.
Diggle PJ, Liang KY, Zeger SL. Analysis of longitudinal data. Oxford: Clarendon Press, 1994.
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
Liang KY, Zeger SL, Qaqish B. Multivariate regression analysis for categorical data. J R Stat Soc Ser B 1992;54:3–40.
Muller R, Buttner P. A critical discussion of intraclass correlation coefficients. Stat Med 1994;13:2465–76.
Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther 1994;74:777–88.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996;1:30–46.
Phillips KA, Milne RL, Buys S, Friedlander ML, Ward JH, McCredie MRE, et al. Agreement between self-reported breast cancer treatment and medical records in a population-based breast cancer family registry. J Clin Oncol 2005;23:4679–86.
Phillip D, Lyngberg AC, Jensen R. Assessment of headache diagnosis: a comparative population study of a clinical interview with a diagnostic headache diary. Cephalalgia 2007;27:1.
Tannenbaum C, Brouillette J, Corcos J. Rating improvements in urinary incontinence: do patients and their physicians agree? Age Ageing 2008;37:379–83.
Carrasco JL, Jover L, King T, Chinchilli VM. Comparison of concordance correlation coefficient estimating approaches with skewed data. J Biopharm Stat 2007;17:673–84.