In 1965 Roy Barnett, in consultation with William Youden, proposed a systematic scheme for comparing methods that is still remarkably relevant after more than five decades. In the interim, numerous papers and guidelines on the topic have appeared. Broadly speaking, there are two schools of thought: one advocates scatter plots, regression and correlation, and the other promotes difference plots and analysis of differences. The CLSI recommends both, as in fact Barnett did.
Our comments are directed towards the comparison of field methods, with the assumption that they have been validated and have known performance characteristics. We advocate a simple and consistent approach based on fundamental aspects and principles. The focus should be primarily on common sense, “fit-for-purpose” criteria, judicious use of statistical analysis and, to some extent, “value judgements”. Results should be presented in a way that enables a third party to critically evaluate the data and to draw their own conclusions. Wherever possible, the data and statistical parameters should be presented in the actual units of measurement. Imprecision is often presented in relative units (coefficient of variation), which has led to the misconception that the precision of methods is worse at low concentrations; the opposite is in fact true when one considers the variability in actual units (standard deviation [SD]). Presenting information as ratios (percentages or logarithms) can potentially obscure the relationship, or the absence thereof, between methods. Transformation of data and elaborate statistical procedures should be a last resort, as they create difficult-to-grasp complexity.
It is important to realise that in a chemical measurement procedure (method), a response signal is generated that is proportional to the concentration of the analyte of interest. This signal is used to estimate the concentration via a calibration procedure. The evaluated field methods are assumed to have no systematic bias but are always accompanied by imprecision (“noise”). Thus, the starting point is an expectation that bias is absent and that differences in responses can be explained by the imprecision of both methods. This can be assessed from a scatter plot, which is intuitively easy to interpret. The same information can be presented in a difference plot, which is simply a derivative of the former. The main advantage of difference plots is that they present the variation between the methods in a way that may be easier to relate to analytical imprecision than a plot of the residuals or the original scatter plot.
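The relationship between the two plots can be illustrated with a minimal sketch. It assumes paired singleton results and plots each difference against the mean of the two methods; both the function name and the choice of the mean (rather than one method) for the horizontal axis are assumptions made here for illustration.

```python
def difference_plot_points(method_a, method_b):
    """Convert paired results into (mean, difference) coordinates.

    Each scatter-plot point (a, b) becomes one difference-plot point:
    x = mean of the two results, y = b - a, in the original units.
    """
    if len(method_a) != len(method_b):
        raise ValueError("paired results must have equal length")
    return [((a + b) / 2.0, b - a) for a, b in zip(method_a, method_b)]
```

No information is added or lost in this step; the same pairs are simply re-expressed so that the vertical axis shows the between-method difference directly.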
Differences between methods are due to analytical errors, and an appreciation of the error components is therefore essential. Errors can be classified as systematic or random. Systematic error includes constant and proportional bias and is the result of errors in calibration. Random error consists of imprecision and sample-method error (bias). Sample-method bias, also called patient/method interaction, method-related factors, between-method analytical variation and aberrant sample bias (matrix effect), is often neglected. It is an error (bias) that is unique to a sample and is caused by non-specificity. This error may contribute substantially to the apparent random differences between methods, and it may also affect bias, especially if the interference is predominantly one-sided. Another type of error is the outlier, a large error that is distinct from the distribution of the majority of results. IUPAC refers to these errors as blunders, whereas Barnett called them “large discrepancies”. Barnett recommended that these results be removed and investigated separately.
Method comparison can be viewed as a process by which the error components are characterised. Errors within acceptable limits indicate adequate agreement and that the test method is fit-for-purpose. These limits should be determined a priori and based on what is deemed clinically acceptable. Unexpectedly large differences indicate an error in one or both methods; this requires further scrutiny and may mandate more complicated statistical analysis. Hypothesis testing, where the null hypothesis states that there is no systematic difference and that the distribution of differences is due to imprecision, is straightforward.
We recommend a stepwise approach to method comparison that will identify all components of error in the majority of cases (Figure 1). The acceptable error limits that will identify the test method as fit-for-purpose should be stated beforehand (a priori). First, characterise the imprecision of each method accurately across an appropriate measuring range (Figure 1A). This information may be presented via a characteristic function of SD vs. concentration. The uncertainty is invariably heteroscedastic, whereas homoscedasticity is often assumed with statistical analysis. By limiting the data range, imprecision may approximate homoscedasticity.
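A characteristic function can be sketched as follows. The functional form used here, a constant SD near zero combined in quadrature with a constant CV at high concentrations, is a commonly used model and an assumption of this sketch, as are the parameter names and values; it is not taken from Figure 1A. The sketch also illustrates why the SD in actual units is smallest at low concentrations even though the CV is largest there.

```python
import math

def characteristic_sd(conc, sd_zero, cv_high):
    """SD in measurement units at concentration `conc`.

    Assumed model: sd(c) = sqrt(sd_zero^2 + (cv_high * c)^2), where
    sd_zero is the SD near zero concentration and cv_high is the
    fractional CV approached at high concentrations.
    """
    return math.sqrt(sd_zero ** 2 + (cv_high * conc) ** 2)

def cv_percent(conc, sd_zero, cv_high):
    """Relative imprecision (CV, %) implied by the same model."""
    return 100.0 * characteristic_sd(conc, sd_zero, cv_high) / conc

# Hypothetical parameters, for illustration only.
SD_ZERO = 1.0   # measurement units
CV_HIGH = 0.05  # 5 % at high concentrations

for c in (5.0, 50.0, 500.0):
    sd = characteristic_sd(c, SD_ZERO, CV_HIGH)
    cv = cv_percent(c, SD_ZERO, CV_HIGH)
    print(f"conc={c:7.1f}  SD={sd:6.2f}  CV={cv:5.1f}%")
```

Under this model the SD rises with concentration (heteroscedasticity) while the CV falls, so restricting the comparison to a narrow range keeps the SD roughly constant.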
Next, use a scatter plot for viewing the comparative results (Figure 1B). Singleton results should be plotted, and transformation of data should be a last resort. Use first-line statistical analysis and interpret statistical parameters in context. Trim the data by excluding extreme data points, and limit the range to suit the intended purpose of the method. Should any outliers be identified, they must be removed and investigated separately.
Determine the slope and intercept on the trimmed data by using a regression procedure that takes the imprecision of both methods into account (Figure 1C). Any significant deviation of the slope from 1 or of the intercept from 0 (the line of identity) indicates a systematic bias in one or both of the methods. Correct the bias and use the adjusted results for further interrogation with a difference plot and analysis of differences (Figure 1D). It should be emphasised that the imprecision data (SD) should also be rescaled to reflect this correction before proceeding with further analysis.
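One regression procedure that allows for imprecision in both methods is Deming regression; the text does not name a specific procedure, so its use here is an assumption. The sketch below also assumes a constant ratio of error variances over the trimmed range (approximate homoscedasticity), and the function names are illustrative. It does not test significance, which would require confidence intervals for the slope and intercept.

```python
import math

def deming_fit(x, y, error_var_ratio=1.0):
    """Deming regression of y on x.

    error_var_ratio = (error variance of y) / (error variance of x),
    e.g. SD_y**2 / SD_x**2 from the imprecision data.
    Returns (slope, intercept).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired results")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    if sxy == 0:
        raise ValueError("no covariance between the methods")
    d = error_var_ratio
    term = syy - d * sxx
    slope = (term + math.sqrt(term ** 2 + 4.0 * d * sxy ** 2)) / (2.0 * sxy)
    return slope, mean_y - slope * mean_x

def correct_bias(y, slope, intercept):
    """Map test-method results onto the comparison method's scale."""
    return [(yi - intercept) / slope for yi in y]

def rescale_sd(sd_y, slope):
    """Rescale the test method's SD to match the corrected results."""
    return sd_y / abs(slope)
```

For example, `slope, intercept = deming_fit(x, y, error_var_ratio=sd_y**2 / sd_x**2)` followed by `correct_bias(y, slope, intercept)` gives adjusted results whose differences from `x` no longer contain the constant and proportional components.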
The distribution of the unbiased differences now reflects the imprecision and sample-method bias of both methods. Plotting the differences in the presence of a significant constant or proportional error between the methods and reporting a 95% limit of agreement is as meaningless as omitting an a priori statement on what the acceptable agreement limits should be. The SD of the differences (SDD) can be predicted from the imprecision parameters. The discrepancy between the observed SDD and the predicted SDD reflects the sample-method bias and can be quantified. When the distribution of differences approximates homoscedasticity, constant limits will do. Otherwise, alternative approaches are required, such as bins, proportional 95% limits or transformation of the data. Identification of significant errors will require more in-depth analysis.
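The prediction and the quantification can be sketched as follows. The sketch assumes singleton results, independent errors in the two methods and approximately constant SDs over the range examined, so that variances add; the function names are illustrative, and subtracting variances to isolate the sample-method component is an assumption of this sketch.

```python
import math
import statistics

def predicted_sdd(sd_a, sd_b):
    """SDD expected from imprecision alone: sqrt(SD_a^2 + SD_b^2)."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2)

def observed_sdd(method_a, method_b):
    """Sample SD of the paired differences (after bias correction)."""
    differences = [b - a for a, b in zip(method_a, method_b)]
    return statistics.stdev(differences)

def sample_method_sd(sdd_observed, sdd_predicted):
    """SD attributable to sample-method bias.

    Returns 0.0 when the observed SDD does not exceed the prediction,
    i.e. when imprecision alone explains the differences.
    """
    excess = sdd_observed ** 2 - sdd_predicted ** 2
    return math.sqrt(excess) if excess > 0 else 0.0

def constant_limits(mean_difference, sdd):
    """Constant 95 % limits, suitable only when differences are homoscedastic."""
    return mean_difference - 1.96 * sdd, mean_difference + 1.96 * sdd
```

The limits computed from the predicted SDD show what imprecision alone would allow; comparing them with the a priori acceptable limits and with the observed spread indicates whether sample-method bias matters in practice.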
An example of this approach is the comparison of two entirely different total CO2 methods, one directly measured and one calculated with the Henderson-Hasselbalch equation. The controversy about their agreement was settled by showing that the sample-method bias was insignificant. By contrast, when cardiac troponin I methods were compared, the sample-method error was found to be an order of magnitude larger than the error caused by imprecision. Interestingly, sample-method bias was ignored, whereas precision was targeted with zeal in an attempt to improve method performance, thereby popularising misnomers such as “highly sensitive” to describe troponin. Cardiac troponin provides an example of heteroscedasticity in which bins were used and, although probably not ideal, differences were transformed into percentages. The effect of sample-method bias is accentuated when measuring troponin in healthy individuals. When two cardiac troponin I methods were compared using samples from healthy individuals, correlation was lacking (R² = 0.18) and the distribution of differences could not be explained by imprecision. In fact, the sample-method bias was the major contributor to the differences between the methods, with imprecision playing a minor role (SDD predicted 1.9 ng/L and observed 11.1 ng/L). This mostly ignored inaccuracy of cardiac troponin assays has significant clinical implications.
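As a rough check on the troponin figures quoted above, and assuming that imprecision and sample-method bias are independent so that their variances add (an assumption of this sketch, not a calculation reported in the text), the SD attributable to sample-method bias can be estimated as:

```latex
\begin{align*}
SD_{\text{sample-method}}
  &= \sqrt{SDD_{\text{observed}}^{2} - SDD_{\text{predicted}}^{2}} \\
  &= \sqrt{11.1^{2} - 1.9^{2}}
   = \sqrt{123.21 - 3.61}
   = \sqrt{119.60}
   \approx 10.9\ \text{ng/L}
\end{align*}
```

On this reckoning, imprecision accounts for about 3.61/123.21, or roughly 3%, of the variance of the differences, which is consistent with the statement that it played a minor role.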
To summarise, the approach that involves a scatter plot and a difference plot is in line with CLSI recommendations, with some common-sense adjustments. The major points are that methods are initially assumed to have no calibration bias and that all differences can be explained by their imprecision. We recommend that difference plots and analysis of differences be performed after correction of constant and proportional error. Any significant (not fit-for-purpose) observed deviation from these assumptions is the result of one or more components of error, all of which can be characterised. The decision whether these errors and methods are acceptable will depend on a priori performance limits or judgement calls. Errors that can be fixed should be fixed; at worst, they should be managed by changing reference intervals and decision limits.
Stöckl D, Dewitte K, Thienpont LM. Validity of linear regression in method comparison studies: is it limited by the statistical model or the quality of the analytical input data? Clin Chem 1998;44:2340–6.
Dewitte K, Fierens C, Stöckl D, Thienpont LM. Application of the Bland-Altman plot for interpretation of method-comparison studies: a critical investigation of its practice. Clin Chem 2002;48:799–801.
CLSI. Measurement procedure comparison and bias estimation using patient samples; Approved Guideline – Third Edition. CLSI document EP09-A3. CLSI 2013.
Currie LA. Nomenclature in evaluation of analytical methods including detection and quantification capabilities (IUPAC recommendations 1995). Pure Appl Chem 1995;67:1699–723.
Petersen PH, Stöckl D, Blaabjerg O, Pedersen B, Birkemose E, Thienpont L, et al. Graphical interpretation of analytical data from comparison of a field method with a reference method by use of difference plots. Clin Chem 1997;43:2039–46.
Ungerer JP, Marquart L, O’Rourke PK, Wilgen U, Pretorius CJ. Concordance, variance, and outliers in 4 contemporary cardiac troponin assays: implications for harmonisation. Clin Chem 2012;58:274–83.
About the article
Published Online: 2017-11-02
Published in Print: 2017-11-27
Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.