The Bland–Altman method is widely used for assessing agreement between two methods of measurement. However, the problem of sample size estimation for this method remains unsolved. We propose a new method of sample size estimation for Bland–Altman agreement assessment. According to the Bland–Altman method, the conclusion on agreement is based on the width of the confidence interval for the limits of agreement (LOAs) in comparison with a predefined clinical agreement limit. Under the theory of statistical inference, formulae for sample size estimation are derived; they depend on the pre-determined levels of α and β, the mean and the standard deviation of the differences between the two measurements, and the predefined limits. With this new method, sample sizes are calculated under different parameter settings that occur frequently in method comparison studies, and Monte-Carlo simulation is used to obtain the corresponding powers. The simulation results show that the achieved powers coincide with the pre-determined levels of power, thus validating the correctness of the method. The method of sample size estimation can be applied in the Bland–Altman method to assess agreement between two methods of measurement.
The original article by Bland and Altman, which proposed the method of agreement analysis, has received more than 30,000 citations in the biomedical literature, and its use has increased in recent years. When Nature asked Thomson Reuters, which now owns the SCI, to list the 100 most highly cited papers of all time, the third most frequently cited statistics paper (number 29) was the 1986 publication by the British statisticians Martin Bland and Douglas Altman, who proposed a technique now known as the Bland–Altman plot. It has been stressed that estimates of the limits of agreement (LOAs) should be accompanied by confidence intervals. However, Bland and Altman found that “confidence intervals are seldom quoted” in reports of method comparison studies [3–5]. A survey of reporting standards for Bland–Altman agreement analysis in laboratory research found that the confidence limits of the LOAs were reported in only 6 % of 50 studies published later than 2012. Many researchers forget that Bland and Altman presented the limits of agreement as a reference interval only. The LOAs do not guarantee coverage of the range of potential differences between the two measurements and cannot be used directly for statistical inference. As Hamilton and Stamey pointed out, the Bland–Altman limits of agreement by themselves provide only a reference interval and should never be used as the determining factor to conclude agreement between two methods. They advised that future researchers should take this variability into account and always provide confidence intervals when using the LOAs approach. As in any other clinical trial, it is essential to determine the sample size for method comparison studies. Although Bland has given a sample size calculation for a study of agreement between two methods of measurement, available from his website, that sample size is determined without considering the power of the statistical procedure and cannot guarantee the power of the test.
Sample size calculations were performed in only 30 % of the publications reviewed. Bland and Altman hoped to find time to publish on some of these topics, for example on sample size estimation for measurement method agreement studies, but up to now the problem has not been solved satisfactorily.
In this paper we propose a method to calculate sample size for Bland–Altman method, and Monte-Carlo simulations are used to validate the correctness of the method. In Section 2, we introduce the assumptions and theory of Bland–Altman method. Section 3 focuses on the derivation of the new method of sample size estimation for Bland–Altman method. In Section 4, Monte-Carlo simulation is used to obtain the corresponding powers. The results of Monte-Carlo simulation showed that the achieved powers could coincide with the pre-determined level of powers, thus validating the correctness of the method. In Section 5, we show a clinical worked example from a set of measured data of free prostate specific antigen (FPSA), which is often used to evaluate the presence of prostate cancer and other prostate disorders. We give concluding remarks in Section 6.
Suppose that the measurements of two methods are made on each of n subjects drawn from some population of interest. Suppose further that the two measurements are $x_i$ and $y_i$, respectively, and that their difference is $d_i = x_i - y_i$ for subject i (i = 1, 2, …, n).
The important first step of the Bland–Altman method is to plot the data and to check their pattern and distribution. The differences between the two methods are plotted against their means, and if the data are well-behaved, then construction of the various limits and interpretation of the data are simple and straightforward. The assumptions of the limits of agreement method are that the differences resulting from the two measurements have an approximately normal distribution, that the variance of the differences is constant, and that there is no proportional bias. Proportional bias is present when the differences increase or decrease in proportion to the average values.
2 LOAs and confidence interval estimation
Suppose the difference variable is a random variable which follows a normal distribution with mean μ and variance σ². It is well known that 100(1−γ) % of the population lies between $\mu - z_{1-\gamma/2}\sigma$ and $\mu + z_{1-\gamma/2}\sigma$. In practice, both μ and σ are unknown and have to be estimated. We take the sample mean $\bar{d}$ and the sample variance $s_d^2$ of the differences as estimators of μ and σ², respectively.
The LOAs can be calculated as
$$LOA_L = \bar{d} - z_{1-\gamma/2}\, s_d, \qquad LOA_U = \bar{d} + z_{1-\gamma/2}\, s_d$$
where $z_{1-\gamma/2}$ is the 100(1−γ/2)th percentile of the standard normal distribution, $s_d$ is the standard deviation of the differences, $LOA_U$ is the upper limit of agreement, and $LOA_L$ is the lower limit. 95 % LOAs are the most common; they are the mean minus 1.96 standard deviations and the mean plus 1.96 standard deviations, respectively. These limits are expected to contain 95 % of the paired differences between measurements by the two methods.
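As a minimal illustration of the LOA calculation, the following Python sketch computes the mean difference plus or minus $z_{1-\gamma/2}$ standard deviations from paired measurements. The function name `limits_of_agreement` and the sample data are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist, mean, stdev


def limits_of_agreement(x, y, gamma=0.05):
    """Return (lower, upper) 100(1 - gamma)% limits of agreement.

    x, y: paired measurements by the two methods (same subjects, same order).
    """
    d = [a - b for a, b in zip(x, y)]        # per-subject differences
    d_bar = mean(d)                          # estimate of mu
    s_d = stdev(d)                           # sample SD (n - 1 denominator)
    z = NormalDist().inv_cdf(1 - gamma / 2)  # about 1.96 for gamma = 0.05
    return d_bar - z * s_d, d_bar + z * s_d


# Hypothetical paired data, for illustration only
x = [10.1, 9.8, 10.4, 10.0, 9.9]
y = [10.0, 10.0, 10.0, 10.0, 10.0]
lower, upper = limits_of_agreement(x, y)
print(f"95% LOAs: {lower:.3f} to {upper:.3f}")
```

With only five pairs these limits are very imprecise, which is exactly why the confidence interval of the LOAs matters.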
It is important to consider the confidence interval of the LOAs. Following Bland and Altman, the confidence limits of the LOAs can be calculated as
$$\bar{d} - z_{1-\gamma/2}\, s_d - t_{1-\alpha/2,\,n-1}\, s_d\sqrt{\frac{1}{n} + \frac{z_{1-\gamma/2}^2}{2(n-1)}}, \qquad \bar{d} + z_{1-\gamma/2}\, s_d + t_{1-\alpha/2,\,n-1}\, s_d\sqrt{\frac{1}{n} + \frac{z_{1-\gamma/2}^2}{2(n-1)}}$$
These are the lower confidence limit of the lower LOA and the upper confidence limit of the upper LOA, where n is the sample size and $t_{1-\alpha/2,\,n-1}$ is the 100(1−α/2)th percentile of Student's t-distribution with n−1 degrees of freedom. Generally, we set α and γ to 0.05. If the 95 % confidence interval for the 95 % LOAs lies within the pre-defined, clinically acceptable agreement limits, the two methods agree sufficiently to fulfil the agreement requirements. In fact, the confidence intervals for the LOAs correspond exactly to hypothesis tests. Let A be the lower limit and B the upper limit of the LOAs of the population differences, and let δ be the pre-defined clinical agreement limit. We can construct the following simultaneous hypotheses: H01 is A < –δ, H11 is A ≥ –δ, and H02 is B > δ, H12 is B ≤ δ. When the two null hypotheses are rejected simultaneously, the two measurements are inferred to agree. The hypotheses of the Bland–Altman method are quite similar to those of equivalence testing.
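The decision rule can be sketched in Python as follows. This is an illustration only: it assumes the Bland–Altman (1999) standard error of a limit of agreement, $s_d\sqrt{1/n + z^2/(2(n-1))}$, and the function names `loa_confidence_limits` and `methods_agree` are hypothetical. SciPy is used only for the Student t quantile.

```python
import math
from statistics import NormalDist, mean, stdev

from scipy.stats import t


def loa_confidence_limits(d, alpha=0.05, gamma=0.05):
    """Lower confidence limit of the lower LOA and upper limit of the upper LOA."""
    n = len(d)
    d_bar, s_d = mean(d), stdev(d)
    z = NormalDist().inv_cdf(1 - gamma / 2)
    # Assumed standard error of a single limit of agreement
    se = s_d * math.sqrt(1 / n + z**2 / (2 * (n - 1)))
    t_crit = t.ppf(1 - alpha / 2, n - 1)
    return d_bar - z * s_d - t_crit * se, d_bar + z * s_d + t_crit * se


def methods_agree(d, delta, alpha=0.05, gamma=0.05):
    """True when the outer confidence limits lie within [-delta, delta]."""
    lower, upper = loa_confidence_limits(d, alpha, gamma)
    return bool(lower >= -delta and upper <= delta)


# Hypothetical differences, for illustration only
d = [0.1, -0.2, 0.4, 0.0, -0.1]
print(loa_confidence_limits(d))
print(methods_agree(d, delta=1.5), methods_agree(d, delta=0.9))
```

Rejecting both null hypotheses H01 and H02 corresponds to `methods_agree` returning `True`.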
3 Sample size formulae
Because the confidence interval estimation of the LOAs is symmetric in μ and –μ, the sample sizes for these two situations are the same, so we discuss only the case μ ≥ 0. According to the statistical inference principle of the Bland–Altman limits of agreement, we can separate the total type I error α into two parts, each equal to α/2. Similarly, we can separate the total type II error β into two parts: β1, the type II error for the upper limit of the LOAs, and β2, the type II error for the lower limit of the LOAs (Figure 1).
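We can get a direct estimate of $\beta_1$ and $\beta_2$:
$$\beta_1 = \Phi\!\left(z_{1-\alpha/2} - \frac{\delta - \mu - z_{1-\gamma/2}\,\sigma}{se}\right), \qquad \beta_2 = \Phi\!\left(z_{1-\alpha/2} - \frac{\delta + \mu - z_{1-\gamma/2}\,\sigma}{se}\right)$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution, $se$ is the standard error of the lower or upper limit of the LOAs, $se = \sigma\sqrt{\frac{1}{n} + \frac{z_{1-\gamma/2}^2}{2(n-1)}}$, and $\delta$ is the maximum allowable difference that can be accepted clinically, which needs to be defined in advance.
According to statistical distribution theory, it is better to calculate the type II error under the assumption of a non-central t-distribution, that is:
$$\beta_1 = T_{n-1}\!\left(t_{1-\alpha/2,\,n-1} \mid \tau_1\right), \qquad \beta_2 = T_{n-1}\!\left(t_{1-\alpha/2,\,n-1} \mid \tau_2\right)$$
where $T_{n-1}(\cdot \mid \tau)$ denotes the cumulative distribution function of a non-central Student's t-distribution with $n-1$ degrees of freedom and non-centrality parameter $\tau$.
The non-centrality parameters $\tau_1$ and $\tau_2$ are defined as
$$\tau_1 = \frac{\delta - \mu - z_{1-\gamma/2}\,\sigma}{se}, \qquad \tau_2 = \frac{\delta + \mu - z_{1-\gamma/2}\,\sigma}{se}$$
We can get an estimate of the power:
$$1 - \beta = 1 - (\beta_1 + \beta_2) \qquad (5)$$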
When , the sample size calculation can be written as follows:
where is defined as the inverse of a Student’s non-central t-distribution.
In eq. (6), the non-central t quantile itself depends on the sample size n, so an iterative method is needed to calculate the sample size. First, we replace the non-central t-distribution quantile with the standard normal quantile to obtain an initial value, and then iterate step by step until the sample size reaches a stable value.
When μ > 0, we first calculate the sample size by eq. (6) to obtain an initial value, and then calculate the power for that sample size by eq. (5). If the estimated power is close enough to the pre-specified power, the initial value is the sample size that we want to estimate. Otherwise, we increase the sample size by one and again judge whether the estimated power is close enough to the pre-specified power. We repeat this procedure until the estimated power is closest to, but not less than, the pre-specified power. Table 1 summarizes reasonable estimates of the sample size obtained with eqs (5) and (6) for various standardized differences (μ/σ), standardized agreement limits (δ/σ), and type II errors (β), assuming that the data are well-behaved.
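The power calculation and sample size search can be sketched in Python. This is a hedged illustration, not the authors' implementation: it assumes the standard error $\sigma\sqrt{1/n + z^2/(2(n-1))}$ for a limit of agreement and non-centrality parameters $(\delta \mp \mu - z\sigma)/se$, and it replaces the iterative scheme described above with a plain linear scan over n. The function names `loa_power` and `loa_sample_size` are hypothetical.

```python
import math

from scipy.stats import nct, norm, t


def loa_power(n, mu, sigma, delta, alpha=0.05, gamma=0.05):
    """Approximate power 1 - (beta1 + beta2) via the non-central t-distribution."""
    mu = abs(mu)                                  # results are symmetric in mu
    z = norm.ppf(1 - gamma / 2)
    se = sigma * math.sqrt(1 / n + z**2 / (2 * (n - 1)))
    t_crit = t.ppf(1 - alpha / 2, n - 1)
    tau1 = (delta - mu - z * sigma) / se          # upper limit of agreement
    tau2 = (delta + mu - z * sigma) / se          # lower limit of agreement
    beta1 = nct.cdf(t_crit, n - 1, tau1)
    beta2 = nct.cdf(t_crit, n - 1, tau2)
    return float(1 - (beta1 + beta2))


def loa_sample_size(mu, sigma, delta, power=0.80, alpha=0.05, gamma=0.05,
                    max_n=100_000):
    """Smallest n in [3, max_n] whose approximate power reaches the target."""
    z = norm.ppf(1 - gamma / 2)
    if delta - abs(mu) - z * sigma <= 0:
        raise ValueError("delta must exceed |mu| + z * sigma")
    for n in range(3, max_n + 1):                 # simple scan, not the paper's iteration
        if loa_power(n, mu, sigma, delta, alpha, gamma) >= power:
            return n
    raise ValueError("target power not reached within max_n")


# Standardized example: mu/sigma = 0, delta/sigma = 2.5, target power 80 %
n = loa_sample_size(mu=0.0, sigma=1.0, delta=2.5, power=0.80)
print(n, round(loa_power(n, 0.0, 1.0, 2.5), 3))
```

A linear scan is slower than starting from a normal-approximation initial value, but it is easy to verify and returns the same kind of answer: the first n at which the estimated power is not less than the pre-specified power.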
Table 1 can be a reference for clinical researchers to estimate the sample size in the agreement assessment trial between two methods of measurement.
Monte-Carlo simulation studies with 10,000 replicates were performed to examine the validity and correctness of the proposed formulae for estimating sample size by calculating empirical powers. Simulation data were generated from normal distributions under parameter settings that represent typical situations. If the achieved power is very close to the pre-specified power, this supports the claim that our formulae give a reasonable sample size.
Simulating the corresponding powers is straightforward. First, we define the mean of the differences μ, the standard deviation of the differences σ, and the pre-defined clinical agreement limit δ. We then estimate the sample size with eqs (5) and (6), generate a sample of that size, and calculate the 95 % confidence interval for the 95 % LOAs. If the confidence limits lie within the pre-defined clinical agreement limits (–δ, δ), we conclude that the two methods agree; otherwise we conclude that they disagree. We repeat this procedure 10,000 times and count the number of replicates in which agreement is concluded. That count divided by 10,000 is the achieved power. Table 2 presents the achieved power for the parameter settings corresponding to Table 1.
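The simulation procedure can be sketched with NumPy. This is an illustrative sketch under stated assumptions: the differences are drawn from a normal distribution, the confidence limits use the standard error $s_d\sqrt{1/n + z^2/(2(n-1))}$, and the function name `simulate_power`, the seed, and the example parameters are hypothetical.

```python
import numpy as np
from scipy.stats import norm, t


def simulate_power(n, mu, sigma, delta, alpha=0.05, gamma=0.05,
                   reps=10_000, seed=0):
    """Fraction of replicates whose LOA confidence limits lie in [-delta, delta]."""
    rng = np.random.default_rng(seed)
    d = rng.normal(mu, sigma, size=(reps, n))   # one row per simulated study
    d_bar = d.mean(axis=1)
    s_d = d.std(axis=1, ddof=1)                 # sample SD of the differences
    z = norm.ppf(1 - gamma / 2)
    t_crit = t.ppf(1 - alpha / 2, n - 1)
    # Distance from d_bar to the outer confidence limit of each LOA
    margin = (z + t_crit * np.sqrt(1 / n + z**2 / (2 * (n - 1)))) * s_d
    agree = (d_bar - margin >= -delta) & (d_bar + margin <= delta)
    return float(agree.mean())


# Standardized example: mu/sigma = 0, delta/sigma = 2.5, n = 100
print(simulate_power(n=100, mu=0.0, sigma=1.0, delta=2.5))
```

With 10,000 replicates the Monte-Carlo standard error of an estimated power near 80 % is roughly 0.4 percentage points, so small differences between the achieved and pre-specified power are expected.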
Table 2 indicates that the achieved powers are generally close to the pre-specified power of 80 % or 90 %. It shows that the formulae give reasonable estimates of the sample size using eqs (5) and (6) for various parameter settings.
Bland has given a sample size calculation for a study of agreement between two methods of measurement, which is available from his website. In the 1986 Lancet paper, Bland and Altman gave a formula for the confidence interval for the 95 % limits of agreement. The standard error of a 95 % limit of agreement is approximately $\sqrt{3s^2/n}$, where s is the standard deviation of the differences between measurements by the two methods and n is the sample size. The 95 % confidence interval is the estimate of the limit plus or minus 1.96 standard errors, that is, plus or minus $1.96\sqrt{3/n}\,s$. Given the desired width of this interval, the sample size can be worked out.
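As a minimal sketch of this precision-based calculation, solving $1.96\sqrt{3/n}\,s \le w$ for n gives $n \ge 3(1.96\,s/w)^2$, where w is the desired half-width of the confidence interval. The function name `bland_precision_n` and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def bland_precision_n(s, half_width, conf_level=0.95):
    """Smallest n with z * sqrt(3 / n) * s <= half_width (no power guarantee)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # about 1.96
    return math.ceil(3 * (z * s / half_width) ** 2)


# Estimating each limit of agreement to within +/- 0.34 standard deviations
print(bland_precision_n(s=1.0, half_width=0.34))
```

This calculation fixes only the expected width of the interval; it does not control the probability that a given study achieves it.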
We set , , , , , and pre-specified power = 80 %. Figure 2 shows the sample sizes and powers of B-A method and new method under different parameter settings. With the Bland–Altman method, the sample size is calculated without considering the power of the statistical procedure, and so the probability of obtaining the required width is only 0.50. With the new method, the achieved power is generally close to the pre-specified power of 80 %.
The number of subjects required in the Bland–Altman method proposed by Bland is determined on the basis of the expected width of a confidence interval. It fails to explicitly consider the probability of achieving the desired interval width and may thus provide sample sizes that are too small to have enough power. However, the new method is more appropriate, because it can ensure an adequate probability of achieving the desired precision.
We show a clinical worked example based on a set of measurements of free prostate specific antigen (FPSA), which is often used to evaluate the presence of prostate cancer and other prostate disorders. The AIA-1800 and I2000 methods were used to measure FPSA. During measurement, the same random sequence of samples was used on the two instruments. From a pilot experiment, the mean and standard deviation of the differences between the AIA-1800 and I2000 methods were 0.001167 mmol/l and 0.001129 mmol/l, respectively. Defining , , , we can calculate that a sample size of 83 would be needed to provide 80 % power to assess agreement between the two methods of measurement. Monte-Carlo simulation gives a corresponding power of 80.51 %, close to the pre-defined power (80 %).
Based on the statistical inference principle and mathematical distribution theory, we have derived sample size formulae for the Bland–Altman method under different parameter settings. For convenience, we have provided a table from which the sample size can easily be looked up for different standardized differences (μ/σ), standardized agreement limits (δ/σ), and type II errors (β) under a two-sided α. Both α and β should be considered so that the sample size is large enough to ensure that the half-width of a 100(1−α)% confidence interval is no larger than a pre-specified width with a pre-specified assurance probability of 100(1−β)%. We carried out Monte-Carlo simulation studies to validate the correctness of the proposed method. The simulation results show that the achieved powers coincide with the pre-determined levels of power, thus validating the correctness of the formulae.
It is important to be aware of the pre-specified clinically acceptable agreement limits. As with equivalence or non-inferiority clinical trials, the clinical agreement limits need to be determined in advance by clinical researchers and biostatistician. Defining these agreement limits may be a difficult aspect in designing the measurement comparison studies, because they depend upon not only the clinical scenario but also other variables. Nevertheless, an attempt must be made to define them; a Delphi survey (opinion from experts) may be used to design the study. This survey is a group facilitation technique, which is an iterative multistage process designed to transform an opinion into group consensus .
Lin et al. discussed some issues about sample size using the tolerance interval; however, their approach has some deficiencies. First of all, in Lin’s study the hypothesis underlying the sample size calculation is formulated only for μ = 0, and the case μ ≠ 0 is not considered. In fact, the two measurements may not be perfectly consistent (μ ≠ 0), but we can still regard them as consistent as long as the population difference lies within a certain acceptable range (δ). In Lin’s paper, it can be seen that the simulated power is consistently less than the pre-specified power for all design specifications, so the sample sizes are considerably under-estimated.
Hahn and Meeker defined a tolerance interval as an interval that one can claim to contain at least a specified proportion, p, of the population with a specified degree of confidence, 100(1–α)%, and provided a sample size estimation for the tolerance interval. Their sample size estimation is based on the desired precision without considering the type II error (β) or power, and addresses only the frequently asked question “How large a sample do I need to obtain a confidence interval?” Although our confidence interval of the LOAs is similar to their tolerance interval, the theories and procedures of sample size estimation are entirely different. Our method of sample size estimation is based not only on the pre-determined level of α but also on β.
Recently, studies of the agreement between two instruments or clinical tests have proliferated in the ophthalmic literature. McAlinden et al. used a method of sample size calculation for agreement studies based on the method proposed by Bland. The sample size was calculated without considering the power of the statistical procedure, so the probability of obtaining the required width was only 0.50. Considering the power in sample size calculations at the study design stage makes it possible to reach the expected conclusions at the predetermined power level. Cesana et al. provided another sample size estimation, namely the sample size required for demonstrating a Pearson correlation coefficient between the differences and the means of the measures, which we consider inappropriate. The correlation coefficient used by Cesana reflects proportional bias, and one of the assumptions of the Bland–Altman method is that there is no proportional bias. If this assumption is not fulfilled, the Bland–Altman method is not applicable.
There are some limitations to this study. Our sample size formulae are appropriate only for well-behaved data. If the data are not well-behaved, for example because of non-normality, non-constant variance of the differences (heteroscedasticity), or proportional bias, the formulae are not suitable for estimating the sample size.
We are grateful for the constructive comments of Dr. Jian-Jun Yang. We also thank the editors and the anonymous reviewers for valuable comments that have helped us significantly improve our manuscript.
This study was funded by a grant from the National Natural Science Foundation of China (No.81473066).
1. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–310. doi:10.1016/j.ijnurstu.2009.10.001
2. Van Noorden R, Maher B, Nuzzo R. The top 100 papers. Nature 2014;514:550–553. doi:10.1038/514550a
3. Bland JM, Altman DG. Agreed statistics measurement method comparison. Anesthesiology 2012;116:182–185. doi:10.1097/ALN.0b013e31823d7784
4. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999;8:135–160. doi:10.1177/096228029900800204
5. Bland JM, Altman DG. Applying the right statistics: analyses of measurement studies. Ultrasound Obst Gyn 2003;22:85–93. doi:10.1002/uog.122
6. Chhapola V, Kanwal SK, Brar R. Reporting standards for Bland–Altman agreement analysis in laboratory research: a cross-sectional survey of current practice. Ann Clin Biochem 2015;52:382–386. doi:10.1177/0004563214553438
7. Hamilton C, Stamey J. Using Bland–Altman to assess agreement between two medical devices-don’t forget the confidence intervals! J Clin Monit Comput 2007;21:331–333. doi:10.1007/s10877-007-9092-x
8. Bell ML, Teixeira-Pinto A, McKenzie JE, Olivier J. A myriad of methods: calculated sample size for two proportions was dependent on the choice of sample size formula and software. J Clin Epidemiol 2014;67:601–605. doi:10.1016/j.jclinepi.2013.10.008
9. Bland JM. How can I decide the sample size for a study of agreement between two methods of measurement? Available at: http://www-users.york.ac.uk/~mb55/meas/sizemeth.htm. Accessed: 15 Aug 2015.
10. Woodman RJ. Bland–Altman beyond the basics: creating confidence with badly behaved data. Clin Exp Pharmacol Physiol 2010;37:141–142. doi:10.1111/j.1440-1681.2009.05320.x
11. Ludbrook J. Confidence in Altman-Bland plots: a critical review of the method of differences. Clin Exp Pharmacol Physiol 2010;37:143–149. doi:10.1111/j.1440-1681.2009.05288.x
12. Stockl D, Cabaleiro DR, Uytfanghe KV, Thienpont LM. Interpreting method comparison studies by use of the Bland–Altman plot: reflecting the importance of sample size by incorporating confidence limits and predefined error limits in the graphic. Clin Chem 2004;50:2216–2218. doi:10.1373/clinchem.2004.036095
13. Julious SA. Sample size for clinical trials with Normal data. Stat Med 2004;23:1921–1986. doi:10.1002/sim.1783
14. Forbes C, Evans M, Hastings N, Peacock B. Statistical distributions, 4th ed. Hoboken: John Wiley & Sons, 2011:187–188. doi:10.1002/9780470627242
15. Zhou YH, Zang JJ, Wu MJ, Xu JF, He J. Allowable total error and limits for erroneous results (ATE/LER) zones for agreement measurement. J Clin Lab Anal 2011;25:83–89. doi:10.1002/jcla.20437
16. Hasson F, Keeney S, McKenna H. Research guidelines for the Delphi survey technique. J Adv Nurs 2000;32:1008–1015. doi:10.1046/j.1365-2648.2000.t01-1-01567.x
17. Lin SC, Whipple DM, Ho CS. Evaluation of statistical equivalence using limits of agreement and associated sample size calculation. Commun Stat Theor Methods 1998;27:1419–1432. doi:10.1080/03610929808832167
18. Hahn GJ, Meeker WQ. Statistical intervals – a guide for practitioners. New York: John Wiley & Sons, 1991:150–167. doi:10.1002/9780470316771.ch9
19. McAlinden C, Khadka J, Pesudovs K. Statistical methods for conducting agreement (comparison of clinical tests) and precision (repeatability or reproducibility) studies in optometry and ophthalmology. Ophthalmic Physiol Opt 2011;31:330–338. doi:10.1111/j.1475-1313.2011.00851.x
20. Cesana BM, Antonelli P. Agreement analysis: further statistical insights. Ophthalmic Physiol Opt 2012;32:436–440. doi:10.1111/j.1475-1313.2012.00916.x
© 2016 Walter de Gruyter GmbH, Berlin/Boston