This tutorial gives an introduction to statistical methods for diagnostic medicine. The validity of a diagnostic test can be assessed using sensitivity and specificity, which are defined for a binary diagnostic test with a known reference or gold standard. As an example we use Procalcitonin with a cut-off value ≥ 0.5 µg/L as a test and the Sepsis-2 criteria as a reference standard for the diagnosis of sepsis. Next, likelihood ratios are introduced, which combine the information given by sensitivity and specificity. For these measures the construction of confidence intervals is demonstrated. Then, we introduce predictive values using Bayes’ theorem. Predictive values are sometimes difficult to communicate. This can be improved using natural frequencies, which are applied to our example. Procalcitonin is actually a continuous biomarker, hence we introduce the use of receiver operating characteristic (ROC) curves and the area under the curve (AUC). Finally, we discuss sample size estimation for diagnostic studies. In order to show how to apply these concepts in practice we explain how to use the freely available software R.
Worldwide, sepsis and its sequelae remain a frequent cause of acute illness and death in patients with community-acquired and nosocomial infections. Sepsis may be seen as a systemic inflammatory response due to infection. However, a gold standard for the proof of infection is missing. Depending on prior antibiotic therapy, bacteremia is found in only approximately 30% of patients with sepsis. Furthermore, early clinical signs of sepsis, such as fever, tachycardia, and leucocytosis, are unspecific and overlap with signs seen in a multitude of systemic inflammatory response syndromes (SIRS) in the absence of infection, especially in surgical patients. Other signs, such as arterial hypotension, thrombocytopenia, or elevated lactate levels, indicate the progression to organ dysfunction too late. Thus, delay in diagnosis and treatment of sepsis causes increased mortality.
In sepsis, numerous humoral and cellular systems are activated, followed by the release of a multitude of mediators and other molecules that mediate the host response to infection. Several potential diagnostic indicators measured in the bloodstream have been evaluated for their clinical ability to assess the diagnosis and severity of sepsis. One of these, the 116 amino acid polypeptide procalcitonin (PCT), is frequently used to identify bacterial infections.
In this tutorial we use Procalcitonin as an example for the use of statistics in diagnostic medicine, based on data from a study by Ljungström et al. (2017). We also show how to use the freely available software R for the necessary calculations.
Sensitivity and specificity of a diagnostic test
Sensitivity and specificity are key parameters when evaluating the validity of a binary diagnostic test. Their calculation requires a reference or gold standard that defines the disease status: D + if sepsis is present and D − otherwise. Table 1 shows the potential outcomes as a 2 × 2 table with the disease status (D) in the columns and the test result (T) in the rows. Establishing such a reference standard is difficult when diagnosing sepsis [6, 7]. For our example we apply the Sepsis-2 criteria as used by Ljungström et al. (2017), i.e. verified bacterial infection and systemic inflammatory response syndrome (SIRS). As the diagnostic test we use PCT: values ≥0.5 µg/L indicate a positive test result (T +) and values <0.5 µg/L a negative test result (T −).
Table 1: Potential outcomes of a binary diagnostic test.
| ||Disease present D +||Disease absent D −|
|Test positive T +||True positive (TP)||False positive (FP)|
|Test negative T −||False negative (FN)||True negative (TN)|
When evaluating a diagnostic test, a population of diseased persons and a population of healthy individuals are considered. Since no test is perfect, a 2 × 2 table is constructed as shown in Table 1, which shows the potential outcomes of a diagnostic test. In a perfect world a diagnostic test would identify all diseased persons as ill. That is, among the diseased we would have only true positives (TP). In our example a true positive result is a patient who suffers from sepsis according to the Sepsis-2 criteria and whose PCT value is ≥ 0.5 µg/L. From Table 1 we can define the sensitivity (sens) of our diagnostic test as $\mathrm{sens} = \frac{TP}{TP + FN}$. This is the probability of identifying diseased persons correctly using a PCT level of at least 0.5 µg/L as the diagnostic test.
Likewise, a non-diseased person should be identified correctly as well. This leads to the specificity (spec) of a diagnostic test, given by $\mathrm{spec} = \frac{TN}{TN + FP}$.
Based on Table 2 we obtain 296 true positives (TP) and a total of 667 diseased persons (TP + FN) according to the Sepsis-2 criteria. Thus, the sensitivity is given as $\mathrm{sens} = 296/667 = 0.44$. Likewise, the specificity is given as the ratio of 641 true negatives (TN) to a total of 870 healthy persons (TN + FP). This leads to a specificity of $\mathrm{spec} = 641/870 = 0.74$.
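Table 2: PCT with cut-off 0.5 µg/L vs. Sepsis-2 criteria.
| ||Sepsis-2 yes (D +)||Sepsis-2 no (D −)||Total|
|PCT ≥0.5 µg/L||296||229||525|
|PCT <0.5 µg/L||371||641||1,012|
|Total||667||870||1,537|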
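The two point estimates can be checked directly from the four cell counts of Table 2. The following is a minimal sketch in base R; the variable names are chosen here for illustration and are not taken from the supplementary script.

```r
# Cell counts from Table 2 (PCT >= 0.5 vs. Sepsis-2 criteria)
TP <- 296; FP <- 229
FN <- 371; TN <- 641

sens <- TP / (TP + FN)  # share of septic patients with a positive test
spec <- TN / (TN + FP)  # share of non-septic patients with a negative test

round(c(sensitivity = sens, specificity = spec), 2)
```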
In order to quantify statistical uncertainty, sensitivity and specificity should be reported together with a confidence interval. Usually a 95% interval is applied. Statistically speaking, sensitivity and specificity are proportions and confidence intervals can be constructed accordingly. The estimate of a proportion p is given as $\hat{p} = k/n$, with k = number of events and n = total number. The corresponding standard error σ is given by
$$\hat{\sigma} = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}$$
Then a 95% confidence interval (CI) is given by
$$\hat{p} \pm 1.96 \times \hat{\sigma}$$
For sensitivity a 95% CI is constructed as follows, with
$$\hat{\sigma}_{\mathrm{sens}} = \sqrt{\frac{\mathrm{sens}\,(1 - \mathrm{sens})}{TP + FN}}$$
For our data we obtain sens = 296/667 = 0.444 and $\hat{\sigma}_{\mathrm{sens}} = \sqrt{0.444 \times 0.556/667} = 0.0192$. The 95% CI is given by $0.444 \pm 1.96 \times 0.0192 = (0.41, 0.48)$. For the specificity equal to 0.74 we obtain a 95% CI (0.71, 0.77). There are several ways to construct a confidence interval for a binomial proportion, with different statistical properties.
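The normal-approximation (Wald) interval described above can be sketched in a few lines of base R. The helper name `wald_ci` is an assumption made for this illustration; it is not a function from any package.

```r
# Wald (normal approximation) confidence interval for a proportion k / n
wald_ci <- function(k, n, conf.level = 0.95) {
  p  <- k / n
  se <- sqrt(p * (1 - p) / n)            # standard error of the proportion
  z  <- qnorm(1 - (1 - conf.level) / 2)  # about 1.96 for a 95% interval
  c(estimate = p, lower = p - z * se, upper = p + z * se)
}

ci_sens <- wald_ci(296, 667)  # sensitivity: TP / (TP + FN)
ci_spec <- wald_ci(641, 870)  # specificity: TN / (TN + FP)
round(rbind(sensitivity = ci_sens, specificity = ci_spec), 2)
```

As one of the alternative constructions, base R's `binom.test(296, 667)$conf.int` returns an exact (Clopper-Pearson) interval, which can differ slightly from the Wald interval.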
In order to create an overall measure of diagnostic performance, likelihood ratios are frequently considered. These have the advantage that they combine the information obtained from sensitivity and specificity. One way to do this is the positive likelihood ratio $\mathrm{LR}^{+} = \frac{\mathrm{sens}}{1 - \mathrm{spec}}$. The LR+ summarizes how many times more likely patients with the disease are to have a positive result than patients without the disease. More formally, it is the proportion of true positives among the diseased divided by the proportion of false positives among the non-diseased. An LR+ > 1 indicates that a positive test result is associated with the presence of the disease. For our data we obtain LR+ = 0.44/(1 − 0.74) = 1.69. What does this mean? According to Jaeschke et al. (1994), an LR+ ≥ 10 would be conclusive. The likelihood ratio of 1.69 observed here thus adds little information.
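Again, the corresponding uncertainty should be addressed using 95% confidence intervals. Formally, the LR+ is a ratio of binomial proportions, for which a confidence interval can be constructed on the scale of the natural logarithm. On the log scale the variance of the LR+ is given by
$$\widehat{\mathrm{Var}}\left(\log \mathrm{LR}^{+}\right) = \frac{1}{TP} - \frac{1}{TP + FN} + \frac{1}{FP} - \frac{1}{FP + TN}$$
For our example we obtain $\widehat{\mathrm{Var}}(\log \mathrm{LR}^{+}) = \frac{1}{296} - \frac{1}{667} + \frac{1}{229} - \frac{1}{870} = 0.0051$.
Then a 95% CI is given by
$$\exp\left(\log \mathrm{LR}^{+} \pm 1.96 \times \sqrt{\widehat{\mathrm{Var}}\left(\log \mathrm{LR}^{+}\right)}\right)$$
For our example a 95% CI is obtained as $\exp(\log 1.69 \pm 1.96 \times 0.0714) = (1.47, 1.94)$.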
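The log-scale interval for the LR+ can be reproduced from the cell counts of Table 2. This is a sketch in base R under the ratio-of-binomial-proportions approach described above; the variable names are illustrative.

```r
# Cell counts from Table 2
TP <- 296; FP <- 229; FN <- 371; TN <- 641

sens   <- TP / (TP + FN)
spec   <- TN / (TN + FP)
lr_pos <- sens / (1 - spec)

# Standard error of log(LR+) for a ratio of two binomial proportions
se_log <- sqrt(1 / TP - 1 / (TP + FN) + 1 / FP - 1 / (FP + TN))

# Build the interval on the log scale, then transform back
ci <- exp(log(lr_pos) + c(-1, 1) * qnorm(0.975) * se_log)
round(c(LR_pos = lr_pos, lower = ci[1], upper = ci[2]), 2)
```

Working on the log scale keeps the interval limits positive and makes the interval asymmetric around the point estimate, as is appropriate for a ratio.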
Positive and negative predictive values
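Sensitivity and specificity are used to describe the validity of a diagnostic test. These quantities can be expressed as conditional probabilities: $\mathrm{sens} = P(T^{+} \mid D^{+})$, that is, the probability that a diseased person will have a positive test result, and $\mathrm{spec} = P(T^{-} \mid D^{-})$, the probability that a healthy person will have a negative test result. From a clinical perspective the important question is whether a positive test indicates a diseased person. This is the positive predictive value $\mathrm{PPV} = P(D^{+} \mid T^{+})$. This probability is obtained using Bayes’ theorem. In order to apply Bayes’ theorem we need to define the prior probability of disease, which is given by the prevalence $\mathrm{prev} = P(D^{+})$.
The PPV can be expressed in terms of sensitivity (sens), specificity (spec) and prevalence (prev):
$$\mathrm{PPV} = \frac{\mathrm{sens} \times \mathrm{prev}}{\mathrm{sens} \times \mathrm{prev} + (1 - \mathrm{spec}) \times (1 - \mathrm{prev})}$$
For our data we obtain a prevalence $\mathrm{prev} = 667/1537 = 0.43$. Plugging this into the formula leads to
$$\mathrm{PPV} = \frac{0.44 \times 0.43}{0.44 \times 0.43 + (1 - 0.74) \times (1 - 0.43)} = \frac{0.189}{0.337} = 0.56$$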
Thus, our diagnostic PCT test does little to improve our prior knowledge, which is given by the prevalence equal to 43%. Performing the test tells us that the probability that the patient suffers from sepsis, given that the PCT value is at least 0.5 µg/L, is 56%.
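As a check on the arithmetic, Bayes' theorem for the PPV can be written as a small R function. The function name `ppv_bayes` and the additional prevalence values are assumptions for illustration only; the counts are those of the 2 × 2 table of the example.

```r
# PPV from sensitivity, specificity and prevalence via Bayes' theorem
ppv_bayes <- function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

sens <- 296 / 667   # sensitivity in the example
spec <- 641 / 870   # specificity in the example
prev <- 667 / 1537  # prevalence in the example

ppv <- ppv_bayes(sens, spec, prev)
round(ppv, 2)

# Same test at other, purely hypothetical prevalences
ppv_other <- ppv_bayes(sens, spec, prev = c(0.05, 0.20, 0.80))
round(ppv_other, 2)
```

With unrounded inputs the result coincides with the direct estimate TP/(TP + FP) = 296/525 from the table, and the second call illustrates that the PPV rises with the prevalence.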
As Figure 1 shows, the PPV depends on the prevalence of disease as well as on sensitivity and specificity, combined as the LR+. The PPV increases with increasing prevalence. Also, a larger positive likelihood ratio leads to higher predictive values.
A Fagan plot uses the pretest (prior) probability together with the LR+ and is a graphical tool for estimating how much the result of a diagnostic test changes the probability that a patient has a disease. The rationale for a Fagan plot is given by a formulation of Bayes’s theorem as follows: posterior ∝ prior × likelihood ratio. Again, this describes how our prior knowledge is improved by applying a diagnostic test and leads to the posterior knowledge after performing the test. In order to create a Fagan nomogram we need to work with odds.
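Odds are defined as $\mathrm{odds} = \frac{p}{1 - p}$, where p is a probability. If we have a fair coin the odds for head are $0.5/0.5 = 1$, meaning that head and tail are equally likely. The prior odds in our example are given by $0.43/0.57 = 0.75$. Formulating Bayes’s theorem in terms of odds leads to $\text{posterior odds} = \text{prior odds} \times \mathrm{LR}^{+}$. In our example, with LR+ ≈ 1.7, we obtain $\text{posterior odds} = 0.75 \times 1.7 = 1.28$.
Odds can be transformed into probabilities using the formula $p = \frac{\mathrm{odds}}{1 + \mathrm{odds}}$. For our data we obtain the $\mathrm{PPV} = \frac{1.28}{1 + 1.28} = 0.56$.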
A Fagan plot works on the log scale. Thus we use
$$\log(\text{posterior odds}) = \log(\text{prior odds}) + \log\left(\mathrm{LR}^{+}\right)$$
Thus, taking logs creates a linear equation. Hence, a Fagan plot consists of a vertical axis on the left with the pretest probability, an axis in the middle representing the LR+, and a vertical axis on the right showing the post-test probability (here the PPV). By drawing a straight line through the pretest probability and the LR+, the post-test probability is obtained. Please note that although the labels on the left and right are written in terms of probability, the tick marks are spaced on the log odds scale. The Fagan plot of our example is shown in Figure 2.
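The odds form of Bayes' theorem is easy to verify numerically. The sketch below uses the rounded inputs from the text (prevalence 0.43, LR+ 1.7); the helper names `prob_to_odds` and `odds_to_prob` are made up for this illustration.

```r
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

prev   <- 0.43  # pretest probability (prevalence)
lr_pos <- 1.7   # positive likelihood ratio, rounded

prior_odds <- prob_to_odds(prev)
post_odds  <- prior_odds * lr_pos  # Bayes' theorem on the odds scale
post_prob  <- odds_to_prob(post_odds)
round(c(prior_odds = prior_odds, post_odds = post_odds, post_prob = post_prob), 2)

# On the log scale the update is additive, which is what the nomogram draws
additive <- isTRUE(all.equal(log(post_odds), log(prior_odds) + log(lr_pos)))
additive
```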
The PPV tells us the probability that a patient with a positive test result really has the disease. The negative predictive value (NPV) is the probability that a person with a negative test result is really healthy. That is, $\mathrm{NPV} = P(D^{-} \mid T^{-}) = \frac{TN}{TN + FN}$. In our example we obtain an NPV = 641/1012 = 0.63.
Communicating predictive probabilities
Looking at Eq. (7), the Bayes formula for the PPV, the use of Bayes’s theorem appears tedious, sometimes hard to understand and difficult to communicate. Thus, Hoffrage and Gigerenzer introduced the concept of natural frequencies. Traditionally medical doctors are told: The prevalence of sepsis is 43%, the sensitivity is 44% and the specificity is 74%. Please tell me the probability that the patient suffers from sepsis. This is error prone, especially if the disease of interest is rare and sensitivity and specificity of the test are high.
Thus, we proceed as follows. Assume that we have 1,000 subjects of whom 430 suffer from sepsis (prevalence 43%). Of these, about 190 will be test positive (sensitivity = 44%). Of the 570 patients without sepsis, 148 will be false positive (specificity = 74%). Then, the PPV equals $190/(190 + 148) = 0.56$. This means that of 100 patients with a positive test 56 suffer from sepsis and 44 are false positive. The application of natural frequencies is also shown in Figure 3.
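The natural-frequency calculation can be sketched as a small base R function. The function `natural_frequencies` is hypothetical and written only for this illustration; because each step is rounded to whole persons, individual counts may differ by a person or two from the figures quoted in the text.

```r
# Translate prevalence, sensitivity and specificity into whole-person counts
natural_frequencies <- function(n, prev, sens, spec) {
  diseased  <- round(n * prev)
  healthy   <- n - diseased
  true_pos  <- round(diseased * sens)
  false_pos <- round(healthy * (1 - spec))
  c(diseased = diseased, healthy = healthy,
    true_pos = true_pos, false_pos = false_pos,
    ppv = true_pos / (true_pos + false_pos))
}

nf <- natural_frequencies(n = 1000, prev = 0.43, sens = 0.44, spec = 0.74)
round(nf, 2)
```

The PPV is then simply the number of true positives divided by all positives, which is usually easier to communicate than the conditional-probability formula.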
Receiver operator curves
Until now we have assumed that we are dealing with a binary diagnostic test. By using a cut-off of 0.5 µg/L we have transformed the continuous marker Procalcitonin into a binary test. Obviously, other cut-off values could be used. For example, we could apply a cut-off value ≥ 2.0 µg/L. Then we would obtain Table 3. This leads to a sensitivity $\mathrm{sens} = 176/667 = 0.26$ and a specificity $\mathrm{spec} = 771/870 = 0.89$. As a result, increasing the cut-off value from 0.5 µg/L to 2.0 µg/L led to a decreased sensitivity and an increased specificity. Looking at descriptive statistics of the Procalcitonin data we observe a median PCT value equal to 0.2 µg/L with a minimum equal to 0.01 µg/L and a maximum of 200 µg/L. Obviously, we could use any value between minimum and maximum as a cut-off value and calculate the corresponding sensitivity and specificity.
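Table 3: PCT with cut-off 2.0 µg/L vs. Sepsis-2 criteria.
| ||Sepsis-2 yes (D +)||Sepsis-2 no (D −)||Total|
|PCT ≥2.0 µg/L||176||99||275|
|PCT <2.0 µg/L||491||771||1,262|
|Total||667||870||1,537|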
This is done when we create a receiver operating characteristic (ROC) curve, which is obtained by calculating the sensitivity and specificity for every observed data value used as a cut-off and plotting sensitivity against 1 − specificity. A test that perfectly discriminates between the two groups would yield a “curve” that coincides with the left and top sides of the plot, since we would not have any false negative (FN) or false positive (FP) values. A useless test would give a straight line from the bottom left corner to the top right. This implies that, at every cut-off, a positive test result is equally likely in diseased and non-diseased persons.
The performance of the test can be assessed by using the area under the receiver operating characteristic curve (AUC). This area may be interpreted as the probability that a random person with the disease has a higher value of the measurement than a random person without the disease. A perfect test would have an AUC = 1 and a useless test has an AUC = 0.5. For our example we obtain an AUC = 0.64 with 95% CI (0.61, 0.67). A good review of the construction of confidence intervals for the AUC is given by Cho et al. (2018). In conclusion, Procalcitonin offers moderate diagnostic discrimination at best, as shown in Figure 4.
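The probabilistic interpretation of the AUC can be illustrated with simulated data. The values below are hypothetical log-normal marker values generated for this sketch, not the study data; the AUC is computed once by comparing all case-control pairs and once from the Mann-Whitney statistic returned by base R's `wilcox.test`.

```r
set.seed(1)
# Hypothetical marker values (NOT the PCT study data)
cases    <- rlnorm(200, meanlog = 0.5)
controls <- rlnorm(300, meanlog = 0)

# Share of case-control pairs in which the case has the higher value
# (ties, if any, count one half)
auc_pairs <- mean(outer(cases, controls, ">") +
                  0.5 * outer(cases, controls, "=="))

# Same quantity from the Mann-Whitney U statistic
u     <- unname(wilcox.test(cases, controls)$statistic)
auc_u <- u / (length(cases) * length(controls))

c(auc_pairs = auc_pairs, auc_u = auc_u)
```

Both routes give the same number, which shows that the AUC is a rescaled rank-sum statistic and does not depend on any particular cut-off.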
However, after having determined that a test provides good discrimination, the best cut-off point for clinical needs has to be chosen. One possible approach is to maximize the sum of sensitivity and specificity. This leads to the so-called Youden index J = sens + spec − 1. Hence, for each possible cut-off value J is calculated and the value which leads to a maximum of J is chosen. For the data at hand we obtain a cut-off value equal to 0.175 µg/L with a sensitivity of 0.65 and a specificity of 0.56.
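To make the Youden criterion concrete, the sketch below compares J for the three cut-offs whose sensitivity and specificity are reported in this tutorial. It only ranks these three candidates; a full search would evaluate every observed PCT value, which requires the raw data.

```r
# Sensitivity and specificity at three PCT cut-offs reported in the text
cutoffs <- data.frame(
  cutoff = c(0.175, 0.5, 2.0),
  sens   = c(0.652, 296 / 667, 176 / 667),
  spec   = c(0.562, 641 / 870, 771 / 870)
)

# Youden index J = sens + spec - 1
cutoffs$J <- cutoffs$sens + cutoffs$spec - 1
best <- cutoffs[which.max(cutoffs$J), ]
best
```

Among these candidates the cut-off of 0.175 has the largest J, in line with the value reported above.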
A purely data-driven approach is not sufficient, because the clinical situation also needs to be taken into account. Schuetz et al. (2019), for example, “refined the established PCT algorithms by incorporating severity of illness and probability of bacterial infection and reducing the fixed cut-offs to only one for mild to moderate and one for severe disease 0.25 µg/L and 0.5 µg/L, respectively”.
Sample size estimation
As in therapeutic clinical trials, sample size estimation should be performed for diagnostic studies. Knottnerus and Muris (2003) present the whole strategy needed for the development of diagnostic tests. This involves the selection of cases and controls and ensuring that a correct reference standard is defined. From a statistical point of view, the allowable Type I and Type II errors and the primary outcome of interest, together with a relevant effect size and its variability, need to be defined in advance. Formulas and tables for the planning of binary tests may be found in the paper by Flahault et al. (2005).
For continuous biomarkers sample size estimation can be based on the AUC of the ROC curve. Let us consider a phase I diagnostic study where we want to determine whether the new diagnostic test has any ability to discriminate diseased patients from healthy controls. Then the null hypothesis is that the AUC equals 0.5 vs. the alternative hypothesis that the AUC is ≠ 0.5. Formulas for sample size estimation may be found in Obuchowski et al. (2004, page 1123, Eqs. (2) and (3)).
Let us assume that our new biomarker performs better than Procalcitonin with an AUC=0.7. We accept a Type I error of 5% (two-sided) and we want a power of 90%. Then we need 41 cases and 41 controls.
Using R for the calculations
The freely available statistical package R may be used to perform the necessary calculations for our example. The package can be obtained at https://cran.r-project.org. A useful integrated software environment is RStudio (https://www.rstudio.com/). With RStudio, R scripts can easily be used to run the respective R commands. Data and an R script for our example are given in the supplementary material.
Importing and manipulating data
The data from our example are read from a csv file exported from Excel and stored as an object named “diag.data”. The command “read.csv2” reads .csv files that use a semicolon as separator and a comma as decimal mark. First comes the name of the file. Next, “header=T” implies that the first line of the file contains the variable names. Finally, “na.strings="."” means that missing values are indicated by “.”.
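Since the name of the data file is not given here, the following self-contained sketch writes a tiny toy file to a temporary location and reads it back with the options described above. The file contents and the object name `toy` are assumptions for illustration; only the column names follow the tutorial.

```r
# Write a small semicolon-separated file with decimal commas (toy values only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Procalcitonin;sepsis2",
             "0,12;No",
             "1,50;Yes",
             ".;Yes"), tmp)

# header = TRUE: first line holds the variable names
# na.strings = ".": a single dot marks a missing value
toy <- read.csv2(tmp, header = TRUE, na.strings = ".")
str(toy)
```

In the tutorial the imported object is called “diag.data”; the same call applies, with the name of the supplementary data file as first argument.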
This object “diag.data” contains the data and can be modified. Here, the data column “Procalcitonin” contains the biomarker values in µg/L. In a first step we create a new binary indicator named “PCT”, which is a new column of our data. This indicator takes the value 1 if the Procalcitonin level is ≥ 0.5 µg/L and 0 otherwise. Then, we attach value labels. First, we declare the variable as a “factor” and then assign the value labels.
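## Apply the widely used cutoff value 0.5 µg/L
diag.data$PCT <- ifelse(diag.data$Procalcitonin >= 0.5, 1, 0)
# Define the variable PCT as a factor
diag.data$PCT <- factor(diag.data$PCT)
# Assign value labels (level "0" first, then level "1")
levels(diag.data$PCT) <- c("<0.5 µg/L", ">=0.5 µg/L")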
The command “attach” provides access to the individual elements (columns) of the data object “diag.data” by name. Now we can perform basic descriptive statistics using the command “summary”. For the variable “Procalcitonin” we obtain
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.010 0.060 0.200 3.376 1.070 200.000
Calculating diagnostic parameters
In the next step we create a 2 × 2 table with our diagnostic test variable “PCT” vs. “sepsis2”. We exclude missing values indicated by “.” and change the ordering of the table and obtain
PCT          Yes   No
>=0.5 µg/L   296  229
<0.5 µg/L    371  641
The package “epiR” is used to calculate sensitivity, specificity etc. First we need to install the package (only once) and then load its functionality using the command “library”. Finally, we submit our table named “table05” to the function “epi.tests”, which calculates sensitivity etc.
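For readers without access to the raw data, the steps just described can be sketched by entering the four counts of the 2 × 2 table by hand. This is a hedged illustration, not the supplementary script: the short dimension labels are chosen here, and the call to `epi.tests` is only executed if the epiR package is installed.

```r
# 2 x 2 table with test-positive row and disease-positive column first
table05 <- as.table(matrix(c(296, 229,
                             371, 641),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(PCT = c(">=0.5", "<0.5"),
                                           sepsis2 = c("Yes", "No"))))
table05

# install.packages("epiR")  # only needed once
if (requireNamespace("epiR", quietly = TRUE)) {
  library(epiR)
  print(epi.tests(table05, conf.level = 0.95))
}
```

The ordering of rows and columns matters, because the function expects the true positives in the upper left cell.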
This gives the (shortened) result
Point estimates and 95% CIs:
True prevalence *              0.43 (0.41, 0.46)
Sensitivity *                  0.44 (0.41, 0.48)
Specificity *                  0.74 (0.71, 0.77)
Positive predictive value *    0.56 (0.52, 0.61)
Negative predictive value *    0.63 (0.60, 0.66)
Positive likelihood ratio      1.69 (1.47, 1.94)
Negative likelihood ratio      0.75 (0.70, 0.82)
Construction of a Fagan plot
A Fagan plot is constructed using the library “TeachingDemos” and submitting a prevalence of 0.43 and LR+ = 1.7 to the function “fagan.plot”.
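A call along the following lines should produce the plot, assuming the TeachingDemos package is installed. The variable names `pre_test` and `lr_pos` are chosen for this sketch, and the call is skipped if the package is not available.

```r
pre_test <- 0.43  # prevalence, used as pretest probability
lr_pos   <- 1.7   # positive likelihood ratio

# install.packages("TeachingDemos")  # only needed once
if (requireNamespace("TeachingDemos", quietly = TRUE)) {
  TeachingDemos::fagan.plot(probs.pre.test = pre_test, LR = lr_pos)
}
```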
The result is shown in Figure 2.
Receiver operator curves
In the next step we use Procalcitonin as a continuous biomarker, construct a ROC curve and calculate the AUC together with a 95% CI using the package pROC.
library(pROC)
# Fit the ROC curve and request a confidence interval for the AUC
cut1 <- roc(sepsis2 ~ Procalcitonin, data = diag.data, percent = FALSE, ci = TRUE)
print(cut1)
Data: Procalcitonin in 870 controls (sepsis No) < 667 cases (sepsis Yes).
Area under the curve: 0.6407
95% CI: 0.613–0.6683 (DeLong)
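In the next step we obtain the optimal cut-off value based on Youden’s index J and plot the ROC curve:
# Cut-off that maximizes Youden's index
coords(cut1, x = "best", input = "threshold", best.method = "youden")
# ROC curve with AUC and confidence interval printed in the plot
plot.roc(sepsis2 ~ Procalcitonin, data = diag.data, print.auc = TRUE, ci = TRUE, percent = FALSE, legacy.axes = TRUE, grid = TRUE)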
The ROC curve is shown in Figure 4 and the cut off value is given by
threshold specificity sensitivity
0.175 0.562069 0.6521739
Sample size estimation
If we want to estimate the necessary sample size for a diagnostic Phase I study assuming an AUC = 0.7, 90% power and two sided significance level of 5% we can use:
power.roc.test(auc = 0.70, sig.level = 0.05, power = 0.90, alternative = "two.sided")
One ROC curve power calculation
ncases = 40.21369
ncontrols = 40.21369
auc = 0.7
sig.level = 0.05
power = 0.9
Rounding up to 41 cases and 41 controls, we would hence include 82 subjects in our study.
I would like to thank Dr. Ljungström and colleagues for permission to use the data from their study as an example for this article.
Research funding: None declared.
Author contributions: Single author statement.
Competing interests: Author states no conflict of interest.
Informed consent: Not applicable.
Ethical approval: Not applicable.
1. Fleischmann, C, Scherag, A, Adhikari, NKJ, Hartog, CS, Tsaganos, T, Schlattmann, P, et al. Assessment of global incidence and mortality of hospital-treated sepsis. Current estimates and limitations. Am J Respir Crit Care Med 2016;193:259–72, https://doi.org/10.1164/rccm.201504-0781oc.
2. Wacker, C, Prkno, A, Brunkhorst, FM, Schlattmann, P. Procalcitonin as a diagnostic marker for sepsis: a systematic review and meta-analysis. Lancet Infect Dis 2013;13:426–35, https://doi.org/10.1016/s1473-3099(12)70323-7.
3. Ljungström, L, Pernestig, AK, Jacobsson, G, Andersson, R, Usener, B, Tilevik, D. Diagnostic accuracy of procalcitonin, neutrophil-lymphocyte count ratio, C-reactive protein, and lactate in patients with suspected bacterial sepsis. PLoS One 2017;12:e0181704, https://doi.org/10.1371/journal.pone.0181704.
5. Altman, DG, Bland, JM. Statistics Notes: diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552, https://doi.org/10.1136/bmj.308.6943.1552.
6. Levy, MM, Fink, MP, Marshall, JC, Abraham, E, Angus, D, Cook, D, et al. 2001 SCCM/ESICM/ACCP/ATS/SIS international sepsis definitions conference. Crit Care Med 2003;31:1250–6, https://doi.org/10.1097/01.CCM.0000050454.01978.3B.
7. Singer, M, Deutschman, CS, Seymour, CW, Shankar-Hari, M, Annane, D, Bauer, M, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016;315:801, https://doi.org/10.1001/jama.2016.0287.
10. Jaeschke, R. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA: J Am Med Assoc 1994;271:703–7, https://doi.org/10.1001/jama.271.9.703.
13. Gigerenzer, G, Hoffrage, U. How to improve Bayesian reasoning without instruction – frequency formats. Psychol Rev 1995;102:684–704, https://doi.org/10.1037/0033-295x.102.4.684.
16. Cho, H, Matthews, GJ, Harel, O. Confidence intervals for the area under the receiver operating characteristic curve in the presence of ignorable missing data. Int Stat Rev 2018;87:152–77, https://doi.org/10.1111/insr.12277.
17. Schuetz, P, Beishuizen, A, Broyles, M, Ferrer, R, Gavazzi, G, Gluck, EH, et al. Procalcitonin (PCT)-guided antibiotic stewardship: an international experts consensus on optimized clinical use. Clin Chem Lab Med 2019;57:1308–18, https://doi.org/10.1515/cclm-2018-1181.
18. Knottnerus, JA, Muris, JW. Assessment of the accuracy of diagnostic tests: the cross-sectional study. J Clin Epidemiol 2003;56:1118–28, https://doi.org/10.1016/s0895-4356(03)00206-3.
19. Flahault, A, Cadilhac, M, Thomas, G. Sample size calculation should be performed for design accuracy in diagnostic test studies. J Clin Epidemiol 2005;58:859–62, https://doi.org/10.1016/j.jclinepi.2004.12.009.
20. Obuchowski, NA, Lieber, ML, Wians, FH. ROC curves in clinical chemistry: uses, misuses, and possible solutions. Clin Chem 2004;50:1118–25, https://doi.org/10.1373/clinchem.2004.031823.
21. Robin, X, Turck, N, Hainard, A, Tiberti, N, Lisacek, F, Sanchez, JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf 2011;12:77, https://doi.org/10.1186/1471-2105-12-77.
© 2022 Walter de Gruyter GmbH, Berlin/Boston