Report from the HarmoSter study: impact of calibration on comparability of LC-MS/MS measurement of circulating cortisol, 17OH-progesterone and aldosterone

Objectives: Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is recommended for measuring circulating steroids. However, assays display technical heterogeneity. So far, reproducibility of corticosteroid LC-MS/MS measurements has received scant attention. The aim of the study was to compare LC-MS/MS measurements of cortisol, 17OH-progesterone and aldosterone from nine European centers and assess performance according to external quality assessment (EQA) materials and calibration. Methods: Seventy-eight patient samples, EQA materials and two commercial calibration sets were measured twice by laboratory-specific procedures. Results were obtained by in-house (CAL1) and external calibrations (CAL2 and CAL3). We evaluated intra- and inter-laboratory imprecision, correlation and agreement in patient samples, and trueness, bias and commutability in EQA materials. Results: Using CAL1, intra-laboratory CVs ranged from 2.8 to 7.4%, 4.4 to 18.0% and 5.2 to 22.2% for cortisol, 17OH-progesterone and aldosterone, respectively. Trueness and bias in EQA materials were mostly acceptable. Median inter-laboratory CVs for cortisol, 17OH-progesterone and aldosterone were 4.9, 11.8 and 13.8% with CAL1 and 3.6, 10.3 and 8.6% with CAL3 (all p<0.001), respectively. Using CAL1, median bias vs. all laboratory-medians ranged from −6.6 to 6.9%, −17.2 to 7.8% and −12.0 to 16.8% for cortisol, 17OH-progesterone and aldosterone, respectively. Regression lines significantly deviated from the best fit for most laboratories. Using CAL3 improved cortisol and 17OH-progesterone between-method bias and correlation. Conclusions: Intra-laboratory imprecision and performance with EQA materials were variable. Inter-laboratory performance was mostly within specifications. Although residual variability persists, adopting common traceable calibrators and RMP-determined EQA materials is beneficial for standardization of LC-MS/MS steroid measurements.

Due to the abundance of isobars within the steroid family, chromatographic separation is critical for LC-MS/MS specificity. Using stable isotope-labeled analytes as internal standards (IS) minimizes procedural variability and matrix interference. However, IS with different isotopes and substitutions may influence measurement accuracy [15-17]. Therefore, LC-MS/MS methods may display different susceptibilities to isobaric and matrix-related interferences. Whether performance of LC-MS/MS methods differs according to serum or plasma sample matrix and associated anticoagulants or coagulation supports has received limited examination [18].
Calibration is a major determinant of accuracy. Solvent-based certified reference materials (CRMs) are currently available for most steroids. However, to prepare calibrators, CRMs need to be diluted in solvents or surrogate matrices, introducing procedural and matrix variability. Adopting common calibrators was found to improve inter-laboratory performance in some studies [7,13] but not in others [9].
External quality assessment (EQA) programs are crucial tools for harmonization [19]. EQA materials differ in the assignment of target values, including reference measurement procedures (RMPs) or mean/median of survey results, and experimental evidence on their commutability by LC-MS/MS remains scarce [20].
The extent to which the aforementioned factors impact LC-MS/MS results and reproducibility can be significant. International initiatives promoting harmonization and traceability, and databases of RMPs and CRMs are now available [21,22]. Based on these, standardization of clinically relevant steroids by LC-MS/MS appears to be a realistic goal.
The high quality of the assays is a prerequisite for achieving harmonization of laboratory tests [23]. However, even when methods are validated by recommended guidelines, actual performance is hardly inferable from publications. Consequently, there is increasing attention toward applying strict validation requirements and performance reporting protocols [24,25]. Moreover, aiming at easing the harmonization process, the incoming EU in vitro diagnostic regulation (IVDR) [26] states that commercial tests with appropriate analytical and clinical performance should be preferred to equivalent laboratory developed tests (LDTs). However, comparing LDTs and commercial kits based on the reported performance can be cumbersome [25].
The HarmoSter initiative aims to evaluate the harmonization status of LC-MS/MS measurement of ten circulating steroids (cortisol, 17OH-progesterone, aldosterone, dehydroepiandrosterone, dehydroepiandrosterone-sulfate, androstenedione, testosterone, corticosterone, 11-deoxycortisol and cortisone) by nine European centers (Supplemental Table 1). Authentic samples collected by three different vacuum tubes, EQA materials and commercial calibrators were tested. The present work focuses on the impact of calibration on intra- and inter-laboratory variability for three of the ten steroids of the HarmoSter initiative: cortisol, 17OH-progesterone and aldosterone.

Consortium and methods
The Bologna ethics committee approved the HarmoSter study (n°141/2017/U/Tess). Laboratory A coordinated the study, recruited patients and collected samples. Laboratories B to L were measuring centers: all measured cortisol and 17OH-progesterone; five also measured aldosterone (Supplemental Table 1 and Table 1). Eleven LDTs (Laboratories B to I) and two panel MassChrom® kit (Chromsystems; Munich, Germany; https://chromsystems.com) (Laboratory L) were involved. Among LDTs, the 6PLUS1® Multilevel Serum Calibrator set (Chromsystems) was used for in-house calibration for cortisol and aldosterone by Laboratories D and E, and for 17OH-progesterone by Laboratories D, E and F. Technical details and in-house measurement ranges are shown in Table 1 and Table 2.
EQA materials were sent to Laboratory A and stored according to manufacturers' specifications. The Reference Institute for Bioanalytics (RfB; Bonn, Germany; www.rfb.bio) donated four materials (HM40121, HM40122, HM40123 and HM40124; lyophilized human recalcified plasma spiked with steroids and no preservatives) with RMP-assigned target values for cortisol, 17OH-progesterone and aldosterone. The United Kingdom National External Quality Assessment Service (UKNEQAS; Birmingham, UK; https://ukneqas.org.uk) donated eight liquid materials (off-the-clot minimally manipulated human serum), four with cortisol (C568 and C532), 17OH-progesterone (H408, spiked with 17OH-progesterone) and aldosterone (L125, spiked with aldosterone) target values determined as means of all MS-based methods in the survey. Instand e.V. (Düsseldorf, Germany; https://www.instandev.de) donated three low/high concentration paired materials (human serum spiked with steroids and no additives), two with target values assigned by RMP for cortisol (N°302, liquid), and as means of all methods in the survey for aldosterone (N°304, lyophilic). None of the EQA materials were directly tested for commutability.
The Biological Sales Network (BSN Srl, Castelleone, Italy; https://www.bsn-srl.it) donated seven-level serum-based liquid calibrators for a ten-steroid panel (EUM01041, lot.M01411808VEQ) traceable to the Royal College of Pathologists of Australasia Quality Assurance Programs (RCPAQAP). Target values of RCPAQAP materials were determined by LC-MS/MS by the National Measurement Institute Australia (traceable to CRM-6007a) for cortisol, and by all laboratory medians for 17OH-progesterone and aldosterone. Freshly prepared calibrators were delivered to Laboratory A, immediately aliquoted and stored at −80°C. Chromsystems donated serum-based lyophilic calibrators for a fifteen-steroid panel (6PLUS1® Multilevel Serum Calibrator set, lot.5016, different from lots used for in-house calibration by Laboratories D, E, F and L). Cortisol was traceable to NIST SRM-971 frozen human serum; 17OH-progesterone and aldosterone were traceable to CRMs in methanol from an ISO 17025 and 17034 certified supplier. At Laboratory A, calibrators of each level were reconstituted according to manufacturer's instructions, mixed together, aliquoted and stored at −80°C. BSN and Chromsystems measurement ranges were 11.92-3,182 and 25.6-806 nmol/L for cortisol, 0.61-166.2 and 0.27-43.9 nmol/L for 17OH-progesterone, and 0.11-25.68 and 0.075-12.5 nmol/L for aldosterone, respectively.

Running scheme and quantitation
Two aliquots from 110 samples were shipped to measuring centers on the same day and stored at −80°C. EQA materials were stored and handled according to manufacturers' indications. All were measured within 4 months by two identical runs, each including 110 singlets and an independent in-house calibration set, according to protocols ordinarily used by each laboratory. Results by in-house calibration (CAL1) were sent to Laboratory A before measuring centers received external calibrators' nominal values. They were then asked to use BSN (CAL2) and Chromsystems (CAL3) sets for re-quantification of results. External calibrators were included in the curve calculation if their nominal concentration was within each method's range. Calibration curves displayed R² >0.98.
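The re-quantification step can be illustrated with a minimal sketch. It assumes an unweighted linear calibration of analyte/IS response ratio against nominal concentration; the participating laboratories may have used weighted or non-linear fits, and the function names and numbers below are hypothetical, not taken from the study.

```python
from statistics import mean

def fit_line(x, y):
    """Unweighted least-squares fit y = slope * x + intercept; also returns R^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

def requantify(calibrators, method_range, sample_ratios):
    """Re-quantify samples against an external calibrator set.

    calibrators: list of (nominal concentration, response ratio) pairs.
    method_range: (low, high) measurement range of the laboratory method.
    Only calibrators whose nominal value lies within the method range
    enter the curve, mirroring the inclusion rule described above.
    """
    low, high = method_range
    used = [(c, r) for c, r in calibrators if low <= c <= high]
    slope, intercept, r2 = fit_line([c for c, _ in used], [r for _, r in used])
    concentrations = [(r - intercept) / slope for r in sample_ratios]
    return concentrations, r2

# Hypothetical calibrator set; the highest level lies outside the method range.
cal = [(25.0, 0.25), (100.0, 1.0), (400.0, 4.0), (800.0, 8.0), (3182.0, 40.0)]
conc, r2 = requantify(cal, (10.0, 1000.0), [5.0])
```

In this toy case the out-of-range calibrator is dropped before fitting, so it cannot distort the curve within the working range.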
Intra-laboratory performance: Within-method imprecision was calculated in authentic samples as the intra-laboratory CV of duplicates (a and b) relative to the duplicate mean, and compared with the maximum allowable imprecision (MAI). Within-method impact of calibration was evaluated by the Friedman test and Passing-Bablok regression. Within-method trueness and bias were estimated in EQA materials as % difference from target values assigned by RMP or by mean/median of the surveys, respectively, and were compared with the maximum allowable bias (MAB). Least-squares regression lines were calculated for authentic sample CAL1 results from each laboratory vs. the all-laboratory medians; 95% prediction intervals were used to test EQA material commutability [29].
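As an illustration of these intra-laboratory metrics, the sketch below computes a duplicate CV and a percent bias against an EQA target. The study's exact CV formula is not reproduced here: the per-pair CV uses the standard deviation of two values over their mean, and pooling across samples by root mean square is an assumption. All names and numbers are hypothetical.

```python
from math import sqrt

def duplicate_cv_percent(a, b):
    """CV (%) of one duplicate pair: SD of the two values over their mean."""
    sd = abs(a - b) / sqrt(2)  # sample SD of exactly two values
    return 100 * sd / ((a + b) / 2)

def pooled_cv_percent(pairs):
    """Root-mean-square of per-pair CVs (assumed pooling rule, not the study's)."""
    cvs = [duplicate_cv_percent(a, b) for a, b in pairs]
    return sqrt(sum(cv ** 2 for cv in cvs) / len(cvs))

def percent_bias(measured, target):
    """Percent difference of a measured value from an EQA target value."""
    return 100 * (measured - target) / target

# Hypothetical duplicate results and one EQA measurement.
lab_cv = pooled_cv_percent([(100.0, 100.0), (90.0, 110.0)])
bias = percent_bias(110.0, 100.0)
within_mab = abs(bias) <= 13.5  # 13.5% is the cortisol MAB given in the Figure 1 legend
```

The same `percent_bias` function covers both trueness (RMP target) and bias (survey mean/median target); only the origin of `target` differs.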
Inter-laboratory performance: Analyses were performed in authentic samples measured with duplicate-CV <30% and within both CAL1 and CAL3 ranges (Supplemental Tables 2-4). Between-method reproducibility, evaluated as the inter-laboratory CV, was compared with the MAI. Between-method regression was assessed by Passing-Bablok analysis. Between-method agreement was evaluated as %-bias vs. the all-methods median and by Bland-Altman analysis; results were compared with the total allowable error (TAE). Wilcoxon and F tests were used to compare CAL1 and CAL3 inter-laboratory CV and bias. Statistics were performed with SPSS (v.20, IBM Co., Somers, NY) and MedCalc (v.18.2.1; Mariakerke, Belgium).
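The between-method metrics can be sketched as follows. This is an illustrative outline only: the study used SPSS and MedCalc, the 1.96 multiplier for Bland-Altman limits is the conventional choice rather than a documented setting, and the laboratory labels and values are hypothetical.

```python
from statistics import mean, median, stdev

def interlab_cv_percent(values):
    """Inter-laboratory CV (%) for one sample measured by several laboratories."""
    return 100 * stdev(values) / mean(values)

def bias_vs_median_percent(results):
    """Percent bias of each laboratory vs. the median of all laboratories.

    results: dict mapping laboratory label to its result for one sample.
    """
    ref = median(results.values())
    return {lab: 100 * (value - ref) / ref for lab, value in results.items()}

def bland_altman_limits(differences):
    """Mean difference with conventional 95% limits of agreement (mean ± 1.96 SD)."""
    d, s = mean(differences), stdev(differences)
    return d - 1.96 * s, d, d + 1.96 * s

# Hypothetical results for one sample from three laboratories.
sample = {"B": 95.0, "C": 100.0, "D": 110.0}
cv = interlab_cv_percent(list(sample.values()))
bias = bias_vs_median_percent(sample)
lower, centre, upper = bland_altman_limits([-5.0, 0.0, 5.0])
```

Using the median of all laboratories as the reference makes the comparison robust to a single outlying laboratory, at the cost of not being anchored to a reference measurement procedure.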

Cortisol
Authentic sample results ranged from 221 to 994 nmol/L (median of all laboratories by CAL1). Three to nine samples were above CAL1 ULOQ in three laboratories. The highest CAL2 calibrator was above the in-house measurement range of all laboratories. Conversely, three samples were above the CAL3 measurement range (Supplemental Table 2). The intra-laboratory CVs ranged from 2.8 to 7.4%, all below the MAI (Table 2). Calibration influenced the measures within all laboratories (p<0.001). Laboratory D results were largely lowered when using CAL2 (−40.2%) and CAL3 (−31.8%) compared to CAL1. For other laboratories, when compared with CAL1, CAL2 determined −18.8 to −9.7% deviation, while CAL3 yielded modest deviations (Supplemental Table 5). Comparing CAL1 vs. CAL3 by Passing-Bablok in the laboratories using Chromsystems calibrators for in-house calibration confirmed the large proportional overestimation of the former in Laboratory D and the consistency of results in Laboratories E and L (Supplemental Figure 1).
EQA material target values ranged from 265.0 to 948.5 nmol/L; two were above CAL1 and CAL3 measurement ranges in Laboratories D, E and L (Supplemental Table 2). Trueness and bias by CAL1 and CAL3 were mostly within MAB, except Laboratory D, displaying a large positive bias with CAL1. CAL2 determined a negative deviation, with several cases exceeding the MAB, especially for Laboratories D and F (Figure 1). Commutability was demonstrated except for some high level EQA materials slightly outside the interval for three laboratories (Supplemental Figure 2).
Given the large deviation shown by Laboratory D, inter-laboratory analyses were performed both with and without data from that laboratory, the latter reported as follows. Median inter-laboratory CV by CAL1 was 4.9%, and it was significantly reduced to 3.6% with CAL3 (p<0.001). No cases were detected with CV >MAI (Figure 2 and Supplemental Table 6).
At Passing-Bablok analysis, slopes were similar to 1 in two laboratories with CAL1 and in six with CAL3 (Table 3).
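For readers unfamiliar with how such slopes are obtained, the following is a minimal sketch of the Passing-Bablok point estimates (shifted median of pairwise slopes). It omits the confidence intervals that are needed to judge whether a slope differs from 1, it is not the MedCalc implementation used in the study, and the data are hypothetical.

```python
from statistics import median

def passing_bablok(x, y):
    """Point estimates (slope, intercept) of Passing-Bablok regression."""
    n = len(x)
    slopes = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx, dy = x[j] - x[i], y[j] - y[i]
            if dx == 0 and dy == 0:
                continue  # identical points: slope undefined, skipped
            if dx == 0:
                # vertical pair: treated as an infinite slope with the sign of dy
                slopes.append(float("inf") if dy > 0 else float("-inf"))
                continue
            s = dy / dx
            if s != -1:  # slopes of exactly -1 are discarded
                slopes.append(s)
    slopes.sort()
    m = len(slopes)
    k = sum(1 for s in slopes if s < -1)  # offset that shifts the median
    if m % 2:
        slope = slopes[(m + 1) // 2 + k - 1]
    else:
        slope = 0.5 * (slopes[m // 2 + k - 1] + slopes[m // 2 + k])
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept

# Hypothetical comparison: laboratory results vs. the all-laboratory median.
reference = [200.0, 350.0, 500.0, 700.0, 950.0]
laboratory = [1.1 * v for v in reference]  # purely proportional 10% overestimation
slope, intercept = passing_bablok(reference, laboratory)
```

A slope close to 1 with an intercept close to 0 indicates neither proportional nor constant deviation from the comparator; in the toy data the slope is about 1.1, reflecting a proportional deviation.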
Median bias ranged from −6.6 to 6.9% for CAL1 and −2.3 to 5.5% for CAL3, with all results within the TAE (Figure 3). Compared to CAL1, CAL3 significantly reduced the bias median within six and variance within three laboratories (Supplemental Table 7). Bland-Altman analyses per laboratory are shown in Supplemental Figures 3 and 4.

17OH-Progesterone
Authentic sample results ranged from 0.34 to 6.63 nmol/L (median of all laboratories by CAL1). Values were below the CAL2 measurement range in 33 authentic samples (Supplemental Table 3). The intra-laboratory CV ranged from 4.4 to 18.0%, exceeding the MAI in Laboratories H and L (Table 2). Within-laboratory deviations between calibrations are reported in Supplemental Table 5. CAL1 vs. CAL3 Passing-Bablok analysis in the laboratories using Chromsystems calibrators for in-house calibration detected small deviations in Laboratories D and E, but substantial consistency in Laboratories F and L (Supplemental Figure 1).
EQA material target values ranged from 2.41 to 12.73 nmol/L. Trueness and bias by CAL1 and CAL3 were mostly within the MAB, except for Laboratory G (−22.2 to −12.3%), and Laboratory H, showing higher values at lower levels (−25.8 to 109.1%). CAL2 increased the negative biases, with several cases exceeding the MAB (Figure 1). A slight deviation from the commutability interval was observed in high level EQA materials in two laboratories. Moreover, HM materials lacked commutability in Laboratory H (Supplemental Figure 2).
The median inter-laboratory CV of 11.8% observed with CAL1 decreased to 10.3% with CAL3 (p<0.001), while cases with inter-laboratory CV >MAI were 33.3 and 7.7%, respectively. Notably, CAL1 CVs increased at levels <1 nmol/L (Figure 2 and Supplemental Table 6).
Median bias vs. median of all laboratories ranged from −17.2 to 7.8% for CAL1 and −20.9 to 10.3% for CAL3 (Figure 3). With both calibrations, Laboratory G showed the largest negative median and Laboratory H the largest variance of bias. Compared to CAL1, CAL3 significantly reduced the bias median within four and variance within three laboratories (Supplemental Table 7). Moreover, CAL3 reduced the opposite systematic bias shown by Laboratories D and H at decreasing concentrations with CAL1 (Supplemental Figure 5), almost eliminating cases exceeding the TAE (Figure 3).

Figure 1 (caption fragment): Bias vs. target values determined as mean/median of the EQA survey was evaluated in materials C568 and C532 for cortisol, H408 for 17OH-progesterone, N°304 and L125 for aldosterone. Lines: zero ± maximum allowable bias (cortisol: 13.5%, 17OH-progesterone: 12.0% and aldosterone: 12.6%). Black dots: in-house calibration (CAL1); gray dots: BSN calibration (CAL2); white dots: Chromsystems calibration (CAL3).

Aldosterone
Authentic sample results ranged from 0.07 to 1.22 nmol/L (median of all laboratories by CAL1). Values were below the CAL2 measurement range in 32 authentic samples for most of the laboratories (Supplemental Table 4). The intra-laboratory CV ranged from 5.2 to 22.2%, with Laboratory L exceeding the MAI (Table 2). Duplicate-CV increased at values <0.2 nmol/L. Calibration influenced results within all laboratories (p<0.001). Compared to CAL1 values, the deviation was −13.1 to 25.2% for CAL2 and −10.0 to 22.1% for CAL3 (Supplemental Table 5). CAL1 vs. CAL3 Passing-Bablok analysis in the laboratories using Chromsystems calibrators for in-house calibration detected deviations in Laboratories D and E, but consistent results in Laboratory L (Supplemental Figure 1).
EQA material target values ranged from 0.21 to 1.33 nmol/L. A large negative bias was shown for N°304 low (0.21 nmol/L) by all laboratories and calibration sets (−48.6 to −31.6%, or below measurement range). Trueness and bias by CAL1 mostly exceeded the MAB in Laboratories D, G and H. External calibrations improved trueness and bias in Laboratory G only (Figure 1). A slight deviation from the 95% commutability interval was noted in high level EQA materials in three laboratories (Supplemental Figure 2).
Using CAL1, median bias vs. median of all methods ranged from −12.0 to 16.8%, with almost all cases within the TAE. CAL3 significantly reduced bias median, ranging from −8.8 to 3.4%, in three laboratories and variance in one (Figure 3, Supplemental Table 7 and Supplemental Figure 6).

Discussion
This is the first study examining the variability among LC-MS/MS measurements for circulating cortisol, 17OH-progesterone and aldosterone. Methods were validated according to recommended guidelines, most were published [2,12,30-37], and exhibited considerable technical heterogeneity. Within- and between-laboratory performances were interpreted by means of allowable imprecision, bias and total error, which, however, are still derived from immunoassay-based biological variability studies [27]. Intra-method imprecision for cortisol was within MAI for all, but unsatisfactory for 17OH-progesterone and aldosterone in some laboratories, typically worsening at lower concentrations. Trueness and bias in EQA materials for aldosterone were unsatisfactory in three laboratories, but mostly acceptable for cortisol and 17OH-progesterone. Exceptions were Laboratory D for cortisol, due to a miscalibration, and, for 17OH-progesterone, Laboratory G, showing a constant bias, and Laboratory H, showing a severe commutability problem in HM materials. Hence, the study allowed the discovery and correction of some laboratory-specific defects, such as in Laboratory D, where the incorrect dilution of newly utilized commercial calibrators in place of previously used in-house calibrators caused large cortisol overestimation. These data underline the importance of promoting uniform protocols for method validation, performance reporting and monitoring beyond the validation stage [24,25,38]. Findings also imply that LC-MS/MS is not immune from commutability issues. Testing EQA materials from different providers could assist identification of potential commutability problems. Moreover, our data on material N°304 low suggest that assigning target values as all-method means/medians may confound method evaluation. Therefore, application of RMPs should be encouraged among EQA providers [22].
Current guidelines for clinical LC-MS/MS assays recommend that calibrators be prepared in the same matrix as the samples to be tested [39]. Some laboratories used steroid-depleted serum/plasma processed by charcoal stripping. Others used surrogate matrices (bovine serum albumin or phosphate buffer solutions) or solvents. Although the adequacy of these alternative matrices in comparison with the native matrix has been tested by each participant, we cannot exclude that in-house calibrator commutability contributes to the overall inter-laboratory variability.
Despite the aforementioned heterogeneities, between-method performance was substantially within specifications, indicating impressive consistency of the LC-MS/MS methods under investigation. When evaluating the commercial calibrators, we found that the highest BSN cortisol calibrator was above all the in-house measurement ranges, while the highest Chromsystems calibrator was too low, preventing the measurement in one woman taking estroprogestin and in two EQA materials. Moreover, BSN calibration did not cover the low-physiologic concentrations of 17OH-progesterone and aldosterone. Therefore, commercial calibrators need to be adapted to assay features and to reporting purposes of laboratories. Compared to in-house calibration, BSN calibration seemed to worsen the methods' trueness for cortisol and 17OH-progesterone. This may derive from BSN calibrators being traceable to EQA programs, such as RCPAQAP, and not to reference materials. In contrast, Chromsystems calibrators are traceable to NIST and CRMs, and showed a good overlap with most of the in-house calibrations. Unifying calibration using the Chromsystems set improved between-laboratory performance, particularly reducing the systematic bias at low 17OH-progesterone levels. Laboratory L, the only laboratory using a commercial kit, exhibited high intra-laboratory imprecision. A possible reason was the need for more frequent cleaning of the ionization source. Therefore, laboratory experience in instrument maintenance and method monitoring is recommended even when using commercial kits. Four laboratories used Chromsystems calibrators for in-house calibration. Apart from the procedural problem of Laboratory D, modest deviations between in-house and external calibrations were found in some cases, which may be procedural or due to lot-to-lot variability.
Our study supports the use of common calibrators for improving the harmonization of measurements. However, we showed that commercial calibrators or kits are not necessarily superior to in-house procedures and do not guarantee adequate performance per se. Conversely, setting more stringent requirements and standardized reporting protocols for analytical performance should be encouraged for improving research quality and clinical effectiveness. This is particularly relevant in view of the EU-IVDR, effective in May 2022, promoting commercial devices over equivalent LDTs [26].
Non-negligible residual between-laboratory variability was observed when unifying calibration, implying that other components have to be considered. Monitoring the qualifier ions is often ineffective in detecting interferences among isobars. Therefore, LC resolution is critical in ensuring specificity. Various 17OH-progesterone isobars circulate at relevant levels [14]. All methods in this study achieved baseline separation of 17OH-progesterone from 11-deoxycorticosterone. 17OH-progesterone separation from 16OH-progesterone has been verified by Laboratories B, C and L, and from 11OH-progesterone by Laboratories B, I and L. Interference from these or other isobars may explain the overestimation in HM materials by Laboratory H.
Although multiple isotopes were used as IS, laboratories using the same IS did not show a better between-laboratory performance. Nonetheless, LC and IS contributions need to be directly addressed in purposely designed studies. Notably, Loh et al. recently reported that deuterium and 13C-based IS provided similar 17OH-progesterone results by three LC-MS/MS methods [15].
Our design included many laboratories and duplicate measurements of individual samples. Moreover, three different tubes were used, whose impact on measurements will be investigated in a dedicated study. While these aspects reinforced the robustness of findings, they required large blood volumes, which are not easily obtained from patients. Consequently, we recruited a suboptimal number of 26 volunteers [40] mostly showing normal steroid concentrations. Further studies are needed to assess harmonization at high or low concentrations typical of hypercortisolism, hyperaldosteronism, adrenal insufficiency or functional tests. Nevertheless, reproducible measurement of physiological levels is relevant for therapeutic monitoring and for emerging multi-analyte approaches to diagnosis, subtyping and risk stratification of complex conditions, such as female hyperandrogenism, endocrine tumors and non-communicable diseases [41]. Finally, our study was designed as a ring trial. Studies including target values achieved by RMPs or GC-MS methods are needed to address standardization, which is mandatory for establishing laboratory-independent reference intervals and decision limits.
Our study described pitfalls in method performance and EQA materials, as well as advantages and limitations of calibration materials. Unifying the calibration could significantly reduce between-laboratory variability, while adopting traceable calibrators and RMP-determined EQA materials could ease standardization. Residual disagreement requires investigation. Evidence generated by our study supports LC-MS/MS utility for achieving steroid harmonization.
contributed to the study design; MC, MM, JML, MP, JMH, SB, MTA, JVDO and DK carried out sample measurement and data exports; AT coordinated subject recruitment and sample management; EN performed the statistical analysis; UP conceived the study; MC, MP, JMH, SB, PAB, MTA, ACH, JVDO, FM, MR, GE, BGK, MV and UP contributed in result interpretation and in writing the manuscript. Competing interests: Authors state no conflict of interest. Informed consent: Informed consent was obtained from all individuals included in this study. Ethical approval: The local Institutional Review Board deemed the study exempt from review. The Bologna Ethics Committee approved the study n°141/2017/U/Tess. All volunteers participating in the study signed their informed consent.