Performance evaluation of internal quality control rules, EWMA, CUSUM, and the novel machine learning model

Objectives: The present study set out to build a machine learning model that incorporates conventional quality control (QC) rules, exponentially weighted moving average (EWMA), and cumulative sum (CUSUM) with the random forest (RF) algorithm to achieve better performance, and to evaluate the performance of the models using computer simulation to aid laboratory professionals in QC procedure planning. Methods: Conventional QC rules, EWMA, CUSUM, and RF models were implemented on the simulation data using an in-house algorithm. The models' performance was evaluated on 170,000 simulated QC results using outcome metrics, including the probability of error detection (Ped), probability of false rejection (Pfr), average run length (ARL), and power graph. Results: The highest Pfr (0.0404) belonged to the 1-2s rule. The 1-3s rule could not detect errors with a Ped of 0.9 for systematic errors up to 4 SD. The random forest model had the highest Ped for systematic errors lower than 1 SD. However, the ARLs of the model require that the RF model be combined with conventional QC rules having lower ARLs, or that more than one QC measurement per run be used. Conclusions: The RF model presented in this study showed acceptable Ped for most degrees of systematic error. The outcome metrics established in this study will help laboratory professionals plan internal QC.


Introduction
The probability, severity, and detectability of an error constitute the three pillars to be considered in the risk management of the clinical laboratory [1]. In addition, quality control (QC) rules are involved in the formal decision-making process, which aims to detect whether examination procedures are performing in an in-control state.
Conventional QC rules such as 1-2s, 1-3s, 2-2s, 4-1s, and 10x are implemented on Shewhart's graph, which displays QC results, the target value (mean), and multiples of the standard deviation for an available time period [2]. In addition to conventional Westgard rules, trend detection rules such as exponentially weighted moving average (EWMA) and cumulative sum (CUSUM) help laboratory professionals detect smaller shifts and trends [1]. Roberts first presented the EWMA chart in 1959; it is created by calculating the weighted average of current and previous data [3]. The EWMA chart is superior to Shewhart's graph for detecting small shifts from the target value, particularly when a lower weighting factor (0.05-0.2) is used [4,5]. Page developed the CUSUM chart in 1954 by plotting deviations of the measured values from the target value [6]. Like EWMA charts, CUSUM charts can identify small shifts more efficiently than Shewhart's chart [7]. Thus, EWMA and CUSUM charts help to detect trends before they become substantial errors.
QC rules can be selected using outcome metrics. The outcome metrics of QC procedures, such as the probability of error detection (Ped), probability of false rejection (Pfr), and average run length (ARL), can be estimated by assessment of retrospective data, by mathematical calculations, or by computer simulations [1]. Power graphs are also used to select suitable QC rules and the number of QC measurements according to the performance of the measurement procedure [8]. The Pfr should be as low as possible; however, a lower Pfr may be accompanied by a lower Ped [1]. It is crucial to achieve a balance between low Pfr and high Ped to meet acceptable performance. While these metrics are important for implementing a proper QC strategy according to CLSI C24ed4, only a few older studies [9,10] have investigated them, mostly for conventional rules, and one study compared EWMA with conventional rules [11]. Furthermore, to date, no study has investigated the implementation of machine learning algorithms in the QC practice of laboratory medicine.
The present study set out to build a machine learning model that incorporates conventional QC rules, EWMA, and CUSUM rules with a random forest algorithm to achieve better performance. Furthermore, this research aimed to evaluate the performance of the rules mentioned above and of the random forest model using computer simulation to aid laboratory professionals in QC procedure planning.

Materials and methods
The present study implemented conventional internal QC rules, EWMA, and CUSUM approaches on the simulation data using an in-house algorithm written in the Python 3.7.6 programming language [12] (the source code can be downloaded from: https://github.com/hikmetc/IQCAI). Figure 1 summarizes the overall study design. The simulation procedure and calculations were based on one QC measurement per run.

Conventional internal QC rules
The conventional internal QC rules defined in Table 1 were embedded in our algorithm to generate an output about the status of the QC result as 1 and 0 for out-of-control and in-control conditions, respectively.
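The rule outputs described above can be sketched in a few lines of Python. This is a minimal illustration and not the in-house algorithm itself: the function names are hypothetical, the results are assumed to be already standardized as (result - mean)/SD, and the consecutive-result rules are assumed to require all results on the same side of the target.

```python
from typing import List


def rule_1ks(z: List[float], k: float) -> List[int]:
    """1-ks rule (e.g. 1-2s, 1-3s): flag a result more than k SD from the target."""
    return [1 if abs(v) > k else 0 for v in z]


def rule_nks(z: List[float], n: int, k: float) -> List[int]:
    """n-ks rule (e.g. 2-2s, 4-1s): flag when the last n results all exceed
    k SD on the same side of the target (same-side requirement is an assumption)."""
    flags = []
    for i in range(len(z)):
        window = z[max(0, i - n + 1): i + 1]
        hit = len(window) == n and (
            all(v > k for v in window) or all(v < -k for v in window)
        )
        flags.append(1 if hit else 0)
    return flags


def rule_nx(z: List[float], n: int) -> List[int]:
    """n-x rule (e.g. 8x, 10x, 12x): flag when the last n results all lie
    on one side of the target."""
    flags = []
    for i in range(len(z)):
        window = z[max(0, i - n + 1): i + 1]
        hit = len(window) == n and (
            all(v > 0 for v in window) or all(v < 0 for v in window)
        )
        flags.append(1 if hit else 0)
    return flags


# Hypothetical standardized QC results, for illustration only.
z_scores = [0.5, 2.3, 2.1, -0.2, 1.2, 1.1, 1.3, 1.5, 3.2]
flags_1_2s = rule_1ks(z_scores, 2)
flags_2_2s = rule_nks(z_scores, 2, 2)
flags_4_1s = rule_nks(z_scores, 4, 1)
```

Each function returns one 0/1 value per QC result, so the outputs of several rules can be placed side by side as columns.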

EWMA approach
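The EWMA chart was constructed using the following formula [7]:

$$z_i = \lambda x_i + (1 - \lambda) z_{i-1}$$

λ: Weighting factor, which adjusts the weight given to the current and previous results. x_i: QC result.
Upper and lower control limits (UCL and LCL) were determined as follows [7]:

$$\mathrm{UCL} = \mu + L\sigma\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]}$$

$$\mathrm{LCL} = \mu - L\sigma\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]}$$

μ = z_0: Target value (mean). L: The factor that determines the width of the upper and lower control limits. σ: Standard deviation.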
If z_i surpasses the control limits, the process is considered out-of-control [7]. The QC procedure simulation used three different weighting factors (0.05, 0.1, and 0.2). The present study's EWMA algorithm outputs 1 and 0 for out-of-control and in-control conditions, respectively.
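The EWMA decision can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the study's code: the function name is hypothetical, the control limits use the common textbook time-varying form, and L=3 is an assumed value because the width factor used in the study is not stated here.

```python
import math
from typing import List


def ewma_flags(x: List[float], mu: float, sigma: float,
               lam: float = 0.2, L: float = 3.0) -> List[int]:
    """Return 1 (out-of-control) or 0 (in-control) for each QC result.

    lam is the weighting factor; L=3.0 is an assumed control-limit width.
    """
    z_prev = mu  # the EWMA starts at the target value (z_0 = mu)
    flags = []
    for i, xi in enumerate(x, start=1):
        z = lam * xi + (1 - lam) * z_prev
        # Time-varying half-width of the control limits (assumed textbook form).
        half_width = L * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))
        )
        flags.append(1 if abs(z - mu) > half_width else 0)
        z_prev = z
    return flags


# Hypothetical data: 5 in-control results, then a sudden 3 SD shift.
results = [100.0] * 5 + [103.0] * 10
flags = ewma_flags(results, mu=100.0, sigma=1.0, lam=0.2)
first_signal = flags.index(1)  # zero-based position of the first rejection
```

In this toy series the shift begins at position 5 and the first signal appears at position 6, that is, at the second shifted result, which illustrates how the weighting of previous results delays the response to a sudden large shift.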

CUSUM approach
The CUSUM chart was applied after standardization of the QC result (x_i) with the following formula [7]:

$$y_i = \frac{x_i - \mu_0}{\sigma}$$

y_i: the standardized value of x_i. μ_0: Target value (mean). σ: Standard deviation.
Standardized CUSUM values (C) for positive and negative deviations were calculated as follows, respectively [7]:

$$C_i^{+} = \max\left[0,\; y_i - k + C_{i-1}^{+}\right]$$

$$C_i^{-} = \max\left[0,\; -k - y_i + C_{i-1}^{-}\right]$$

k: the reference value, taken as 0.5. When a CUSUM value (C_i^+ or C_i^-) exceeds the default control limit (h=5), the process is considered out-of-control [7].
The CUSUM algorithm outputs 1 and 0 for out-of-control and in-control conditions, respectively.
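A two-sided standardized CUSUM with k=0.5 and h=5 can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated in the text: both sums start at zero, and they are not reset after a signal.

```python
from typing import List


def cusum_flags(x: List[float], mu0: float, sigma: float,
                k: float = 0.5, h: float = 5.0) -> List[int]:
    """Return 1 (out-of-control) or 0 (in-control) for each QC result."""
    c_pos = 0.0  # cumulative sum of positive deviations (assumed start at 0)
    c_neg = 0.0  # cumulative sum of negative deviations (assumed start at 0)
    flags = []
    for xi in x:
        y = (xi - mu0) / sigma          # standardized QC result
        c_pos = max(0.0, y - k + c_pos)
        c_neg = max(0.0, -k - y + c_neg)
        flags.append(1 if (c_pos > h or c_neg > h) else 0)
    return flags


# Hypothetical data: 5 in-control results, then a sustained 1 SD upward shift.
results = [100.0] * 5 + [101.0] * 20
flags = cusum_flags(results, mu0=100.0, sigma=1.0)
first_signal = flags.index(1)  # zero-based position of the first rejection
```

With a constant 1 SD shift each result adds 0.5 to the positive sum, so in this toy series the limit h=5 is first exceeded at the eleventh shifted result.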
Conventional QC rules (including 1-2s, 1-3s, 2-2s, 4-1s, 8x, 10x, and 12x), EWMA, and CUSUM rules were implemented on the abovementioned 32,000 internal QC results, and the outputs were recorded as 1 and 0 for out-of-control and in-control conditions, respectively. The output results of the QC rules were used as input variables, and the presence of an error was used as the target variable for the machine learning model (Figure 1, Step 1).
A total of 32,000 output results were split into training and test data sets using 80% and 20% of the overall data, respectively. The training data set was used to build the machine learning model using the random forest algorithm and for ten-fold cross-validation [14]. The random forest classifier model used 200 decision trees, entropy-based information gain as the split criterion, and otherwise default parameters of the scikit-learn 0.24.1 package [15,16] (the source code can be downloaded from: https://github.com/hikmetc/IQCAI). Hyperparameter optimization was not performed. The random forest model's performance was assessed on the test set using sensitivity, specificity, accuracy, and receiver operating characteristic (ROC) curve analysis.
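The training setup described above (80/20 split, 200 trees, entropy criterion, ten-fold cross-validation) can be sketched with scikit-learn as follows. The data here are a synthetic stand-in generated for illustration only: five hypothetical rule flags that fire more often when an error is present. The firing probabilities, sample size, and random seeds are assumptions and do not come from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n_samples, n_rules = 2000, 5

# Target variable: 1 if an error is present, 0 otherwise (synthetic).
y = rng.integers(0, 2, size=n_samples)

# Input variables: 0/1 outputs of hypothetical QC rules. Each flag fires with
# probability 0.6 when an error is present and 0.02 otherwise (assumed values).
p_fire = np.where(y[:, None] == 1, 0.6, 0.02)
X = (rng.random((n_samples, n_rules)) < p_fire).astype(int)

# 80% training, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 200 decision trees, entropy criterion, other parameters left at defaults.
model = RandomForestClassifier(
    n_estimators=200, criterion="entropy", random_state=0
)

# Ten-fold cross-validation on the training set, then a final fit.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
importances = model.feature_importances_  # one value per input rule
```

The `feature_importances_` attribute gives a relative importance for each input rule, which is the kind of quantity a feature importance plot would display.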

Performance evaluation of conventional QC rules, EWMA, CUSUM and random forest model
According to CLSI C24ed4, computer simulations and power graphs are required to evaluate the performance of QC rules. The Ped, Pfr, and ARL are considered crucial outcome metrics for internal QC rule evaluation [1]. Therefore, the external performance evaluation of the random forest model and the general performance evaluation of conventional QC rules, EWMA, and CUSUM approaches were performed using the outcome metrics and power graph (Figure 1, Step 2).
The overall performance of the QC rules, EWMA, CUSUM, and random forest model was evaluated on a total of 170,000 simulated QC results consisting of in-control and out-of-control conditions. These data were independent of the training data set used for the random forest model. Systematic errors corresponding to 0.25, 0.50, 0.75, 1.0, 1.25, 1.50, 1.75, 2.0, 2.25, 2.50, 2.75, 3.0, 3.25, 3.50, 3.75, and 4.0 multiples of the standard deviation were introduced separately into Gaussian-distributed in-control simulation data to produce the out-of-control condition. The Ped and ARL were calculated after adding each degree of systematic error to the in-control results (1,000 in-control results followed by 1,000 out-of-control results in each systematic error simulation). Run-length values were determined as the position of the first rejection signal counted from the beginning of the out-of-control results, whereas the frequency of rejection signals among the out-of-control results yielded the probability of error detection. The out-of-control simulation was repeated 10 times, and the average values of Ped and run length were recorded ((10 × 1,000) × 16 degrees of systematic error).
The Pfr values were determined from data comprising 10,000 in-control QC results (Pfr = number of false rejection signals/total number of in-control QC results). Then, the power graph was formed using the Ped and Pfr values of the QC rules.
The QC rules corresponding to the Ped of ≥0.9 were regarded as acceptable for an intended degree of systematic error [2]. The probability of ≤0.05 was considered acceptable for false rejection [2].
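The simulation scheme described above can be sketched for a single rule as follows. This is an illustrative sketch, not the published code: the function names and seeds are hypothetical, results are simulated directly as standardized values, and the simple 1-2s rule stands in for any rule that returns one 0/1 flag per result.

```python
import random
from typing import Callable, List, Tuple

Rule = Callable[[List[float]], List[int]]


def flag_1_2s(z: List[float]) -> List[int]:
    """1-2s rule on standardized results: flag values beyond 2 SD."""
    return [1 if abs(v) > 2 else 0 for v in z]


def simulate_ped_arl(rule: Rule, shift_sd: float, n_in: int = 1000,
                     n_out: int = 1000, repeats: int = 10,
                     seed: int = 0) -> Tuple[float, float]:
    """Average Ped and run length for one degree of systematic error."""
    rng = random.Random(seed)
    peds, run_lengths = [], []
    for _ in range(repeats):
        # In-control results followed by results shifted by shift_sd.
        z = [rng.gauss(0, 1) for _ in range(n_in)]
        z += [rng.gauss(shift_sd, 1) for _ in range(n_out)]
        out_flags = rule(z)[n_in:]
        # Ped: frequency of rejection signals among out-of-control results.
        peds.append(sum(out_flags) / n_out)
        # Run length: position of the first signal after the error starts.
        if 1 in out_flags:
            run_lengths.append(out_flags.index(1) + 1)
    ped = sum(peds) / len(peds)
    arl = sum(run_lengths) / len(run_lengths) if run_lengths else float("inf")
    return ped, arl


def simulate_pfr(rule: Rule, n: int = 10000, seed: int = 1) -> float:
    """Pfr: false rejection signals divided by total in-control results."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    return sum(rule(z)) / n


ped_2sd, arl_2sd = simulate_ped_arl(flag_1_2s, shift_sd=2.0)
pfr_1_2s = simulate_pfr(flag_1_2s)
```

Repeating `simulate_ped_arl` over the 16 degrees of systematic error and plotting Ped against the size of the error, with Pfr as the value at zero error, gives the points of a power graph for the rule.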
The outcome metrics were calculated by the in-house Python code (https://github.com/hikmetc/IQCAI). The power graph was drawn using GraphPad Prism version 9 (San Diego, USA).

Results
The random forest model's performance characteristics on the initial test data set (n=6,400) are given in the supplementary material and Table 2. The accuracy from ten-fold cross-validation was 95.79% ± 0.23%.
Pfr and Ped of the IQC rules, EWMA, CUSUM, and the random forest model are shown in Table 3. While lower ARL and Pfr values are preferred for QC rules, Ped should be as high as possible. The highest Pfr (0.0404) belonged to the 1-2s rule. On the other hand, the 12x rule had the lowest Pfr (0.0004). Pfr and Ped for different degrees of systematic errors are represented in the power graph shown in Figure 2. The random forest model had the highest Ped for systematic errors lower than 1 SD. Table 4 presents the ARL values of the internal QC rules, EWMA, CUSUM, and the random forest model. The lowest average number of QC runs (events) before false rejection (ARLfr) (25) belonged to the 1-2s rule, and the highest ARLfr (1,059) was detected for the 4-1s rule. Figure 3 shows that the lowest average number of QC runs (events) required to detect an out-of-control condition (ARLed) belonged to the 1-2s rule for all degrees of systematic errors.

Table 1: Definitions of the internal quality control (QC) terms.
Conventional rules:
1-2s: Reject a run exceeding 2 SD from the target (mean) value.
1-3s: Reject a run exceeding 3 SD from the target (mean) value.
2-2s: Reject when two consecutive results exceed 2 SD from the target (mean) value.
4-1s: Reject when four consecutive results exceed 1 SD from the target (mean) value.
8x: Reject when eight consecutive results are found on one side of the target (mean) value.
10x: Reject when 10 consecutive results are found on one side of the target (mean) value.
12x: Reject when 12 consecutive results are found on one side of the target (mean) value.
Trend detection rules:
EWMA rule: Reject a run when the EWMA value exceeds control limits.

Discussion
The in-house code using the random forest algorithm successfully incorporated conventional internal QC rules, EWMA, and CUSUM. The error detection performance of the random forest model was good, especially for small systematic errors (<1 SD). The error detection, false rejection, and average run length metrics of conventional QC rules, EWMA, CUSUM, and the novel model were presented extensively. All of the QC models achieved an acceptable Pfr (<0.05) for one QC measurement per run. However, the current study simulated only one QC measurement per run, and the Pfr values will roughly double and triple in the case of two and three QC measurements per run, respectively [2]. As illustrated in Figure 2, while QC rules like 2-2s and 1-3s were suitable for detecting larger systematic errors, the random forest, CUSUM, and EWMA approaches could identify smaller systematic errors or trends.
Abbreviations: SD, standard deviation; IQC, internal quality control; EWMA, exponentially weighted moving average; CUSUM, cumulative sum; λ, weighting factor of EWMA; ARLfr, average number of QC runs (events) before false rejection; ARLed, average number of QC runs (events) required to detect out-of-control condition.
In the present study, the 1-2s rule showed the highest Pfr (0.04). Thus, 1 out of every 25 QC results produced a false rejection with the 1-2s rule (Table 3). Additionally, when the 1-2s rule is applied to 25 different tests, it can be assumed that 1 test will be rejected every day. The Pfr of a desirable QC rule should be as low as possible [1]. Therefore, the 1-2s rule should be implemented cautiously. Surprisingly, Rosenbaum et al. [17] reported that 16 out of 21 clinical laboratories of reputed academic centers in the USA used only the 2 SD rule. Figure 3 and Table 4 show that the 1-2s rule had the lowest ARLfr and ARLed values. A possible motivation for using the 1-2s rule [17] might be these low ARL values, which result in frequent QC alerts serving as warnings in daily practice. Nevertheless, the Ped of the 1-2s rule was lower than that of the 8x, 10x, 12x, EWMA, CUSUM, and random forest models for most degrees of systematic error. Thus, the 1-2s rule should be considered a warning alert or be used in a multirule scheme.
The 1-3s rule could not detect errors with a 0.9 probability of error detection for systematic errors up to 4 SD (Table 3, Figure 2) despite its acceptable Pfr (0.0018). Hence, the 1-3s rule was not sufficient for stand-alone use. The 2-2s rule had a Pfr similar to that of the 1-3s rule. However, the Ped of the 2-2s rule was higher than that of the 1-3s rule for systematic errors higher than 1 SD (Table 3, Figure 2). The 4-1s rule was able to identify a systematic error of 3 SD with a Ped higher than 0.9. However, as shown in Table 4, the ARL value of the 4-1s rule for large systematic errors (3-4 SD) was 4, which may lead to a lag in detection in stand-alone use.
Although the 8x, 10x, and 12x rules showed acceptable Ped for systematic errors higher than 2.25 SD (Table 3), their ARL values were 7, 9, and 11, respectively (Table 4). Thus, used individually, these rules were not sufficient for promptly detecting larger errors of 2 SD or more. On the other hand, the Ped values of the 8x, 10x, and 12x rules were unacceptable (<0.9) for systematic errors lower than 2 SD (Table 3, Figure 2). These findings showed that the 8x, 10x, and 12x rules were inefficient when used individually for the statistical QC procedure.
The weighting factor constitutes the backbone of the EWMA approach. Linnet modified the EWMA approach using 2 and 3 SD as control limits and 0.5 as the weighting factor. This modified EWMA approach outperformed conventional QC rules and multirules for detecting larger errors such as 2 SD [11]. However, the original control limits and weighting factors of 0.2, 0.1, and 0.05 were used in the present study to suit its purpose of detecting trends and small systematic errors [5]. Overall, Pfr values were quite low for the EWMA approaches relative to the acceptable limit of 0.05. The Ped values from highest to lowest belonged to EWMA (λ=0.05), EWMA (λ=0.1), and EWMA (λ=0.2), respectively (Table 3, Figure 2). Hence, the EWMA approach using a λ value of 0.05 was able to detect smaller systematic errors with acceptable performance compared to the other λ values (Figure 2).
As shown in Figure 2, the performance of the CUSUM approach, like that of the EWMA approach, was acceptable for identifying small systematic errors. On the other hand, the Pfr value of CUSUM (0.012) was higher than the Pfr values of EWMA (0.0016-0.0044) and the random forest model (0.0048) (Table 3). Westgard et al. [18] showed that the combined use of CUSUM and Shewhart charts enhanced Ped. Furthermore, the combined use of EWMA and the Shewhart chart (conventional QC rules) was recommended previously [19] because of the inertia effect, which leads to a lag in error detection, particularly when a lower weighting factor is used [7]. Likewise, in the present study, the EWMA and CUSUM approaches could not promptly detect substantially large systematic errors such as 3 SD with one QC measurement per run, as inferred from the ARLed values (Table 4). Thus, laboratory professionals should consider the ARL values when combining different QC procedures and determining the number of QC measurements while preserving acceptable limits of Pfr. For instance, the Pfr value of CUSUM is 0.012; thus, the remaining Pfr of 0.038 can be reserved for other rules having lower ARLed while preserving an acceptable Pfr (0.05). The remaining Pfr can also be reserved for rules that can detect random errors. For example, a laboratory professional may intend to detect systematic errors higher than 1.5 SD as well as substantial random errors. In this case, EWMA (λ=0.2) may be preferred for preserving Ped ≥0.9 and may be combined with the 1-3s rule, which can cover random errors [20], while maintaining Pfr ≤0.05.
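The Pfr budgeting described above amounts to simple arithmetic, sketched below. The helper names are hypothetical, and summing the individual Pfr values is used here as a conservative upper bound on the combined Pfr, which is an assumption of this sketch. Because the Pfr of each individual EWMA setting is not given in the text, the upper end of the reported EWMA range (0.0044) is used.

```python
from typing import List

ACCEPTABLE_PFR = 0.05  # acceptable limit for false rejection used in the text


def combined_pfr_upper_bound(pfrs: List[float]) -> float:
    """The combined Pfr of several rules cannot exceed the sum of their Pfr values."""
    return min(1.0, sum(pfrs))


def remaining_pfr(pfrs: List[float], limit: float = ACCEPTABLE_PFR) -> float:
    """Pfr still available for additional rules under the acceptable limit."""
    return limit - combined_pfr_upper_bound(pfrs)


# CUSUM alone (Pfr 0.012) leaves 0.038 for other rules.
cusum_remaining = remaining_pfr([0.012])

# EWMA (upper end of reported range, 0.0044) combined with 1-3s (0.0018).
ewma_plus_1_3s = combined_pfr_upper_bound([0.0044, 0.0018])
within_limit = ewma_plus_1_3s <= ACCEPTABLE_PFR
```

Under this bound, combining an EWMA rule with the 1-3s rule stays well below the 0.05 limit.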
The random forest model showed the best Ped for systematic errors of 0.25, 0.5, and 0.75 SD. Its Ped values for systematic errors equal to or higher than 1 SD were similar to those of the EWMA (λ=0.05) and CUSUM approaches (Table 3). In addition, the Ped values of the random forest model were acceptable for systematic errors higher than 0.5 SD while preserving an acceptable Pfr (0.0048). However, as shown in Table 4, the ARLed values did not allow the random forest model to be used individually in a scheme with one QC measurement per run. Therefore, the random forest model should be used in combination with conventional QC rules having lower ARLed for large systematic errors, or more than one QC measurement per run is required.
The CUSUM approach had the highest feature importance among the input variables of the random forest model, followed by EWMA (λ=0.05) and EWMA (λ=0.1) (Figure 4). In contrast, the 4-1s rule had the lowest importance for the random forest model in predicting systematic error (Figure 4). Therefore, it can be suggested that the CUSUM and EWMA approaches should preferably be included in the internal QC procedure to detect systematic analytical errors even if the machine learning model is not used.
According to CLSI C24ed4, procedures with quite good analytical performance relative to medical needs may be tracked with less strict QC rules like 4-1s and 1-3s. On the other hand, procedures with marginal performance may need a combination of QC rules [1]. The error detection ability of QC rules varies with the magnitude and type of the out-of-control condition. While rules like 2-2s and 1-3s are useful for detecting larger shifts, trend detection rules such as EWMA and CUSUM aid in detecting smaller shifts and trends [1]. Although most laboratories implement simple rules such as 2 SD in daily practice [17], with the evolution of technology and computer science, the implementation of multiple QC procedures and state-of-the-art tools like machine learning models has become feasible.
This is the first study implementing a machine learning model as an internal QC rule in laboratory medicine. The source code and the developed random forest model were also shared publicly (https://github.com/hikmetc/IQCAI). The present study also provided an extensive performance evaluation of well-known internal QC rules, which had been investigated by only a few studies [9-11]. The Pfr values of 1-2s and 1-3s were reported as 0.049 and 0.0020 by Westgard et al. [10] and 0.089 and 0.0054 by Parvin [21], respectively. In addition, the latter study reported higher Ped values for these rules than the present study. However, 2 SD covers at least 95% of Gaussian-distributed results, which is not consistent with a Pfr higher than 0.05 for the 1-2s rule. Nevertheless, the present study provides the source code of the simulation to maintain its reproducibility. The outcome metrics established in this study will help laboratory professionals plan internal QC procedures. It should be kept in mind that while EWMA, CUSUM, and the novel random forest model can detect small systematic errors or trends with their higher Ped, this ability may lead to frequent interruptions in the daily laboratory routine due to small but not clinically substantial errors. Therefore, laboratory professionals may need to choose between a strategy of detecting small shifts before they become substantial errors and one of detecting only substantial errors using conventional rules like 1-3s and 2-2s.
The present study was limited by the absence of a performance evaluation of multirules. Multirules are formed by combining conventional QC rules and were not compared with the random forest model presented in this study. Linnet demonstrated that the EWMA approach had better performance than multirules [11]. While we showed that the random forest model outperformed the EWMA approaches, the weighting factor and control limits used in the present study differed from Linnet's. Thus, further research is needed to address the concomitant comparison of multirules, EWMA, CUSUM, and random forest models. The second limitation was that the present study simulated only the scenario of one QC measurement per run. Assuming different concentration levels are tested in the internal QC procedure, further studies with higher numbers of QC measurements need to be conducted. Another limitation was that only systematic errors were simulated in the present study. Laboratory professionals face random errors in the daily routine; therefore, more research is needed to investigate both random and systematic errors for conventional QC rules, EWMA, CUSUM, and random forest models.

Conclusions
The random forest model presented in this study showed acceptable Ped for most degrees of systematic error. However, the ARL values of the model require that the random forest model be combined with conventional QC rules having lower ARL values, or that more than one QC measurement per run be used. The present study has reported extensive outcome metrics of internal QC procedures, which may help laboratory professionals plan their procedures with acceptable Ped and Pfr values.
Ped: Probability of error detection. Pfr: Probability of false rejection.
Note: The ARL value is related to the response time of QC rules. Therefore, the ARL value of a multirule scheme would be the minimum ARL among the combined rules.
Example: Objective: to detect systematic errors higher than 1.5 SD with Ped >0.9 and Pfr <0.05 and to detect systematic errors >3 SD with an ARL of 1. The Ped values of EWMA are higher than 0.9 for systematic errors ≥1.5 SD; thus, the combined Ped will be higher than 0.9. The minimum ARL value for systematic errors >3 SD among the 1-3s rule and EWMA (λ=0.2) is 1, which belongs to the 1-3s rule (Table 4). Overall, the present scheme could detect systematic errors >3 SD with an ARL of 1 and detect systematic errors ≥1.5 SD with Ped >0.9 while preserving Pfr <0.05.