Development and validation of a novel model to predict pulmonary embolism in cardiology suspected patients: A 10-year retrospective analysis

Abstract
No predictive model for pulmonary embolism (PE) has been established for patients with suspected PE in the cardiology department. This study developed a model to predict the probability of PE in these patients. This retrospective analysis evaluated data from 995 patients with suspected PE at the cardiology department from January 2012 to December 2021. Patients were randomly divided into training and validation cohorts (7:3 ratio). Optimal predictive features were selected using least absolute shrinkage and selection operator regression, and the model was established using multivariate logistic regression. The features used in the final model included clinical and laboratory factors. A nomogram was developed, and its performance was assessed and validated in terms of discrimination, calibration, and clinical utility. The final model included six PE-associated variables (age, pulse, systolic pressure, syncope, D-dimer, and coronary heart disease). The areas under the receiver operating characteristic curve were 0.721 (95% confidence interval: 0.676–0.766) in the training cohort and 0.709 (95% confidence interval: 0.633–0.784) in the validation cohort. Predictions and actual observations showed good consistency in both cohorts. In decision curve analysis, the numerical model had a good net clinical benefit. This novel model can predict the probability of PE in patients with suspected PE in the cardiology department.


Introduction
Pulmonary embolism (PE) is a serious and potentially life-threatening medical condition resulting from blood clot formation in one or more arteries in the lungs [1]. Patients with cardiovascular diseases and cancer [2], particularly those with heart failure, atrial fibrillation, and coronary artery disease, are at an increased risk of developing PE because of the pro-thrombotic state of their conditions [3]. Furthermore, patients who undergo cardiac surgeries or interventions [4], such as coronary artery bypass grafting and percutaneous coronary intervention, are at a higher risk of developing PE. PE can have a significant effect on the prognosis and quality of life of patients with cardiovascular diseases [5]; therefore, timely diagnosis and treatment are essential for improving patients' outcomes [6]. Current diagnostic methods have limitations and may lead to unnecessary testing and delays in treatment [7].
Computed tomography pulmonary angiography (CTPA) is the gold standard diagnostic method for PE [8], but it is associated with high radiation exposure, contrast-induced nephropathy, and allergic reactions. Furthermore, CTPA may not be appropriate for some patients, such as pregnant women or those with kidney disease. Therefore, there is a need for noninvasive and accurate methods to identify patients at high risk of PE. Several studies have been conducted to develop and validate models for predicting the risk of PE in various populations, including emergency department patients [9], pregnant and postpartum women [10], and patients with underlying medical conditions such as cancer [11] and heart failure [12]. One example of a scoring system is the Wells score, which is widely used to assess the probability of PE in patients with suspected PE [13]. The Wells score is based on clinical variables such as prior deep vein thrombosis symptoms, clinical symptoms, and presence of limb edema. Other scoring systems [14], such as the Geneva score and the PE rule-out criteria, have also been developed and validated in patients presenting with a primary complaint of shortness of breath or chest pain, and their use is reasonable for either of these symptoms.
In recent years, there has been increasing interest in developing predictive models for PE in specific populations. For example, Jen et al. developed a new model that outperforms existing predictive tools in all patients with PE [15]. Lin et al. developed a new clinical predictive model that can identify patients who are at high risk of venous thromboembolism and help provide medical intervention in patients with diabetes and the general population [16]. While PE predictive models have been developed and validated in various populations [17], there are still some shortcomings and deficiencies that need to be addressed. One limitation of existing PE predictive models is their lack of generalizability across different patient populations. For example, a model developed in a population of emergency department patients may not be applicable to patients in a primary care setting or those with underlying medical conditions such as cancer or heart failure. In addition, some PE predictive models may not fully capture the complex interactions between various risk factors and their contribution to PE development. For example, the Wells score, although widely used, does not include variables such as the presence of a hypercoagulable state, which may increase the risk of thromboembolism [18].
Currently, there is no widely accepted model or guideline for predicting the probability of PE in patients with cardiovascular disease. Therefore, there is a critical need for an accurate and reliable predictive model for PE in these patients.
In this study, we bridge this gap by developing and validating a novel numerical model to predict the probability of PE in patients with cardiovascular disease. This model simplifies risk assessment and provides a user-friendly interface for medical practitioners to assess a patient's risk level.

Patient enrollment and data collection
This retrospective study enrolled patients with suspected PE at the Department of Cardiology at the Affiliated Dongyang Hospital of Wenzhou Medical University from January 2012 to December 2021. The data of 995 subjects were collected from the hospital's clinical research data platform after baseline data cleaning and extraction. The patients were randomly divided into a training cohort and a validation cohort at a ratio of 7:3.
Ethical approval: This study was approved by the Medical Ethics Committee of the Affiliated Dongyang Hospital of Wenzhou Medical University (No.: 2022-YX-160). The requirement for informed consent was waived. Patient records and information were anonymized and de-identified before our analysis. Our research was conducted in adherence with the Declaration of Helsinki.
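The 7:3 random split can be illustrated with a short sketch. The paper's analysis was done in R, so this Python version is only an illustration; the function name, the seed, and the rounding-up rule are assumptions, chosen so that 995 patients give the 697 and 298 records reported later in the text.

```python
import random


def split_cohorts(patient_ids, train_parts=7, valid_parts=3, seed=2021):
    """Randomly split IDs into training and validation cohorts (default 7:3)."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    total_parts = train_parts + valid_parts
    # Ceiling division (an assumption): 995 patients give 697 training records.
    n_train = (train_parts * len(ids) + total_parts - 1) // total_parts
    return ids[:n_train], ids[n_train:]


# Hypothetical patient identifiers 0..994 stand in for the real records.
train_ids, valid_ids = split_cohorts(range(995))
print(len(train_ids), len(valid_ids))
```

Fixing the seed keeps the cohort assignment reproducible across runs; the seed actually used by the authors is not stated.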

Diagnostic criteria
The diagnosis of PE in our study was based on the criteria outlined in the European Society of Cardiology Guidelines [19], and patients who had undergone CTPA examination were classified as those having suspected PE. The diagnosis was based on the presence of a filling defect in the pulmonary artery system, including the subsegmental pulmonary arteries, as seen on CTPA. In addition to CTPA results, we collected patients' past medical history, clinical features, complications, and biomarker data using strictly defined indicators. For instance, we selected the lowest value of blood oxygen saturation, systolic blood pressure, and diastolic pressure from admission to CTPA, while the highest value was chosen for other indicators. A flowchart of the steps involved in the PE prediction model is presented in Figure 1.

Statistical analysis
The data were analyzed using RStudio software for Windows. Categorical variables were presented as frequencies with percentages and were compared using either the χ² test or Fisher's exact test. Continuous variables were expressed as mean values with standard deviations or medians with interquartile ranges and were compared using either Student's t-test or the Mann-Whitney U test. A total of 58 variables were collected for each subject. To ensure data reliability, 13 indicators with missing information in greater than 20% of patients were excluded. Multiple imputation techniques [20] using the "mice" package in R software were applied to impute the remaining missing predictor values. The optimal predictive features were selected using the least absolute shrinkage and selection operator (LASSO) regression analysis [21] with the "glmnet" package, and the numerical model was established using multivariate logistic regression analysis with the "rms" package. A nomogram was constructed using the "regplot" package in R software. The features were presented as odds ratios (ORs) with 95% confidence intervals (CIs). A two-sided p-value of less than 0.05 was considered statistically significant.

Model development, validation, and evaluation
In the training cohort, we employed LASSO regression to select the optimal predictive features and developed a multivariable logistic regression model to predict the probability of PE. To evaluate the performance of the model, discrimination, calibration, and clinical utility were assessed and validated in both cohorts. Discrimination was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC) with the "pROC" package. Calibration was assessed with calibration curve analysis using the "calibrate" package. We performed decision curve analysis (DCA), clinical impact curve analysis, and net reduction curve analysis with the "rmda" package to quantify the net benefit under different threshold probabilities and to determine the clinical utility of the model.
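As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen patient with PE receives a higher predicted risk than a randomly chosen patient without PE. It is a plain-Python illustration, not the "pROC" implementation used in the study; the function name and the toy scores are assumptions.

```python
def auc_rank(scores, labels):
    """AUC as the chance that a random positive case outranks a random negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities and outcomes (1 = PE, 0 = no PE), not study data.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_auc = auc_rank(toy_scores, toy_labels)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in cohort size, which is acceptable for a few hundred patients; rank-based formulas give the same value faster.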

Study population characteristics
In this study, we excluded 13 variables with missing information in more than 20% of patients, leaving 45 variables with missing data in less than 20% of patients (as shown in Appendix 1). The multiple imputation technique was used to impute the missing data for these 45 variables, which ranged from 0.00 to 12.46%. A total of 995 subjects with suspected PE were included, and the incidence of PE in our study was 17.98%. The baseline characteristics of patients with suspected PE at the cardiology department are presented in Table 1. We randomly divided the patients into the training cohort (n = 697) and the validation cohort (n = 298). The baseline characteristics of patients in the two cohorts are shown in Table 2. There was no significant difference in each indicator between the two cohorts, except for two indicators (platelet distribution width and thrombin time).
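The 20% missingness rule can be sketched as a simple filter. This is a hedged Python illustration rather than the authors' R code; the function name, the variable names, and the toy records are hypothetical, and missing values are assumed to be stored as None.

```python
def drop_sparse_variables(records, max_missing=0.20):
    """Return the variable names whose missing fraction does not exceed max_missing."""
    names = list(records[0].keys())
    n = len(records)
    kept = []
    for name in names:
        missing = sum(1 for r in records if r.get(name) is None)
        if missing / n <= max_missing:  # exclude only when missingness is greater than 20%
            kept.append(name)
    return kept


# Toy records with hypothetical variables; None marks a missing value.
toy_records = [
    {"d_dimer": 1.2, "pulse": 88, "troponin": None},
    {"d_dimer": 0.4, "pulse": None, "troponin": 0.02},
    {"d_dimer": 3.1, "pulse": 102, "troponin": None},
    {"d_dimer": 0.9, "pulse": 76, "troponin": 0.01},
    {"d_dimer": 2.5, "pulse": 95, "troponin": 0.03},
]
kept_variables = drop_sparse_variables(toy_records)
print(kept_variables)  # troponin (40% missing) is dropped; pulse (20%) is kept
```

Variables that pass the filter would then go to multiple imputation, which this sketch does not attempt.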

Selected predictors and construction model
After applying the LASSO regression analysis, we identified six of 45 variables that were potential predictive features (Figure 2a and b). The optimal predictors were age, pulse, systolic pressure, syncope, D-dimer, and coronary heart disease. These six potential predictive features were used to develop the final model based on the multivariable logistic regression analysis in the training cohort (Table 3). In the training cohort, our model had a sensitivity of 69.1%, a specificity of 63.4%, a positive predictive value of 28.8%, and a negative predictive value of 90.5%.
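The relation between these four measures can be checked with Bayes' rule. The sketch below is an illustration only: the function name is hypothetical, and it plugs in the overall PE incidence (17.98%) as the prevalence because the training-cohort prevalence is not stated in the text.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence              # true-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Assumption: overall incidence used in place of the unreported training prevalence.
ppv, npv = predictive_values(0.691, 0.634, 0.1798)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under this assumption the implied values are about 0.29 and 0.90, close to the reported 28.8% and 90.5%; the small gap is what one would expect if the training-cohort prevalence differs slightly from the overall incidence.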

Model visualization
Multivariate analysis revealed that age (OR = 1.022, 95% CI, 1.002-1.044), pulse (OR = 1.007, 95% CI, 1.000-1.015), systolic pressure (OR = 0.986, 95% CI, 0.974-0.999), D-dimer (OR = 2.032, 95% CI, 1.251-3.302), and coronary heart disease (OR = 1.089, 95% CI, 1.049-1.132) were independent predictors for PE (Table 3). The nomogram (Figure 3) shows the predictive model for PE based on the six selected variables: age, pulse, systolic pressure, syncope, D-dimer, and coronary heart disease. To use the nomogram, each variable is assigned a score based on its value, and the scores are summed to obtain a total score. A vertical line is then drawn from the total score axis to the probability axis to obtain the estimated probability of PE. For example, if a patient is 80 years old and has a pulse rate of 160 beats per minute, systolic pressure of 82 mmHg, no history of syncope, a D-dimer level of 4.3 mg/L, and no history of coronary heart disease, then the total score would be 250. The vertical line from the total score of 250 intersects the probability axis at approximately 0.21, indicating a 21% estimated probability of PE.
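Odds ratios this close to 1 are easier to read when rescaled to a clinically meaningful step. The sketch below assumes the reported ORs are per one-unit increase (one year of age, 1 mmHg of systolic pressure); the units are not stated in the text, so this is an assumption, and the function name is hypothetical.

```python
import math


def scale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units`, given the odds ratio per single unit."""
    # In logistic regression the log-odds are linear in the predictor,
    # so the OR for k units is exp(k * ln(OR per unit)).
    return math.exp(math.log(or_per_unit) * units)


# Assumption: reported ORs are per 1 year of age and per 1 mmHg of systolic pressure.
age_or_10y = scale_odds_ratio(1.022, 10)
sbp_or_10mmhg = scale_odds_ratio(0.986, 10)
print(f"{age_or_10y:.3f} {sbp_or_10mmhg:.3f}")
```

Under that assumption, ten additional years of age correspond to roughly 24% higher odds of PE, and a 10 mmHg higher systolic pressure to roughly 13% lower odds.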

Model validation and evaluation
The discriminatory ability of the numerical model, as measured by the AUC, was 0.721 (95% CI, 0.676-0.766) in the training cohort and 0.709 (95% CI, 0.633-0.784) in the validation cohort (Figure 4a and b). The calibration plots, shown in Figure 5a and b, reveal good consistency between predicted probabilities and actual outcomes for both the training and validation cohorts, as evidenced by the proximity of the apparent calibration curve to the ideal line. The DCA curves, presented in Figure 6a and b, demonstrate that the numerical model had a favorable net clinical benefit, with screening strategies based on our nomogram PE risk estimates yielding greater net benefit than both screen-none and screen-all strategies within the threshold probability range of 0.08-0.50. Furthermore, the clinical impact curve and net reduction curve depicted in Figures 7 and 8, respectively, indicate that our nomogram has a significant net clinical benefit.
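The net benefit plotted in a decision curve follows the standard definition: the true-positive rate minus the false-positive rate weighted by the odds of the threshold probability. The Python sketch below illustrates that definition and is not the "rmda" implementation; the function names and the toy counts are assumptions.

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Net benefit of a screening rule at a given threshold probability."""
    weight = threshold / (1 - threshold)  # odds of the threshold probability
    return true_pos / n - (false_pos / n) * weight


def net_benefit_screen_all(prevalence, threshold):
    """Screen-all strategy: every case is a true positive, every non-case a false positive."""
    weight = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * weight


# Toy counts (not study data): 10 true and 20 false positives among 100 patients.
toy_nb = net_benefit(10, 20, 100, 0.20)
# With the reported 17.98% incidence, screen-all breaks even at that threshold.
all_nb_at_incidence = net_benefit_screen_all(0.1798, 0.1798)
print(toy_nb, all_nb_at_incidence)
```

The screen-none strategy has a net benefit of zero at every threshold, so a model is useful over the range where its curve lies above both zero and the screen-all line.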

Discussion
In this study, we developed a novel predictive model for the probability of PE in patients with cardiovascular diseases. Our model utilized six variables representing high-risk disease, namely age, pulse, systolic pressure, syncope, D-dimer, and coronary heart disease, all of which are easily obtainable clinical features and biomarkers during routine health assessments. Our findings showed that the model exhibited good discrimination with an area under the ROC curve of 0.721 (95% CI, 0.676-0.766), indicating its ability to distinguish between patients with and without PE. Furthermore, the calibration plots demonstrated that the model had good consistency between predicted and observed probabilities in both training and validation cohorts. Additionally, the DCA suggested that our model had a favorable net clinical benefit within the threshold probability range of 0.08-0.50.
Predicting PE in cardiology patients  7 Currently, there are no predictive models available for predicting the risk of PE specifically in patients with cardiovascular disease.However, previous studies have developed predictive models for venous thromboembolism in other patient populations.For instance, Zhou et al. [22] developed a predictive model for PE in patients with cough or chest pain based on laboratory variables, which had an AUC of 0.692.Li et al. [23] developed a clinical predictive model for lower extremity deep venous thrombosis in patients admitted to the neurointensive care unit, which had an AUC of 0.817.Zhang et al. developed a predictive model for postoperative venous thromboembolism [24].
Our study has several strengths compared to previous studies. First, we utilized the LASSO regression method to select the optimal predictive features, which improved the accuracy and robustness of the predictive model. Second, our model used only six readily available high-risk variables, which makes it simpler and more efficient to use in clinical practice. Finally, our model specifically focuses on the prediction of PE in patients with cardiovascular disease, which makes it highly relevant for clinicians dealing with this population.
D-dimer is an indicator reflecting fibrinolytic function and can be used to diagnose thrombotic diseases. According to our study, D-dimer (OR = 2.032, 95% CI, 1.251-3.302) was identified as an independent predictor for increased risk of PE. This finding is consistent with those of previous research [25] that has linked high D-dimer levels with increased risk of developing PE. Although D-dimer is currently the only biomarker used in routine clinical practice to predict PE, its specificity is limited, leading to high rates of false-positive results. Elderly patients have increased hospitalization rates and the highest inpatient mortality due to PE, as demonstrated in a large-sample study conducted from 2000 to 2015 [26]. Another retrospective study indicated that age is associated with the severity of submassive PE stadium [27], and our model also found age (OR = 1.022, 95% CI, 1.002-1.044) to be a high-risk factor for PE, which is consistent with the findings of previous research. Most of the factors in our model were positively associated with the risk of PE, except for systolic blood pressure, which was negatively associated. Low systolic pressure has been linked to an increased risk of PE-related mortality, as shown in a previous study [28].
The relationship between lower systolic blood pressure and higher PE occurrence is primarily due to the pathophysiology of PE itself. When a blood clot (or embolus) travels through the bloodstream and lodges in the pulmonary arteries, it prevents effective oxygen exchange in the lungs. This can lead to acute right heart failure because the right side of the heart has to pump harder against the increased resistance in these blocked arteries, which can in turn reduce cardiac output and lower systemic blood pressure. Our data also revealed that pulse rate was included in the model to predict PE. Consistent with our findings, a previous study [29] identified pulse rate as a good predictor of PE. This happens because when a blood clot obstructs the pulmonary arteries, the right ventricle of the heart has to work harder to pump blood through these vessels. This increased workload can result in a faster heart rate. Coronary heart disease [30] has also been identified as a factor that affects the risk of PE, which is consistent with its inclusion in our model. Our study obtained an interesting result: patients with syncope were less likely to have PE. Syncope is a common clinical symptom of PE [31], but our findings differ from what is generally observed. We analyzed the clinical information of 48 patients with syncope and found that the possible explanation for this result is that these patients mainly had vasovagal syncope or orthostatic hypotension syncope. They were sent for CTPA only for exclusion and did not have a high suspicion of PE. Additionally, patients with PE who experienced syncope were sent to the respiratory department for treatment.
The development of an accurate model for predicting the probability of PE in patients with cardiovascular disease has significant implications in clinical practice. This model can aid healthcare professionals in making timely and personalized diagnoses, leading to the formulation of effective and personalized treatment plans. By identifying high-risk variables such as age, pulse, systolic pressure, syncope, D-dimer, and coronary heart disease, this model can assist in identifying patients who require further diagnostic workup or more aggressive treatment, while reducing the need for unnecessary CTPA screening. Furthermore, the use of a nomogram to visualize the model's output makes it easier for healthcare professionals to interpret the results and communicate them to patients. The high net clinical benefit demonstrated by the clinical decision curve, clinical impact curve, and net reduction curve analyses suggests that this model has the potential to improve patient outcomes and reduce healthcare costs [32].
However, there are some limitations to our study. For example, the sample size of this study was smaller than that in some previous studies, which may limit the generalizability of our findings. Additionally, because our model was developed using retrospective data, there may be a risk of selection bias and confounding. Finally, external validation in other clinical settings is required to assess the generalizability and reliability of our model.
In conclusion, the novel model developed in this study has the potential to become a valuable clinical tool for predicting the probability of PE in patients with cardiovascular diseases, leading to more accurate diagnoses and personalized treatment plans. However, it is important to note that this study is retrospective, and further prospective studies are required to validate the accuracy and clinical usefulness of this model.

Figure 1: Flowchart of the steps for predicting PE diagnosis.

Figure 2: Tuning parameter selection using LASSO regression in the training cohort. (a) LASSO coefficient profiles of the clinical features. (b) Optimal penalization coefficient lambda was generated in LASSO through tenfold cross-validation. The lambda value of the minimum mean square error is shown in the figure.

Figure 3: Nomogram based on the combination of the six indicators was developed using logistic regression analysis. If a patient has a total score of 250, then the probability of developing PE is 0.21. DD, D-dimer; CHD, coronary heart disease.

Figure 4: ROC curves of the model to distinguish PE from non-PE in the training (a) and validation (b) cohorts.

Figure 5: Calibration curves of the model in the training (a) and validation (b) cohorts. A perfectly accurate predictive model will generate a plot in which the observed and predicted probabilities fall entirely along the ideal line (dashed line). The apparent calibration curve (blue line) represents the calibration of the model, while the bias-corrected curve (red line) is the calibration result after correcting the optimism with fivefold cross-validation.

Figure 7: Clinical impact curve of the model in the training (a) and validation (b) cohorts. The red line indicates the number of subjects who are judged as being at high risk by the model under different probability thresholds. The blue line indicates the number of subjects who are judged by the model to be at high risk and who actually have an outcome event under different probability thresholds.

Figure 8: Net reduction curve of the model in the training (a) and validation (b) cohorts. Using predictive models can reduce the number of interventions by 40% at a risk threshold of 30%.

Table 1: Baseline characteristics of the study subjects

Table 2: Baseline characteristics of the enrolled patients in the training and validation cohorts

Table 3: Final model coefficients