Detecting myocardial scar using electrocardiogram data and deep neural networks

Ischaemic heart disease is among the most frequent causes of death. Early detection of myocardial pathologies can increase the benefit of therapy and reduce the number of lethal cases. Presence of myocardial scar is an indicator for developing ischaemic heart disease and can be detected with high diagnostic precision by magnetic resonance imaging. However, magnetic resonance imaging scanners are expensive and of limited availability. It is known that the presence of myocardial scar has an impact on the well-established, reasonably low-cost, and almost ubiquitously available electrocardiogram. However, this impact is non-specific and often hard to detect by a physician. We present an artificial-intelligence-based approach, namely a deep learning model, for the prediction of myocardial scar based on an electrocardiogram and additional clinical parameters. The model was trained and evaluated by applying 6-fold cross-validation to a dataset of 12-lead electrocardiogram time series together with clinical parameters. The proposed model for predicting the presence of scar tissue achieved an area under the curve score, sensitivity, specificity, and accuracy of 0.89, 70.0%, 84.3%, and 78.0%, respectively. This promisingly high diagnostic precision of our electrocardiogram-based deep learning models for myocardial scar detection may support a novel, comprehensive screening method.


Introduction
Ischaemic heart disease (IHD) is a major cause of death in countries with a Western lifestyle and therefore not only causes personal suffering but also has a high socioeconomic impact (Benjamin et al. 2019; World Health Organization 2018). Early detection of IHD with subsequent treatment initiation can reduce mortality and alleviate the course of the disease. As a preliminary stage of IHD, the development of myocardial scar (MS) is a potential early indicator, and its detection together with appropriate therapeutic measures can prevent IHD or at least reduce the effects of the disease.
Non-invasive state-of-the-art methods for the diagnosis of MS all suffer from drawbacks impeding their use in comprehensive screening. Magnetic resonance imaging (MRI) is considered the gold standard for diagnosing MS, achieving high sensitivity and specificity (Winau et al. 2018). Intramyocardial late Gadolinium enhancement (LE) of the contrast agent can be visualised by MRI and has been shown to serve as an indicator for scar tissue (Kim et al. 1999). The drawbacks of applying MRI as a screening method are its limited availability, high costs, and the time-demanding imaging process with high noise exposure affecting patient comfort (Oikarinen et al. 2013).
The presence of MS has an impact on electrocardiogram (ECG) signals, but the effects are heterogeneous and complex to interpret, so that health-care professionals do not achieve the required diagnostic sensitivity in clinical routine. For this reason, the abundantly available ECG cannot currently be applied as an alternative to MRI (Inoue et al. 2017; Markendorf et al. 2019).
Scoring systems for the evaluation of ECG recordings have been developed since the early 1970s, for example the Selvester QRS score (Selvester et al. 1971; Selvester et al. 1985). However, it has been shown that this score can only achieve high sensitivity and specificity for the detection of large MS, whereas its diagnostic performance for small or medium-sized MS is limited. Numerous studies have used machine learning (ML) or deep learning (DL) algorithms for ECG-based diagnosis of cardiac diseases. For example, detection of arrhythmia has been achieved with classic ML approaches such as support vector machines (SVM) (Albuquerque et al. 2018). For the identification of acute myocardial infarction, several convolutional neural network (CNN) models have been proposed, achieving high diagnostic performance (Acharya et al. 2017; Baloglu et al. 2019; Liu et al. 2018; Strodthoff and Strodthoff 2019). A CNN is a bio-inspired neural network architecture whose internal structure has analogies to the visual cortex of the animal eye (Hubel and Wiesel 1962; LeCun et al. 1989).
For the automated detection of MS, so far only the classic ML method SVM has been used (Dima et al. 2013;Goovaerts et al. 2019). This method requires pre-extraction of features to make ECG recordings accessible for algorithmic processing. These features either have to be defined or identified using signal-processing procedures such as the time-domain morphology algorithm (Mazomenos et al. 2012) or variational mode decomposition (Dragomiretskiy and Zosso 2014).
In this paper, we present a DL model for the detection of MS that uses raw ECG recordings and corresponding clinical parameters for each patient (e.g. cardiovascular risk factors), without prior knowledge about medical interpretations of patterns. Such a model may be further elaborated to allow future clinical use as a simple, rapid, and cheap screening test for MS. Parameters of the model were calibrated using supervised learning on the given set of ECG recordings. During this process, the model approximates the underlying mathematical decision function.

Results
In this section, we first describe the data pipeline used for data cleansing and preprocessing. Second, we present two DL architectures: one for a model based on raw ECG recordings alone, and a second for a model based on raw ECG recordings together with additional clinical parameters. For extensive information about the applied methods, we refer to section Materials and methods.

Data pipeline
Data for both training and application of the trained models is supplied through a data pipeline consisting of several preprocessing steps arranged in two separate branches for ECG recordings and for clinical parameters, respectively (Figure 1).
First, ECG recordings stored in XML HL7v3 format (Health Level Seven, Inc. 2020) are loaded, and artefacts, e.g. zero lines at the beginning and end of the recordings, are removed by cropping. The values are then scaled to a narrower range, which makes them more accessible for deep learning algorithms. Next, data augmentation (subsampling) is applied to compensate for the relatively small number of training records before the data is saved in the last step of the pipeline.
Parallel to preprocessing the ECG recordings, clinical parameters run through a second branch of the pipeline. After initial processing, i.e. loading and categorisation, attributes are transformed into one-hot encoded vectors which are then concatenated into a single vector to be used as input for the neural network.
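As an illustration of this encoding step, the following is a minimal sketch in Python; the attribute names and category sets are illustrative and do not reproduce the study's exact 55-valued schema:

```python
def one_hot(value, categories):
    """Return a one-hot list for `value` over the ordered `categories`."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

def encode_patient(params, schema):
    """Concatenate one-hot encodings of all attributes into one flat vector."""
    vec = []
    for name, categories in schema:
        vec.extend(one_hot(params[name], categories))
    return vec

# Hypothetical schema with three attributes (2 + 3 + 2 = 7 vector entries)
schema = [
    ("sex", ["female", "male"]),
    ("smoking", ["never", "former", "current"]),
    ("diabetes", ["no", "yes"]),
]

vec = encode_patient({"sex": "male", "smoking": "never", "diabetes": "yes"}, schema)
```

Applied to all attributes of the study, such a concatenation yields the single input vector fed to the neural network.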
The pipeline provides data for two experimental setups: an ECG-based model and a combined model based on ECG data and clinical parameters. The two strands, preprocessed ECG samples and encoded clinical parameters, can be joined via an anonymised unique identifier (record-ID). This links each set of clinical parameters to the corresponding ECG recording, so that the two types of data can be used either together or separately.

ECG model
CNNs were originally designed for pattern identification in images (LeCun et al. 1989) and have already been used for ECG analysis, yielding considerable performance for specific tasks (e.g. Strodthoff and Strodthoff (2019)). Therefore, CNNs are a promising architectural choice for the task of detecting MS in ECG recordings.
As starting point of our DL model, we used the CNN architecture proposed by Strodthoff and Strodthoff (2019), which was developed for the identification of acute MI. The following characteristics of the architecture were optimised for our specific medical purpose by grid search optimisation:

- Number of convolutional layers
- Number of filters in the convolutional layers
- Kernel/filter size of the convolutional layers
- Pooling (yes/no) after convolutional layers
- Additional dropout between pooling and convolutional layers with varying dropout rates

Furthermore, we extended the initial starting configuration with additional fully-connected dense layers, experimentally adjusting their number, width, and the respective dropout rates. The final optimised CNN (Appendix, Figure 3) consists of an input layer (width = 2000) followed by a successive series of convolutional and pooling layers reducing the input to a width of 64. Between these layers, dropout with a rate of 0.25 proved to prevent the model from overfitting. Four fully-connected layers (width = 64), each followed by dropout with a rate of 0.1, aggregate the information into a final layer, which yields a probability distribution over two classes (presence versus absence of MS) based on a softmax activation.
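Dropout, used here at rates of 0.25 and 0.1, randomly silences units during training to prevent overfitting. A minimal numpy sketch of inverted dropout, as an illustration of the technique rather than the Keras implementation used in the study:

```python
import numpy as np

def dropout(x, rate, rng, train=True):
    """Inverted dropout: zero a `rate` fraction of units during training and
    rescale the survivors by 1/(1-rate), so inference needs no change."""
    if not train or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep   # True = unit survives
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 64))                    # a batch of hidden activations
y = dropout(x, rate=0.25, rng=rng)      # entries are 0 or 1/0.75
```

At inference time (`train=False`) the input passes through unchanged, which is why the surviving units are rescaled during training.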

Combined model
Besides the experiments with ECG recordings alone, we investigated whether using ECG data together with clinical parameters of the individual patient in a combined DL model could improve diagnostic performance. Therefore, the CNN model described before was combined with a fully-connected feedforward network (FFN) which was applied to the records with clinical parameters (Appendix, Figure 4). The characteristics of this network were optimised analogously to that of the CNN. The final FFN structure has an input layer of 55 nodes followed by four fully-connected layers (width = 256). All layers but the first are connected with dropout rates of 0.5.
The outputs of the CNN and FFN form the 4-node input for a final FFN with three hidden layers (width = 4), each having a dropout rate of 0.25. As for the ECG-only model, the final output layer of the combined model contains two nodes for the overall prediction (presence of MS, absence of MS) and uses a softmax activation. The combined model was trained in a separate end-to-end training process and does not rely on pre-trained sub-models.

Experimental results
We now present performance measures for both models. Table 1 contains areas under the curve (AUC) of receiver operating characteristic analyses, sensitivities, specificities, and accuracies of the ECG-only (listed in the columns on the left hand side) and the combined model (listed on the right hand side). The measures were determined on subsample-level and on patient-level applying 6-fold cross-validation with 50 repetitions for each split, see paragraph Performance metrics of section Materials and methods.
With an outlook to a potential clinical application, the positive predictive value (PPV) and negative predictive value (NPV) of the ECG model would be 73.0 ± 17.5 and 72.3 ± 11.3%, respectively. The PPV and NPV of the combined model would be 84.2 ± 12.2 and 78.2 ± 14.0%, respectively.
The results of both models are discussed comparatively and related to alternative approaches in the following section. Graphs of the receiver operating characteristic (ROC) analyses can be found as Figure 5 in the Appendix.

Discussion
In this section, we discuss the main findings of this study, compare our results to state of the art approaches, and point out limitations as well as potential directions of future work.

Model performance
Our results presented in Table 1 show that combining ECG data and clinical parameters produced the best average performance over the six cross-validation runs with respect to each of the measures (AUC = 0.89, Sensitivity = 70%, Specificity = 84.3%, and Accuracy = 78%). The same holds for the maximal values and for the results of the single runs. Apparently, supplementary information in the form of clinical parameters can improve model performance.
We observed high measures for AUC, sensitivity, and specificity, but not necessarily in the same models. The model with the highest AUC can be seen as the model with the best compromise between specificity and sensitivity; depending on the application, models selected for maximum sensitivity or maximum specificity can be more appropriate. Our approach computes models with good performance from all three perspectives.
The area under the receiver operating characteristic curve was used as performance measure, as this statistical method is well established in medical studies evaluating diagnostic tests. Under specific clinical requirements, such an unweighted compromise might not be ideal. For example, a diagnostic rule-out approach aims for optimal sensitivity, whereas for a diagnostic rule-in algorithm, high specificity is of greater importance.
Due to the relatively small case number of n = 114, we augmented data by subsampling and applied 6-fold cross-validation. Therefore, our validation datasets contained data from 19 patients in each cross-validation run, so that validation was performed with n = 1900 on subsample-level and n = 19 on patient-level. Predictions on subsample-level were condensed into patient-level predictions by computing the mean of all subsample predictions of the corresponding patient. Surprisingly, the standard deviation of model performance on subsample-level is close to that on patient-level despite the different sizes of the validation datasets. Furthermore, there is no significant difference between the model performance on subsample-level and on patient-level, indicating that subsampling together with accumulating predictions to patient-level is viable.
More elaborate statistical evidence, e.g. confidence intervals or test measures, cannot be reported as the influence of fluctuations is too strong for the given sample size (see discussion in Isaksson et al. (2008) and Bengio and Grandvalet (2004)). We therefore applied cross-validation with repeated training and validation (50 times for each of the six split-variants), reporting mean and variance values for sensitivity, specificity, accuracy, and AUC.
Accuracy must be handled with care, as it is known to fail as a performance metric for imbalanced class distributions. Table 3 shows that patients with LE and those without are almost equally distributed, so that accuracy is applicable in our case.
Table 1 notes: Best model according to highest AUC on patient-level marked with *. AUC = area under receiver operating characteristic curve; NPV = negative predictive value; PPV = positive predictive value; SD = standard deviation.
Comparison to existing approaches

Table 2 relates the performance of our models to that of applying the Selvester score (Selvester et al. 1971; Selvester et al. 1985), human-based ECG evaluation, and two SVM-based approaches. Both our models deliver comparably high performance and reach (ECG-based model) or exceed (combined model) human performance as published by Asch et al. (2006), Carpenter et al. (2015), and Markendorf et al. (2019).
When comparing our results with those of the other methods listed in Table 2, it must be taken into account that the underlying cohorts differ in size, structure, and time of study. Considering the larger training sets used by Goovaerts et al. (2019) and by Dima et al. (2013) and the limited number of cases used by Chaudhry et al. (2017) and Bignoto et al. (2018), our results are similar. For a more substantiated comparison, the methods would have to be applied to identical data, see Bengio and Grandvalet (2004). It is noteworthy that only Goovaerts et al. (2019) use AUC as performance measure, which is of high relevance in the medical context and is the only measure that does not depend on a single threshold.
In summary, our proposed model achieves reasonable performance on a relatively small case number used for training, which is a promising result, encouraging further research and validation.

Limitations
Over recent years, DL has frequently been used in the medical context. A recent meta-analysis has shown that existing DL models detecting diseases by analysis of medical images achieve diagnostic performance equivalent to that of health-care professionals (Liu et al. 2019). However, only 25 of the 31,587 studies under investigation could be included in this meta-analysis; the rest had to be excluded for reasons such as missing external validation or lack of transparency. A limited amount of appropriate data is often one of the main obstacles for a comprehensive use of DL models in general and in healthcare in particular.
In the case of the approach presented in this article, the limited number of 114 patients and the heterogeneity within the data also constituted a key limitation for model development. Due to this limitation, we exclusively performed an internal validation (in-sample validation) using 6-fold cross-validation; see Altman and Royston (2000) for a description of different types of validation sets. An external validation (out-of-sample validation) with data derived from new, previously unknown patients is currently missing and will be critical to confirm the high performance measures achieved in this study.
It should further be mentioned that the data partly includes redundant information: different leads of the ECG may contain similar patterns, and the clinical parameters body-mass index (BMI) and body-surface area (BSA) are not completely independent.
Furthermore, all data used for training and validation was derived from patients scheduled for an MRI examination. This means that the ECG recordings, MRI diagnoses, and clinical parameters used in our approach were obtained from patients with a relevant risk for cardiovascular diseases. It is uncertain whether the performance achieved in this pre-selected cohort would generalise to a broader, lower-risk population.

Future work
Enrolling patients and thereby collecting further medical data is an ongoing process in our project and will allow us to use a more comprehensive dataset for model training and hyperparameter optimisation in the future. A larger set of medical data will also enable holding out data for external validation. Therefore, our main focus will be to test the generalisation performance of our DL model by reproducing the promising performance measures achieved in the presented study on this out-of-sample data. Furthermore, an interesting aspect for validation will be the comparison of the diagnostic performance of our DL model with established diagnostic approaches used in daily clinical routine.
Another important issue for future work is to elucidate the specific ECG patterns and relevant clinical parameters leading to the diagnostic prediction of the DL model. Potential methods from the emerging field of explainable artificial intelligence (XAI) such as layer-wise relevance propagation (LRP) (Bach et al. 2015), local interpretable model-agnostic explanations (LIME) (Ribeiro et al. 2016) and deep learning important features (DeepLIFT) (Shrikumar et al. 2017) may provide transparency and traceability and thereby dissolve the black-box nature of the DL model. Such methods can further facilitate the use of DL methods in clinical routine due to an increase of trust (Andras et al. 2018).

Conclusion
In spite of the comparatively small case number in the presented study, the developed DL models reached promising performance measures for detecting MS. The procedure can be applied to ECG data and clinical parameters directly and does not require any pre-extraction of features, which enables simple and flexible application. The potential reduction of expensive MRI examinations and the possible use as a screening tool for MS justify further experimentation. Future implementations that provide validation and explainability of DL predictions are essential to increase credibility and trust in this promising technology and will help to establish its use for medical applications.

Materials and methods

Medical data
Data on clinical parameters, ECG, and MRI from 114 patients enrolled in the Kerckhoff Biomarker Registry were used for the training and evaluation of the deep neural networks.
Clinical cohort: Participants whose data was used in this project were recruited from an ongoing cardiovascular imaging substudy, a longitudinal prospective cohort study that started in 2016 as part of the Kerckhoff Biomarker Registry (BioReg). This study included patients older than 18 years with a clinical indication for cardiovascular magnetic resonance (CMR). All patients gave written informed consent and the study was approved by the local ethics committee. Patients enrolled with suspected coronary artery disease (CAD) or with suspected progression of an established CAD who therefore underwent CMR stress testing were eligible for further investigation within the present project. For the present analyses, only patients for whom both an XML ECG and an MRI dataset were available were used, leading to a study sample of 114 patients.
Clinical variables: At baseline, the following variables were obtained using questionnaires, patient history, examination, and available medical documentation: sex, age, body-mass index (Appendix, Formula 1), body-surface area (m²) (Appendix, Formula 2), smoking status, arterial hypertension (defined as systolic blood pressure ≥140 mm Hg and diastolic ≥90 mm Hg or the intake of antihypertensive medication), diabetes mellitus (defined as a fasting blood sugar concentration ≥126 mg/dl, non-fasting blood sugar ≥200 mg/dl, or the intake of antidiabetic medication), dyslipidemia, presence of familial disposition to heart disease, and presence of CAD defined as atherosclerotic changes of the coronary vessels. Table 3 provides an overview of the distribution of cardiovascular risk factors in the used cohort. As only patients with relevant risk scheduled for an MRI were enrolled, the resulting cohort might not be representative of the general population. For comparison, prevalences described in the literature are given in Table 7 in the Appendix.
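For illustration, BMI and a common BSA estimate can be computed as follows. The study's exact formulas are given in its Appendix (Formulas 1 and 2), so the Du Bois BSA formula below is an assumption:

```python
def bmi(weight_kg, height_m):
    # Body-mass index: weight divided by squared height (kg/m^2)
    return weight_kg / height_m ** 2

def bsa_du_bois(weight_kg, height_cm):
    # Body-surface area (m^2) via the Du Bois formula; the study's exact
    # formula is listed in its Appendix and may differ.
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725

# For an 80 kg, 1.80 m patient: BMI ≈ 24.7 kg/m^2, BSA ≈ 2.0 m^2
b = bmi(80, 1.80)
s = bsa_du_bois(80, 180)
```

As noted in the limitations, these two derived quantities are not completely independent, since both depend on weight and height.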
ECG data: For each patient of the evaluated cohort, a 12-lead ECG was obtained on the same day as the MRI examination. The resulting ECG recording contains a 10 s time series for each lead and is stored in the XML-based HL7v3 format (Health Level Seven, Inc. 2020). The ECG dataset was generated using a commercially available electrocardiogram device (Cardiovit AT-102P, Schiller-Reomed AG, Obfelden, Switzerland) with the following parameters: frequency range: 0.05-150 Hz; measuring range: ±300 mV; sampling rate: 500 data points per second (5000 in 10 s, per lead); digital resolution: 5 μV/18 bit.
MRI data: Each evaluated patient underwent a clinically indicated MRI evaluation using standardised procedures. In this imaging examination, structural conditions of the myocardium are assessed based on the temporal and spatial distribution of Gadolinium as contrast agent (Dulce et al. 1993; Saeed et al. 1989). The contrast agent cannot pass intact cell membranes, which leads to higher contrast of structurally impaired tissue, e.g. scar (Kim et al. 1999). This makes it possible to identify myocardial areas showing so-called late contrast enhancement (LE) as areas with myocardial scar. The final diagnosis was established for all patients by experienced cardiologists blinded to the ECG data. The MRI dataset was generated using a clinically established MRI device (Magnetom Skyra 3T, Siemens Healthcare GmbH, Erlangen, Germany) with a stress MRI protocol using adenosine for the induction of pharmacological stress.

Algorithmic procedure
In the following subsection, we describe in detail the components of the data pipeline presented in section Results. For an overview of the data pipeline, see Figure 1. Furthermore, we present the performance metrics used for the evaluation of the DL models and provide details about the training procedure.
Cropping: By cropping, the ECG recordings were cleaned of artefacts that, for instance, may have been caused by inappropriate electrode contact. In the ECG recordings used in this work, artefacts were only present at the start and end of the recordings, which simplified their elimination by removing the corresponding intervals.
Scaling: The ECG recordings contained values measured in the range of −2000 to 2000 μV. We scaled the values to mV, yielding a range of −2 to 2 mV. This reduced the absolute distance between minimum and maximum values, which is a recommended preprocessing step for data to be fed into neural networks (Bishop 1995, p. 298f.).
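A minimal sketch of the cropping and scaling steps, assuming artefacts appear as exact zero lines at the edges of a recording (the study's actual cropping criteria may be more involved):

```python
import numpy as np

def crop_zero_edges(signal):
    """Remove zero-line artefacts at the start and end of a recording."""
    nonzero = np.flatnonzero(signal)
    if nonzero.size == 0:
        return signal[:0]                       # recording is all artefact
    return signal[nonzero[0]:nonzero[-1] + 1]   # keep first..last nonzero

def to_millivolts(signal_uv):
    """Scale from microvolts (-2000..2000) to millivolts (-2..2)."""
    return signal_uv / 1000.0

# Toy lead with zero lines at both edges (values in microvolts)
lead = np.array([0.0, 0.0, 150.0, -300.0, 2000.0, 0.0])
cropped = crop_zero_edges(lead)
scaled = to_millivolts(cropped)
```

Interior zero crossings, which are physiologically normal, survive this edge-only cropping.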
Augmentation: In order to prevent the neural network from overfitting on the comparatively small dataset of 114 records, we increased the number of unique ECG samples by data augmentation via subsampling. This is medically justified, as the presence of MS and the associated impaired conductivity of the myocardium should lead to permanent changes in the shapes of the ECG waveforms, present in all subsamples. Therefore, samples shorter than the standard 10 s still contain features relevant for distinguishing pathological ECGs with patterns caused by MS from those without such patterns.
In order to capture at least two heartbeats in each sample, a subsample size of 4 s was used. This approach has already been used previously (Strodthoff and Strodthoff 2019). Figure 2 illustrates the subsampling process used in this work. Subsampling generates new, distinct, non-identical samples from an existing sample using equidistant strides of an extraction window to extract overlapping subsamples. The proportion of overlap varies according to the window size w, the sample size s, and the subsampling factor f. Based on these parameters, the stride distance d is defined as d = f · (s − w) / w. The clinical parameters for each subsample were copied from the original dataset sample without further adaption. We used a window size of 2000 time steps and a subsampling factor of 100. The stride distance thus varied between 33 and 150, depending on the size of the cropped samples, which were in the range of 2664-5000 time steps.
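As a sketch of the subsampling step, assuming the stride distance d = f · (s − w) / w, which reproduces the reported stride range of 33-150 for cropped samples of 2664-5000 time steps (the exact procedure is shown in the paper's Figure 2):

```python
import numpy as np

def subsample(signal, window=2000, factor=100):
    """Extract overlapping windows of size `window` at equidistant strides
    d = factor * (s - window) / window from a cropped recording."""
    s = len(signal)
    d = factor * (s - window) / window
    if d <= 0:
        return [signal[:window]]        # recording no longer than one window
    starts = np.arange(0, s - window + 1, d).astype(int)
    return [signal[i:i + window] for i in starts]

sig = np.arange(5000)                   # a cropped recording of 5000 steps
subs = subsample(sig)                   # stride d = 100 * 3000 / 2000 = 150
```

The last window ends exactly at the end of the recording, so the whole cropped signal is covered by the overlapping subsamples.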
Categorisation and one-hot encoding: All clinical parameters were one-hot encoded and combined into a 55-valued vector which served as input for the FFN. In a preceding step, all parameters based on continuous float values were categorised using the ranges listed in Table 6.

K-fold cross-validation: The models presented in this work were evaluated using k-fold cross-validation (KFCV) (Stone 1974, 1978), which is a common method in ML for training on small datasets (Bishop 1995, p. 372ff.). The main idea behind KFCV is to use all available samples for both training and validation. The initial overall dataset was first shuffled and then split into six partitions, each containing the subsamples of 19 patients. Each partition was used for validation once, while the five remaining partitions were used for training. These six split-variants were used throughout all experiments.
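The patient-wise partitioning can be sketched as follows; keeping all subsamples of one patient in a single fold prevents leakage between training and validation. The shuffling details are illustrative:

```python
import random

def patient_folds(patient_ids, k=6, seed=0):
    """Shuffle patients and split them into k folds so that all subsamples
    of one patient end up in the same partition."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

# 114 patients split into 6 folds of 19 patients each
folds = patient_folds(range(114), k=6)
```

Each fold then serves as the validation set once, with the subsamples of the remaining five folds used for training.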
We repeated the training on each split-variant 50 times and selected the best model for each variant using the AUC on patient-level (see the next paragraph) as the target metric for performance evaluation and model selection. This selection resulted in six models, whose performance metrics were averaged to obtain the cross-validated performance.
Performance metrics: We used the following metrics to evaluate the performance of the DL models.

Sensitivity: Refers to the ability of the DL model to correctly detect diseased patients. It is defined as the ratio of true positive (TP) predictions to the sum of TP and false negative (FN) predictions (Appendix, Formula 3). A TP prediction occurs if the presence of MS was identified by the DL model and was also diagnosed by MRI. A prediction is an FN if no MS was detected but the presence of MS was diagnosed by MRI.
Specificity: Refers to the ability of the DL model to correctly reject non-diseased patients. It is the ratio of true negative (TN) predictions to the sum of TN and false positive (FP) predictions (Appendix, Formula 4). A prediction is a TN if the DL model identified no presence of MS and this coincides with the MRI diagnosis. An FP prediction occurs if MS was detected by the model but was not diagnosed by MRI.
Accuracy: Is defined by the ratio of correct predictions to the total number of all predictions (Appendix, Formula 5).
Area under the curve (AUC): The AUC score is frequently used for the evaluation of diagnostic methods in medicine (Bradley 1997; Downey et al. 1999) and is defined as the area under the receiver operating characteristic (ROC) curve. The ROC curve graphically shows the relation between the TP rate (sensitivity) and the FP rate (1 − specificity) when the decision threshold is varied from 0 to 1 (Hanley and McNeil 1982; Metz 1978). An advantage of using the AUC is its independence from the decision threshold. An AUC of 0.5 is equivalent to random guessing.
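A compact way to compute the AUC without explicit thresholding is its rank-statistic interpretation (equivalent to the normalised Mann-Whitney U statistic):

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly drawn diseased patient receives
    a higher model score than a randomly drawn non-diseased patient
    (ties count as 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation yields 1.0; identical scores yield 0.5 (random guessing)
perfect = auc([0.9, 0.8], [0.2, 0.1])   # 1.0
```

This pairwise form is O(n²); ROC-curve integration gives the same value more efficiently for large samples.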
Negative predictive value (NPV): Describes the share of TN in the total negative predictions (Appendix, Formula 6).
Positive predictive value (PPV): Describes the share of TP in the total positive predictions (Appendix, Formula 7).
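The five threshold-dependent measures above can be computed directly from the confusion-matrix counts; the counts in the example are illustrative, not the study's results:

```python
def metrics(tp, fp, tn, fn):
    """Threshold-dependent performance measures from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),              # TP rate among diseased
        "specificity": tn / (tn + fp),              # TN rate among healthy
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "ppv":         tp / (tp + fp),              # share of TP in positives
        "npv":         tn / (tn + fn),              # share of TN in negatives
    }

# Illustrative counts: sensitivity = 7/10 = 0.70, specificity = 16/19 ≈ 0.84
m = metrics(tp=7, fp=3, tn=16, fn=3)
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on the prevalence of MS in the evaluated cohort.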
All performance metrics were computed on subsample- and patient-level. Metrics on subsample-level were directly based on the model predictions. For the patient-level metrics, we calculated the mean of the predictions for all subsamples belonging to an individual patient.
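The patient-level aggregation described above can be sketched as:

```python
from collections import defaultdict

def patient_level(predictions):
    """Condense subsample-level probabilities into one prediction per patient
    by averaging all subsample predictions of that patient."""
    grouped = defaultdict(list)
    for patient_id, prob in predictions:
        grouped[patient_id].append(prob)
    return {pid: sum(probs) / len(probs) for pid, probs in grouped.items()}

# Two subsample predictions per patient; p1 averages to ~0.7, p2 to ~0.3
preds = [("p1", 0.8), ("p1", 0.6), ("p2", 0.2), ("p2", 0.4)]
patient_preds = patient_level(preds)
```

The averaged probability can then be thresholded (e.g. at 0.5) to obtain the patient-level class decision.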

Training
Keras (Chollet 2015) with the Tensorflow (Abadi et al. 2015) backend was used for development and training of the neural networks. We performed hyperparameter optimisation based on grid search. An overview of the investigated hyperparameter values is listed in Table 5 in the Appendix. Further details about the training are provided in Table 4 in the Appendix.

The authors have applied for a patent on a method and apparatus for prediction and localisation of structural changes in myocardial tissue (DRN 2020021415065300DE).

Figure 5: Each plot shows the ROC curves of the six best models determined by 6-fold cross-validation on subsample-level (A) and on patient-level (B). Additionally, the summarised ROC (SROC) curves were calculated based on the sensitivities and specificities of the best models using the method proposed by Littenberg and Moses (1993). AUC = area under ROC curve.