Using OpenCL, we developed cross-platform software to compute electrical excitation conduction in cardiac tissue. OpenCL allows the software to run in parallel and on different computing devices (e.g., CPUs and GPUs). We used the macroscopic mono-domain model for excitation conduction and the atrial myocyte model by Courtemanche et al. for ionic currents. On a CPU with 12 HyperThreading-enabled Intel Xeon 2.7 GHz cores, simulations ran 1.6 times faster than with existing software that uses OpenMPI. On two high-end AMD FirePro D700 GPUs, the OpenCL software ran 2.4 times faster than the OpenMPI implementation. The more nodes the discretized simulation domain contained, the higher the achieved speed-up.
Computational models of the electrophysiology of cells and tissue are widely used in cardiac research [2]. Most cardiac tissue simulators, like most other scientific software, perform their calculations on the central processing unit (CPU) of a computer. Another computing unit present in most computers is the graphics processing unit (GPU). Originally designed to render 3D scenes to the screen in real time, GPUs provide many parallel pipelines for geometric calculations with low memory requirements. Recently, GPUs have increasingly been used for general-purpose, non-graphics, but highly parallel computing. Using parallel computing devices such as GPUs for cardiac mono-domain simulation has been shown to greatly improve simulation times [1].
We implemented the models described in section 2 in new, simple software to simulate cardiac tissue faster. We employed the mono-domain model, finite differences discretization, and OpenCL parallelization. The software was developed in the C programming language, with the parallelized portions in OpenCL C.
2.1 The mono-domain model
Assuming stationary current fields, we can formulate Poisson’s equation of the electric potential for the intracellular domain of electrically active tissue as follows:

∇·(σ∇Vm) = fi

where Vm is the transmembrane voltage, σ a conductivity tensor, and fi the total intracellular current source density. For tissue consisting of electrically active cells, the latter can be expressed more explicitly. The model expressed in one partial differential equation (PDE) then reads

∇·(σ∇Vm) = β(Cm ∂Vm/∂t + Iion + Istim)   (1)
with the myocyte surface-to-volume ratio β, the specific membrane capacitance per area Cm, the transmembrane ionic current density Iion, and a stimulus current density Istim.
Iion was calculated using the atrial myocyte model by Courtemanche, Ramirez, and Nattel (CRN) [3]. The model contains formulations for various ionic currents through ion channels, exchangers, and the cell membrane in general. Iion is the total density of these currents. The ionic currents depend on Vm and various state variables. The cell model state variables are formulated as a system of ordinary differential equations (ODEs) such that
dη/dt = f(t, η)   (2)

where η is a vector of L state variables with initial values η(t = t0) = η0, and f a function ℝ × ℝL → ℝL.
2.2 Finite differences discretization
The finite differences method approximates partial derivatives with differences. For this purpose, the domain is discretized spatially into regularly distributed nodes. The first-order derivative of a variable u in one dimension x can then be approximated by a difference such as

∂xu ≈ (u(x + Δx) − u(x))/Δx

or similar, where Δx is the distance between the nodes. For N nodes u = (u1, …, uN)T, the derivative can then be approximated as ∂xu = (∂xu1, …, ∂xuN)T ≈ Mu with M ∈ ℝN × N.
Thus, the left hand side of eq. 1 was approximated as

∇·(σ∇Vm) ≈ Ax
where x is a vector of spatially discrete values of Vm and A is the finite differences system matrix. We assumed no-flux boundary conditions.
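The assembly of such a system matrix can be sketched in plain C. The following is a minimal illustration for a 1D domain with uniform conductivity and no-flux boundaries, using a dense matrix for clarity (the actual software stores the matrix sparsely); the function name and parameters are illustrative, not taken from the software:

```c
/* Sketch: assemble a 1D finite-differences system matrix A (dense here
 * for clarity) for the uniform-conductivity operator sigma * d2/dx2
 * with no-flux boundary conditions. N nodes, node spacing dx. */
void assemble_fd_matrix(double *A, int N, double sigma, double dx)
{
    double w = sigma / (dx * dx);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i * N + j] = 0.0;
    for (int i = 0; i < N; i++) {
        /* no-flux boundaries: the mirrored ghost node folds into the
         * diagonal, so every row sums to zero */
        if (i > 0)     A[i * N + i - 1] += w;
        if (i < N - 1) A[i * N + i + 1] += w;
        A[i * N + i] -= (i > 0 ? w : 0.0) + (i < N - 1 ? w : 0.0);
    }
}
```

With no-flux boundaries every row of A sums to zero, so a spatially constant Vm produces no diffusive current.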
2.3 Time-stepping solution of differential equations
Solving eq. 1 for ∂Vm/∂t yields

∂Vm/∂t = (βCm)⁻¹ ∇·(σ∇Vm) − Cm⁻¹ (Iion + Istim).

With A and x as introduced in section 2.2, and using forward Euler integration, we get the discretized formulation

x(t + Δt) = x(t) + Δt ((βCm)⁻¹ A x(t) − Cm⁻¹ i(t))
Here, i is the vector of the spatially discretized current densities Iion + Istim, just as x corresponds to Vm. Iion is calculated by one CRN model instance per node.
Generally, the ODEs of the cell model states (eq. 2) were integrated using the explicit forward Euler method. Thus, for a state variable η,

η(t + Δt) = η(t) + Δt f(t, η(t))

where Δt is the time step and f is given by the cell model. However, when f was of the form

f(Vm, η) = (η∞ − η)/τη

with the Vm-dependent steady-state value η∞ and rate variable τη, we used the Rush-Larsen method instead. It then reads

η(t + Δt) = η∞ − (η∞ − η(t)) exp(−Δt/τη)   (3)
The Rush-Larsen method is also explicit but more stable than the forward Euler method.
2.4 Error estimation
To quantify the accuracy of the heavily optimized OpenCL software, we compared the activation times of the node furthest from the stimulation area. A node was considered activated once Vm > 0 V.
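The activation-time criterion above amounts to scanning a node's sampled Vm trace for the first threshold crossing; a minimal sketch (function name and sampling layout are illustrative assumptions):

```c
/* Sketch: activation time of a node from a sampled Vm trace.
 * Returns the first sample time at which Vm exceeds 0 V, or -1.0 if
 * the node never activates. vm holds n samples taken every dt ms. */
double activation_time(const double *vm, int n, double dt)
{
    for (int i = 0; i < n; i++)
        if (vm[i] > 0.0)
            return i * dt;
    return -1.0;
}
```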
Similarly to [1], we estimated the overall simulation error with a relative root-mean-square (RRMS) measure

RRMS = √( Σt ‖xt − x̂t‖² / Σt ‖x̂t‖² )   (4)

where x̂t are the vectors of nodal Vm values at time t for a reference solution and xt the corresponding vectors for the solution whose error is to be estimated.
3.1 Data structures and distribution
We used 64-bit, IEEE 754 double-precision floating point numbers as the default data type for all real values. Natural numbers (e.g., indices) were stored in m-bit integer variables, where m was either 32 or 64 depending on the smallest native integer data type of all devices in the so-called OpenCL context. In our case, the context always included all devices that were used in the simulation. Temporal values were also represented as natural numbers, in nanoseconds.
Vectors were implemented as contiguous OpenCL buffers storing floating point values. A buffer in OpenCL is made completely accessible on a device in the context when it is referenced as a parameter to a so-called kernel (i.e., a function that can be run in parallel on the device). A sub-buffer is a buffer object that points to a block of memory inside another buffer, thus eliminating the need for the driver to copy the full buffer for a kernel invocation on the sub-buffer. For every vector buffer, we defined one sub-buffer per device, splitting it into distinct slices of equal size.
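The slicing logic can be sketched in plain host-side C. The hypothetical helper below computes the origin and size of each device's slice (in the real software, such values would be passed to clCreateSubBuffer as a cl_buffer_region); it also handles the case where N is not divisible by the device count, in which slice sizes differ by at most one element:

```c
/* Sketch: split an N-element vector into one slice per device, as done
 * for the per-device sub-buffers. origin[d] and size[d] describe the
 * slice of device d; sizes differ by at most one element. */
void slice_vector(int N, int num_devices, int *origin, int *size)
{
    int base = N / num_devices, rem = N % num_devices;
    int off = 0;
    for (int d = 0; d < num_devices; d++) {
        size[d] = base + (d < rem ? 1 : 0);   /* spread the remainder */
        origin[d] = off;
        off += size[d];
    }
}
```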
The sparse quadratic finite differences system matrix was stored in the ELL format. The ELL format represents a matrix A ∈ ℝN × M with at most M non-zero elements per row as a tuple (K, V). Every row of K ∈ ℕN × M contains the column indices of all non-zero entries from A in that row, filled up with zeros if the row contains fewer than M non-zeros. V ∈ ℝN × M contains the non-zero values of each row, also filled up with zeros.
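The conversion into the ELL pair (K, V) can be sketched as follows, again using a dense input matrix for clarity (the function name and signature are illustrative):

```c
/* Sketch: convert a dense N x N matrix into the ELL pair (K, V) with
 * at most M non-zeros per row; rows with fewer non-zeros are padded
 * with column index 0 and value 0.0, as described above. */
void dense_to_ell(const double *A, int N, int M, int *K, double *V)
{
    for (int i = 0; i < N; i++) {
        int k = 0;
        for (int j = 0; j < N && k < M; j++) {
            if (A[i * N + j] != 0.0) {
                K[i * M + k] = j;
                V[i * M + k] = A[i * N + j];
                k++;
            }
        }
        for (; k < M; k++) {              /* zero padding */
            K[i * M + k] = 0;
            V[i * M + k] = 0.0;
        }
    }
}
```

The zero padding is harmless in later multiplications because the padded V entries are 0, regardless of which x element the padded index 0 selects.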
The dense matrices K and V were each stored row after row in one contiguous buffer in host memory. For each matrix, only one sub-buffer was made available per device. Sub-buffer sizes were again equal.
3.2 Matrix-vector multiplication
Matrix-vector multiplication of the form b = Ax with A ∈ ℝN × N and x, b ∈ ℝN was implemented in an OpenCL kernel computing, for a row i ∈ [1, N],

bi = Σj=1…M Vij · xKij
with M, V and K as defined above. These kernels were launched in parallel with N instances. Since the values in Kij were not restricted to device-local indices, x had to be readable in its entirety from every device, while only the device-local sub-buffers of b, V, and K were needed. Each kernel instance could thus set i to its so-called “global ID”, i.e., its thread ID on the device.
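A serial C stand-in for one kernel instance makes the per-row computation explicit; in the actual OpenCL kernel, i would come from get_global_id(0) instead of being a parameter:

```c
/* Serial C sketch of the per-row OpenCL kernel: each of the N kernel
 * instances computes one entry b_i = sum_j V[i][j] * x[K[i][j]].
 * Padded ELL entries are harmless because their V value is 0. */
void ell_spmv_row(int i, int M, const int *K, const double *V,
                  const double *x, double *b)
{
    double sum = 0.0;
    for (int j = 0; j < M; j++)
        sum += V[i * M + j] * x[K[i * M + j]];
    b[i] = sum;
}
```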
3.3 Cell model ODE kernels
All ODEs for one cell were integrated within one OpenCL kernel with the methods described in section 2.3. Since the cells were independent of each other for this step, these kernels were run with N parallel instances as well.
State variables from all cell models were stored in a global vector. The slice relevant to a cell was completely copied to private memory at the beginning of a kernel and restored at the end, coalescing the global memory access.
To improve calculation speed, any intermediate values that depended on Vm but not on any state variable were tabularized beforehand for the range −0.2 V ≤ Vm ≤ 0.2 V with 4000 steps. This especially applied to Rush-Larsen based calculations (eq. 3): since Δt was constant in this implementation, the whole exponential term could be tabularized.
4.1 Simulation setups
Our simulation setups were inspired by a previously published N-version benchmark of electrophysiological simulation software [4]. The simulation setups and parameters were largely the same. However, we used the CRN atrial cell model with the initial values from [3]. Our variations of the geometry are detailed in table 1. Variations a, b, and c corresponded directly to the ones used in the benchmark. Geometries d and e were elongated versions of geometry c.
| ID | Length (mm) | Spatial resolution (mm) | Nodes |
| ID | Device(s) | Parallelization |
| MPI | CPU | OpenMPI (all available cores) |
| CLCPU | CPU | OpenCL (all available compute units) |
| CLGPU | 2 GPUs | OpenCL (all available compute units) |
Table 2 shows the parallelization variants we compared. CTL and MPI used our existing software acCELLerate [5]. CLCPU and CLGPU used the newly written software on the CPU and both GPUs, respectively, using the methods and implementations described above. We compared accuracy and speed against acCELLerate as the latter was itself benchmarked in [4]. Both programs used the same finite differences system matrix. Tabularization, as described in section 3.3, was also done in both programs. For all setups, we simulated 200 ms at a time step of 0.01 ms.
All tests were carried out on a 2013 Apple MacPro with an Intel Xeon E5-2697v2 CPU and two AMD FirePro D700 GPUs. The CPU had 12 physical cores of 2.7 GHz, but provided 24 OpenCL compute units due to HyperThreading. For each GPU, the OpenCL driver reported 32 compute units with 150 MHz.
We compared CLCPU and CLGPU against CTL on geometries a–c. For the coarsest geometry (a), activation occurred at 38.45 ms for CTL and at 38.54 ms for both CLCPU and CLGPU (0.09 ms or 0.23% later). For geometry b, activation occurred at 27.96 ms for CTL and at 27.97 ms for CLCPU and CLGPU (0.01 ms or 0.04% later). In geometry c, activation occurred at 26.66 ms for CTL and again 0.01 ms (i.e., one time step) or 0.04% later.
Table 3: | Geo | t = 15 ms | t < AT | t < 200 ms |

Table 4: runtime, with speed-up against CTL in parentheses
| Geo | CTL | MPI | CLCPU | CLGPU |
| a | 24.2 | 4.3 (5.61) | 9.6 (2.53) | 21.9 (1.10) |
| b | 324.3 | 34.6 (9.37) | 24.6 (13.19) | 26.8 (12.11) |
| c | 2570.2 | 289.0 (8.89) | 187.6 (13.7) | 743.8 (3.46) |
| d | 6366.9 | 684.8 (9.30) | 431.1 (14.76) | 354.6 (17.80) |
| e | 12732.2 | 1352.6 (9.41) | 827.5 (15.39) | 558.9 (22.78) |
Table 3 shows the RRMS error (eq. 4) of CLGPU against CTL for a single time step with an active excitation front (t = 15 ms), for the time span until activation, and for t from 0 ms to 200 ms, the latter two in steps of 1 ms. Since the differences between CLCPU and CLGPU were extremely small, the shown results compare only CTL with CLGPU.
Table 4 shows overall simulation runtimes and speed-ups against CTL for the different setups. For all but one of the cases larger than geometry a, the OpenCL version improved on the OpenMPI parallelization. In the one exceptional case (c), the OpenCL implementation was noticeably slower on the GPU, while no similar behavior was seen on the CPU. In this one case, the simulation was in fact faster on one GPU than on two.
With the exception of geometry c, the larger the number of nodes, the higher the speed-up factors achieved by the OpenCL parallelizations. This behavior was especially visible for the GPU case, where the speed-up factor improved from 12.11 to 22.78, while the same implementation on the CPU improved from 13.19 to 15.39.
We implemented and evaluated an OpenCL-based mono-domain simulation software for cardiac tissue. Simulation speeds improved compared to our OpenMPI-based implementation. However, we did not see speed-ups on GPUs as drastic as those described in [1], even though the GPUs we used had a higher core speed.
We achieved higher speed-ups with more nodes on the GPUs. This can be attributed to the improvements from the parallel integration of the many differential equations outweighing the synchronization overhead (which grows by only one 8-byte value per node). Setup c, however, showed that this behavior is not monotonic and that there are cases where distribution across two GPUs is not beneficial.
While accuracy was high for most geometries, excitation propagation was noticeably slower in the coarse geometry a. While the delay of the activation time of the furthest node was relatively small (0.23%), it was nine times the time step. Together with the steep wave front, this led to comparatively large relative errors throughout the simulations when comparing corresponding time steps. Comparing the activation times with those of geometries b and c, however, shows that the error caused by the coarse spatial resolution itself is much larger.
The developed implementation only uses one cell model for the whole tissue domain. Using multiple cell models would enable simulation of heterogeneous tissue.
OpenCL provides further features that we have not used, such as heterogeneous computing, i.e., using CPU and GPU in parallel. A further speed-up seems possible, since only the vector of transmembrane voltages (i.e., 8 MiB for one million nodes) needs to be synchronized at every time step, and data transfer rates between the devices are upwards of 200 GiB/s.
Conflict of interest: Authors state no conflict of interest.
Informed consent: Informed consent has been obtained from all individuals included in this study.
Ethical approval: The research related to human use complied with all relevant national regulations and institutional policies, was performed in accordance with the tenets of the Helsinki Declaration, and was approved by the authors’ institutional review board or equivalent committee.
[1] Oliveira RS, Rocha BM, Amorim RM, et al. Comparing CUDA, OpenCL and OpenGL implementations of the cardiac monodomain equations. LNCS 2012; 7204: 111–120.
[2] Clayton RH, Bernus O, Cherry EM, et al. Models of cardiac tissue electrophysiology: progress, challenges and open questions. Prog Biophys Mol Bio 2011; 104(1-3): 22–48.
[3] Courtemanche M, Ramirez RJ, Nattel S. Ionic mechanisms underlying human atrial action potential properties: insights from a mathematical model. Am J Physiol Heart Circ Physiol 1998; 275: 301–321.
[4] Niederer SA, Kerfoot E, Benson AP, et al. Verification of cardiac tissue electrophysiology simulators using an N-version benchmark. Phil Trans R Soc A 2011; 369: 4331–4351.
[5] Seemann G, Sachse FB, Karl M, et al. Framework for modular, flexible and efficient solving the cardiac bidomain equation using PETSc. J Math Indust 2010; 15(2): 363–369.
© 2015 by Walter de Gruyter GmbH, Berlin/Boston
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.