A treecode based on barycentric Hermite interpolation for electrostatic particle interactions

Abstract: A particle-cluster treecode based on barycentric Hermite interpolation is presented for fast summation of electrostatic particle interactions in 3D. The interpolation nodes are Chebyshev points of the 2nd kind in each cluster. It is noted that barycentric Hermite interpolation is scale-invariant in a certain sense that promotes the treecode’s e ciency. Numerical results for the Coulomb and screened Coulomb potentials show that the treecode run time scales like O(N logN), where N is the number of particles in the system. The advantage of the barycentric Hermite treecode is demonstrated in comparisonwith treecodes based on Taylor approximation and barycentric Lagrange interpolation.


Introduction
Electrostatic e ects play an important role in determining biomolecular structure, function, and dynamics [4,16]. For example among a wide range of applications, biophysics researchers are interested in understanding the in uence of electrostatic e ects on protein folding and protein-ligand binding [30,31]. Electrostatic e ects are inherently multiscale and they pose an outstanding challenge in computational molecular biophysics [23]. Powerful computer hardware is certainly required, but advances in numerical algorithms are also necessary to achieve the goal of accurate and e cient simulations of large-scale biomolecular systems.
Here we consider the basic problem of computing the electrostatic potential due to a set of charged particles in 3D space, where x i is a target site, x j is a source site of a particle with charge q j , and g(x, y) is an interaction kernel. Two important examples are the Coulomb potential and screened Coulomb potential, where κ is the inverse Debye length. Computing the potential ϕ(x i ) at all the target sites by direct summation requires O(N ) operations and many fast summation methods have been developed to reduce the cost including particle-mesh methods (e.g. PPPM [15], PME [8], MSM [13]) and tree-based methods (e.g. treecode [1], FMM [11], panel clustering [12]). However there is continuing interest in exploring di erent approaches to improve the performance of these methods, especially in the context of parallel simulations of large-scale biomolecular systems. Treecode algorithms are interesting due to their relatively simple structure and low memory consumption. In these algorithms, some of the particle-particle interactions are computed directly and others are computed by particle-cluster approximations. The resulting run time scales like O (N log N), where N is the number of particles in the system. Previously we developed a treecode based on Cartesian Taylor approximation (TTC) [18] and a kernel-independent treecode based on barycentric Lagrange interpolation (BLTC) [27]. Here we extend the latter method to obtain a treecode based on barycentric Hermite interpolation (BHTC); it interpolates the kernel and its rst partial derivatives in the three coordinate directions at Chebyshev points of the 2nd kind. As we explain below, barycentric Hermite interpolation is scale-invariant in a certain sense that promotes the e ciency of the treecode. Although the BHTC it is not strictly kernel-independent, for many kernels of practical interest the necessary derivatives can be easily computed. In this work we demonstrate the BHTC in computing electrostatic potentials due to Coulomb and screened Coulomb interactions, and numerical results show the improved performance of BHTC in comparison with TTC and BLTC.
The paper is organized as follows. Section 2 reviews polynomial interpolation in 1D. Section 3 discusses the extension to polynomial interpolation in 3D. Section 4 presents the barycentric Hermite treecode. Section 5 displays numerical results. Conclusions are given in Section 6.

Polynomial interpolation in 1D
We review polynomial interpolation in 1D including Lagrange form, barycentric Lagrange form, and barycentric Hermite form [25].

Lagrange form.
Given a function f (x) and a set of n + distinct nodes {x , . . . , xn}, the Lagrange form of the interpolating polynomial is where the Lagrange polynomials satisfy L k (x j ) = δ jk for j, k = : n. This form is well-known and appears in many textbooks, however Berrut and Trefethen [3] pointed out the advantages of the barycentric Lagrange form.
Barycentric Lagrange form. The interpolating polynomial can also be written in barycentric Lagrange form, where the removable singularity at the nodes is resolved by setting p(x k ) = f (x k ). Here we choose the nodes to be Chebyshev points of the 2nd kind, x k = cos θ k , θ k = kπ/n, k = : n.
In this case as shown in [22], the barycentric weights w k in (4) may be taken as Among the advantages of barycentric Lagrange interpolation at Chebyshev points are excellent approximation properties, low cost, and stability [3,25,14]. It also has a scale-invariance property in that the same weights (6) can be used for any interval [a, b], as long as the Chebyshev points are adapted to the interval; this is particularly useful in the context of the treecode [27].
Barycentric Hermite form. Next we consider Hermite interpolation [24,21,2]. Given function values and rst derivative values at the nodes, {f k , f k , k = : n}, the Hermite interpolating polynomial can be written in the form where the Hermite basis functions H k (x), H k (x) are polynomials of degree n + satisfying It can be veri ed that the polynomial (7) interpolates the prescribed function and derivative values at the nodes, p(x k ) = f k , p (x k ) = f k , k = : n. Following [24], the Hermite interpolating polynomial can be written in barycentric form, where again the removable singularity is resolved by setting p(x k ) = f k . When the nodes are the Chebyshev points of the 2nd kind (5), it was shown in [2] that the barycentric Hermite weights may be taken as For reference note that the Hermite basis functions are In practice the removable singularity at x = x k can be handled following [3] (see also [27]); if the argument x is closer to a node x k than some tolerance, this is agged and the correct values H k (x j ) = δ jk , H k (x j ) = are provided when evaluating the polynomial (7); details relevant to the treecode will be given further below.
Extension to a general interval. The Chebyshev points (5) are suitable for interpolation on the interval [− , ]. For a general interval [a, b], the function x → (a + b)/ + x(b − a)/ is used to map the Chebyshev points to the desired interval, but we must also consider what happens to the barycentric Hermite weights a k , b k in (10). First note that the barycentric expressions (9), (11) are scale-invariant; if the weights a k , b k have a common factor independent of k, then the factor can be cancelled and the expressions (9), (11) stay the same. In the case of general nodes, following the derivation of the barycentric Lagrange weights in [22], it can be shown that the barycentric Hermite weights satisfy Hence under the mapping [− , ] → [a, b], the weights b k acquire a factor ( /(b − a)) n , while the weights a k acquire a factor ( /(b − a)) n+ . Due to scale-invariance, the factor ( /(b − a)) n can be cancelled from a k , b k , so that for a general interval [a, b], the barycentric Hermite weights in (10) should be scaled so that a k → a k /(b − a), while the b k stay the same. The scale-invariance property of barycentric Hermite interpolation facilitates the treatment of di erent clusters in the treecode and this promotes the scheme's e ciency.

Extension to 3D
Now consider a function f (x) = f (x , x , x ), and a tensor product grid of Chebyshev points s k = (s k , s k , s k ) ∈ [− , ] , where k = (k , k , k ) is a multi-index with k = , . . . , n for = , , . Letting p(x , x , x ) denote the Hermite interpolating polynomial, there are eight interpolation conditions to be satis ed at each Chebyshev point s k , where the subscript denotes a partial derivative, for example p = ∂ p/∂x ∂x ; here and below we sometimes omit function arguments for clarity. The eight conditions (13) represent all combinations of rst partial derivatives in three variables. Using the properties of the basis functions (8), it follows that the Hermite interpolating polynomial is where , and superscript s indicates that f and its partial derivatives are evaluated at the Chebyshev points (s k , s k , s k ). It can be veri ed that (14) satis es the interpolation conditions (13).
This result is applied in the treecode to approximate a kernel g(x, y) where x, y ∈ R . In that case, holding x xed and interpolating with respect to y yields, where now the target variables (x , x , x ) are parameters in the kernel and its partial derivatives, the Hermite basis functions are evaluated at the source variables (y , y , y ), while the partial derivatives of the kernel are taken with respect to (y , y , y ) and are evaluated at the Chebyshev points (s k , s k , s k ).

Barycentric Hermite treecode
In a treecode the particles x i are partitioned into a hierarchy of clusters C, and the sum (1) is written as where is the interaction between a target particle x i and a cluster of source particles C = {y j }. The sum over C in (16) denotes a suitable set of clusters depending on the target particle. In our implementation, the clusters are rectangular boxes.
Particle-cluster approximation. The particle-cluster interaction (17) can be approximated using barycentric Hermite interpolation as follows. We substitute the kernel approximation (15) into the particlecluster interaction (17) and switch the order of summation to obtain the particle-cluster approximation, and y j = (y j , y j , y j ) are the source particles. Note that in the exact particle-cluster interaction (17) the target x i interacts with the source particles and charges {y j , q j }, while in the particle-cluster approximation (18) the target x i interacts with the Chebyshev grid points and modi ed charges {s k , q k }. It should also be noted that the Chebyshev grid points are mapped to the cluster C using the linear function mentioned above in the discussion of general intervals [a, b].
The particle-cluster approximation has two advantages. First, the operation count for the exact interaction (17) is O(Nc), where Nc is the number of particles in cluster C, while the operation count for the approximation (18) is O(n ), hence the approximation is faster when n << Nc. Second, the modi ed charges are independent of the target x i , so they can be precomputed and reused in di erent particle-cluster approximations.
Treecode algorithm. The treecode algorithm has two stages. Stage 1 starts by building a hierarchical tree of particle clusters. The root cluster is a rectangular box containing all the particles. The root is bisected along its long edges, yielding 2, 4, or 8 child clusters, and these are similarly bisected until a cluster has fewer than a user-speci ed number of particles N . A long edge is de ned to be any edge whose length exceeds max / √ , where max is the maximum edge length of the cluster. Once the tree is built, the modi ed charges (19) for each cluster are computed; the procedure is explained below. This completes stage 1.
In stage 2, the treecode cycles through the target particles and computes particle-cluster interactions starting at the root. Well-separated particle-cluster interactions are computed using the approximation (18). The criterion for being well-separated is r/R < θ, where r is the cluster radius, R = |x i − yc| is the distance between the target particle and the cluster center, and θ is a user-speci ed parameter; this is the multipole acceptance criterion (MAC). If the MAC fails, then the code descends to the child clusters, until the leaves are reached at which point the particle-cluster interaction is computed by direct summation (17). This is essentially the Barnes-Hut treecode algorithm [1], extended to barycentric Hermite interpolation; references can be consulted for further details [18,27].
Computation of modi ed charges. We explain some details of how the modi ed charges (19) for a given cluster are computed. Note that the Hermite basis functions appearing there can be written concisely in barycentric form (11) as where (y j , y j , y j ) are the source particles in the cluster and (s k , s k , s k ) are the Chebyshev points adapted to the cluster. First the temporary variables c j,k, , d j,k, are computed; there are three loops, one loop over source particles j = : Nc, one loop over Chebyshev points k = : n, and one loop over coordinate indices = , , ; this requires O(nNc) operations and a storage array of size O(nNc) which is reused for di erent clusters. To handle the removable singularity, we adopt the procedure in [3] (page 510), see also [27]; the code checks whether a source particle coordinate is close to a Chebyshev point; the condition is |y jl − s k | ≤ DBL_MIN, where DBL_MIN = 2.22507e-308 is the smallest positive IEEE double precision oating point number; if this condition is met, then the temporary variables c j,k, , d j,k, are adjusted to enforce the interpolation conditions (8). Having computed the temporary variables c j,k, , d j,k, , the code computes the Hermite basis functions in (20), and then the modi ed charges are computed as indicated in (19). For each cluster this process requires O(n Nc) operations and O(n ) storage. The modi ed charges are computed for each cluster when the tree is constructed and in practice this requires only a small fraction of the total treecode run time.
Relation to other methods. We brie y discuss the relation of the barycentric Hermite treecode (BHTC) to some other tree-based fast summation methods. Makino [19] proposed using pseudoparticles in a treecode or FMM to reproduce the multipole moments of the actual particles in a cluster; in some sense the Chebyshev grid points in the BHTC can be considered as pseudoparticles, although their charges are determined by interpolating the kernel and its derivatives, rather than moment matching. Fong and Darve [9] developed the black-box FMM which uses polynomial interpolation at Chebyshev points of the 1st kind to approximate cluster-cluster interactions, where the interpolating polynomial is expressed in terms of Chebyshev polynomials; by contrast, the BHTC uses Hermite polynomial interpolation at Chebyshev points of the 2nd kind to approximate particle-cluster interactions, where the Hermite interpolating polynomial is expressed in barycentric form. Finally note that the BHTC is not strictly kernel-independent in the same sense as the KIFMM [28,29] and bbFMM [9], which require only kernel evaluations, but for many kernels of practical interest the rst partial derivatives needed in BHTC are easily computed. The methods mentioned here have various similarities and di erences, and a detailed performance comparison is an interesting topic for future investigation.

Numerical results
The BHTC was programmed in C++ in double precision and compiled using the Intel compiler icpc with -O2 optimization. The computations were performed on the University of Wisconsin-Milwaukee Mortimer Faculty Research Cluster, where each node is a Dell PowerEdge R430 server with two 12-core Intel Xeon E5-2680 v3 processors at 2.50 GHz and 64 GB RAM. Parallel runs used OpenMP with up to 24 cores on a single node. The source code is available for download (github.com/Treecodes/barycentric-hermite-treecode). We start with a single particle-cluster interaction and proceed to the full treecode. The particle charges are random numbers in the interval [− , ] and the screened Coulomb potential uses κ = . for the inverse Debye length.

. A single particle-cluster interaction
Consider a target particle x = ( . , . , . ) and a cluster of N = randomly located source particles in the unit cube [ , ] ; this corresponds to MAC θ = . . The potential is computed at the target using Taylor approximation [18], barycentric Lagrange interpolation [27], and barycentric Hermite interpolation as described above. Figure 1 plots the error as a function of the degree parameter n = : : , where the exact value is computed by direct summation (17). Note that the degree parameter n has two di erent meanings; in the case of the Taylor approximation, the expansion is carried out at the cluster center yc = ( . , . , . ) and includes terms (x − yc) k satisfying k + k + k ≤ n, whereas in the case of the barycentric approximations, n is the degree of the interpolating polynomial in each direction. Bearing this in mind, Fig. 1 shows that the error decreases most rapidly for barycentric Hermite interpolation, followed by barycentric Lagrange interpolation and then Taylor approximation. Results for the Coulomb potential and screened Coulomb potential are very close. Note however that the three approximations have di erent cost, so next we consider the error and run time in the full treecode. for Taylor approximation, barycentric Lagrange interpolation, barycentric Hermite interpolation, (a) Coulomb potential, (b) screened Coulomb potential.

. Treecode results
We compute the electrostatic potential (1) for N charged particles randomly located in a cube. The maximum leaf size is N = . The error is measured by where ϕ d (x i ) is computed by direct summation and ϕ t (x i ) is the treecode approximation.
Run time versus error. Figure 2 compares the BHTC with treecodes based on Taylor approximation [18] and barycentric Lagrange interpolation [27]. Results are shown for the Coulomb and screened Coulomb potentials (a,b), with system size N=1e6. The run time is shown versus error for treecode MAC θ = . , . and degree n = 1:2:15 (T), n=1:2:9 (L,H). With these parameters, the Taylor and Lagrange treecodes reach error 1e-8, while the BHTC reaches error 1e-13. The direct sum run time is the red horizontal line (4306 s in (a), 8146 s in (b)), and all three treecodes are faster than direct summation. For both the direct sum and treecode, the run time for the screened Coulomb potential is roughly twice that for the Coulomb potential; this is attributed to the exponential function evaluation in the former. For a given error E, the Taylor and Lagrange treecodes have comparable run time, but the BHTC is faster than both, especially as the error decreases. Focusing on the BHTC, MAC θ = . is more e cient for error above 1e-9 and MAC θ = . is more e cient for error below 1e-9.
Error and run time versus system size. Figure 3 presents the BHTC error (a,b) and run time (c,d) for system size N=1e5, 1e6, 1e7, with MAC θ = . and degree n = : : . Results are shown for the Coulomb (a,c) and screened Coulomb (b,d) potentials. Since the direct sum for the largest system size N=1e7 is prohibitively expensive, in that case the error is computed at 2000 sample target particles and the run time is extrapolated from the smaller systems. Figure 3a,b shows that for a given degree n, the treecode error is nearly independent of system size N, and for a given system size N, the error decreases as the degree n increases. The errors for the two potentials are comparable. Figure 3c,d shows that for a given degree n, the treecode run time increases at a slightly superlinear rate consistent with O (N log N) scaling. For a given system size N, the treecode run time increases as the degree n increases, but for these parameters it remains below the direct sum run time. Speedup. Table 1 quanti es the speedup de ned as the ratio of run time for the direct sum (d) and BHTC (t). Results are shown for treecode parameters MAC θ = . and degree n = , yielding error E between 7.65e-6 and 2.59e-5. The speedup is similar for the Coulomb potential and screened Coulomb potential, and it increases with system size; the speedup is approximately 7× for N=1e5, 45× for N=1e6, and 400× for N=1e7.  Table 2 records the memory (MB) consumed by direct summation and the three treecodes (TTC, BLTC, BHTC); this was obtained using the Valgrind Massif tool (www.valgrind.org). Results are shown for the Coulomb potential with MAC θ = . , but the same results hold for the screened Coulomb potential and di erent θ. Table 2 shows how the memory consumption depends on the system size N and degree n; these parameters contribute as follows. First, the particle coordinates and charges are stored in arrays of size O(N); this memory is common to all methods and can be seen in the direct sum column in Table 2. Second, the treecodes require memory for the tree data structure. A certain portion of this memory is overhead common to all three treecodes and is O(N); this accounts for the increase between the direct sum column and the n = column in  Table 2. The treecodes di er however in the memory required for each cluster. The TTC stores cluster moments in arrays of asymptotic size n ; the factor arises from the sum k + k + k ≤ n. The BLTC stores modi ed charges in arrays of asymptotic size n ; this is due to the tensor product nature of the Chebyshev grid points. The BHTC stores modi ed charges in arrays of asymptotic size n ; the factor arises because there are eight modi ed charges at each Chebyshev grid point as shown in Eq. (19). Hence the treecode memory consumption is O(N) + O(Nn ), where the rst term is the particle memory and tree data structure overhead, and the second term is the cluster memory with a prefactor depending on the treecode version. These trends can be seen in Table 2. For degree n = all three treecodes use the same memory to within two decimal digits; in this case the O(N) particle/overhead memory dominates the O(Nn ) cluster memory. As the degree n increases, the TTC memory remains the same, while the BLTC memory increases, and the BHTC memory increases even more; in this case the O(Nn ) cluster memory eventually dominates the O(N) particle/overhead memory. We checked that for larger degree n, the TTC memory also shows the expected O(Nn ) increase. Note that even with the largest system size N=1e7 and degree n = , the BHTC still uses less than 7 GB of memory. The larger memory consumption of the BHTC in comparison with the BLTC and TTC is compensated by its better performance as shown in Fig. 2.  Table 3 records parallel results for the direct sum and BHTC using OpenMP with up to 24 cores on a single node. The system size is N=1e6 and the treecode uses MAC θ = . , degree n = , yielding error about 2.6e-5. As before, the run time for the screened Coulomb potential is about twice that for the Coulomb potential. Direct summation parallelizes very well; the parallel e ciency remains above 86% up to 24 cores. The treecode also parallelizes well considering its complexity; with 24 cores the parallel e ciency is 61% for the Coulomb potential and 68% for the screened Coulomb potential. With 1 core the treecode is 45× faster than direct summation, and even with 24 cores the treecode is still more than 32× faster than direct summation.

Conclusions
We presented a particle-cluster treecode based on barycentric Hermite interpolation (BHTC) for fast summation of electrostatic particle interactions in 3D. The interpolating polynomial matches the kernel and its rst partial derivatives in the three coordinate directions at a tensor product grid of Chebyshev points in each cluster. Barycentric Hermite interpolation is scale-invariant in a certain sense that facilitates the treatment of di erent clusters in the treecode and this promotes the scheme's e ciency. The treecode run time scales like O (N log N), where N is the number of particles in the system. We demonstrated the method for the Coulomb potential and screened Coulomb potential. Numerical results show that for these two kernels, the BHTC is more e cient than the Taylor treecode [18] and barycentric Lagrange treecode [27] especially in the high accuracy regime. The parallel e ciency of the BHTC was demonstrated using up to 24 cores on a single node.
There are several directions for future research. In principle the BHTC can be extended to interpolate higher-order derivatives, but it remains to be seen whether this is advantageous in practice. Although we demonstrated the method for electrostatic potentials, it can be extended to compute electrostatic forces arising from Coulomb and screened Coulomb potentials, as needed in molecular dynamics simulations. It would also be interesting to consider whether the BHTC is advantageous for more complicated kernels such as those arising in the Method of Regularized Stokeslets [5,6,20]. We plan to install the BHTC in the Table 3: Barycentric Hermite treecode parallel performance, random particles in a cube, system size N=1e6, treecode parameters MAC θ = . , degree n = , Coulomb potential (error E = 2.58e-5), screened Coulomb potential (error E = 2.59e-5), number of cores (nc), run time (s) is shown for direct sum and treecode (d, t), parallel speedup = d /dnc , t /tnc, parallel e ciency (PE %) = (d /dnc)/nc, (t /tnc)/nc, parallel treecode speedup (d/t treecode-accelerated boundary integral (TABI) Poisson-Boltzmann solver [10] and the Adaptive Poisson-Boltzmann Solver (APBS) [17] for electrostatics of solvated biomolecules. Work is also proceeding on a GPU implementation of the BHTC [26].