CBSF: A New Empirical Scoring Function for Docking Parameterized by Weights of Neural Network

: A new CBSF empirical scoring function for the estimation of binding energies between proteins and small molecules is proposed in this report. The final score is obtained as a sum of three energy terms calculated using descriptors based on a simple counting of the interacting protein-ligand atomic pairs. All the required weighting coefficients for this method were derived from a pretrained neural network. The proposed method demonstrates a high accuracy and reproduces binding energies of protein-ligand complexes from the CASF-2016 test set with a standard deviation of 2.063 kcal/mol (1.511 log units) and an average error of 1.682 kcal/mol (1.232 log units). Thus, CBSF has a significant potential for the development of rapid and accurate estimates of the protein-ligand interaction energies.

At the same time, machine learning techniques are able to nd connections between input and output data that can be explicitly represented by a simple functional form, e.g. linear correlation (Fracchia, Frate, Mancini, Rocchia, & Barone, 2018;Sander, 2014). The weights of these trained models may be directly used as constant coe cients in the corresponding calculations if the data providing as variables is of the same type as those used for the model training. In this regard, it is appealing to design a neural network for predicting protein-ligand binding energies, the trained version of which amounts to a practically usable mathematical formula. Such approach possesses important advantages: the resulted expression is essentially an empirical scoring function, and empirical scoring functions are very fast and allow a rapid screening of a vast array of complexes (Huang et al., 2010), and usage of the neural network for producing of optimal constants for empirical soring functions holds potential for the development of highly accurate models.
In this paper we propose a new empirical scoring function, which is based on counting the proteinligand interacting atom pairs as descriptors and contains certain weights retrieved from pre-trained neural networks as constants. While the descriptors de ned by the atom pair interactions are not uncommon in the eld (Ballester & Mitchell, 2010;Durrant, Friedman, Rogers, & McCammon, 2013;Durrant & Mccammon, 2011;Durrant & McCammon, 2010;Sotri er, Sanschagrin, Matter, & Klebe, 2008;Wójcikowski, Ballester, & Siedlecki, 2017) because of their simplicity, they also have been criticized for their insensitiveness to the ligand poses and to the protonation states of the molecules (Gabel, Desaphy, & Rognan, 2014). Distance dependent terms and atom types as in Open Babel (Boyle et al., 2011) were incorporated into the model to address these de ciencies.
Several probe scoring functions were tested on 285 complexes of CASF-2016 benchmark set (Su et al., 2019), and the most accurate scoring function named CBSF reproduces the reference binding energies with a standard deviation of 2.063 kcal/mol (1.511 log units) and Pearson correlation coe cient (R) of 0.718. Therefore, the accuracy of our scoring function is comparable or higher than for most other known methods, both classical (empirical, force-eld and knowledge-based, R ∼ 0.4-0.7(Y. Li et al., 2014;Su et al., 2019)) and machine-learning based (R ∼ 0.6-0.8)(H. Li et al., 2018;Y. Li et al., 2014;Su et al., 2019). We believe that CBSF might nd use in molecular docking software for this reason. The implemented approach to empirical scoring function based on the neural network weights might also be of interest to further developments in the eld.

Methods . Datasets and Preprocessing of Data
Structures of protein-ligand binding complexes and their corresponding binding a nities were obtained from PDBBind database version 2018 (Liu et al., 2015(Liu et al., , 2017. A total of 4128 complexes were chosen from the "rened" subset for the training set of the neural networks while the CASF-2016 core set(Y. Li et al., 2014;Su et al., 2019) consisting of 285 complexes was used to test and evaluate the performance of the neural networks and the corresponding scoring functions. The criteria of the selection of the complexes for the training set is provided below.
The cartesian coordinates of the protein atoms were extracted from the PDB les containing the structure of binding pockets, all water molecules were removed. The coordinates of the ligand atoms were read from the SDF les. The protein-ligand atom pairs and total numbers of atoms in ligands were used as descriptors for the scoring functions and the neural networks. The protein-ligand atom pair was initially de ned as two atoms within 5 Å and such atom pairs were fed into the neural networks. At the same time, the neural networks were just the supplementary tools in the development of the scoring functions and one of their main tasks was to determine better individual cuto s for di erent types of atom pairs. Thereby the selection of atom pairs for the scoring functions was processed according to the cuto s produced by means of the neural networks, these values can be found among the project's les available at https://github.com/rsyrlyb/CBSF. We did not take into account hydrogen atoms except those connected to oxygen atoms or nitrogen atoms of amino groups. Atom types were assigned by the Open Babel software (Boyle et al., 2011). The atom pair labels were de ned by concatenating the atomic types of the contributing atoms sorted alphabetically with the semicolon symbol used as a separator, e.g. 'C+;C3'.
The input data for the neural networks was generated as follows. The descriptors collected from a single protein-ligand complex were initially organized in the matrix form (Figure 1), where each column contains all protein-ligand distances for a certain protein-ligand atom pair type. Complexes that are not matching with the complexes of the test set and with no more than 130 atom pairs of the same type were used for the model training (that is, the input matrices had no more than 130 rows), which resulted in the nal training set of 4128 complexes. The resulting matrices were broadcasted to the same shape and stacked into a three-dimensional table (number of samples × number of atom pair types (605) × 130). Further, atom pair types encountered in molecules no more than 10 times were separated into additional matrix of lower dimensionality (number of samples × number of rare atom pair types (474) × 10) for faster training of the neural networks. The obtained distances were normalized by dividing their values by 5. In addition to this 3D structures, a matrix containing the total numbers of atoms in ligands and target values of the binding energies (in kcal/mol) was used in the training of the neural networks.

. Scoring Function
The proposed scoring function is based on three terms: (1) the number of atom-atom pairwise interactions selected within a certain cuto distance, ( Figure 2); (2) number of interactions per atom of the ligand (further denoted as the density of interactions), and (3) correction that takes into account the atomic pair types. The overall free energy change according to our scheme is expressed as: where ∆G bind is the predicted binding free energy and E − E are the terms introduced above.

Figure 2:
Dependence of the experimental binding free energy on the number of protein-ligand atom pairs selected within 5 Å cuto , the outliers were removed using the elliptic envelope approach. Colors in the plots represent the density of the points. A 95% con dence interval for the slope and intersection are presented in the gure.
Calculation of E . The term E is based on the linear dependence between the number of protein-ligand atom pairs and the binding free energy. The scoring function ∆G is quite sensitive to this descriptor, and thus the distance cuto s used for selecting the atom pairs are very important. The individual cuto s for each atom pair type (605 in total) were determined using the neural network and applied in the scoring function. In a basic version of the scoring function: where a and b are coe cients of the linear equation, Np is the number of atom pairs, A and B are the atoms at the ligand (L)-protein (P) interface, R AB is the interatomic distance between atoms A and B, R (T A ; T B ) is the distance cuto for the atom types of A (i.e. T A ) and B (i.e. T B ). As a further modi cation of the method, the protein-ligand interatomic distances were divided into a certain number of intervals (K) and the pairs of (a , b ) parameters were assigned individually to each interval. The atom pairs from the intervals sharing the same sequential index are processed using the same (a , b ) parameters. The nal equation is: where K is the number of intervals, N p,i is the total number of atom pairs from the i-th interval, The upper limits of the intervals with the sequential number i make up an array of distances d i = (d i , d i , . . . , d i ), the lower limits are determined by the array of the preceding intervals i-1, the rst intervals have no lower limits. The exact same intervals with their characteristic arrays were applied for calculations of the rest two terms, as well. We explored the performance of several models that di er only in the numbers of used intervals (K = 1-4), which will be referred to as the basic scoring function (SF1), SF2, SF3 and CBSF (SF4), respectively.
Calculation of E . The term E is taking into account the buriedness of ligand as the ratio of atom pairs to the total number atoms on the ligand. This term is calculated in a similar manner as the previous term: where a ,i are the tting constants assigned to the i-th intervals and D i is the density of interactions calculated according to formula: where N L is the total number of atoms in the ligand. Equation 7 echoes the approach reported in (Spitzer, Cleves, Varela, & Jain, 2014), where the buriedness was measured by taking the ratio of near-ligand protein atoms to the total number of heavy atoms on the ligand.
Calculation of E . The term E is calculated as a sum of contributions from each atom pair in the complex depending on their types: where n ij is the number of atom pairs of j-th type in the i-th interval, w ij is the distance-dependent energy contribution assigned to the j-th atom pair in the i-th interval. All required constants of the scoring functions, such as d i , (a ,i , b ,i ), a ,i , and w ij , were determined by training of the convergent neural networks.

. Architecture of the Neural Networks
Several neural networks were developed to generate the scoring functions with di erent K value. Training and the testing of the neural networks were performed using Keras (Chollet & others, 2015) with Tensor ow (Abadi et al., 2016) as the back-end using in-house software implemented in Python 3.7 programming language.
Neural network designed for the basic scoring function The architecture of the neural network for obtaining of the parameters of the scoring function with one interval of distances (K = 1) is presented on Figure 3. The neural network consists of two blocks, preliminary and main. The preliminary block contains an input layer, followed by a noise layer adding Gaussian noise with a standard deviation of 0.008 to the input data to prevent over tting. The level of noise has been chosen by testing the neural network using di erent values of standard deviation of the noise distribution based on comparison of performances of the models. The main block consists of layers 2-5 described below. Layer 2 determines the relevant cuto s for the atom pair types. The layer was trained to nd a probability whether the distances between atoms are close enough to be considered. The output of this custom layer with custom hard sigmoid activation function F: where w ( ) j is the weight of layer 2 assigned to the corresponding column of the input data frame. The hard sigmoid function in this study has a transition zone of 0.1 Å, according to the features of the activation function: Therefore, the weights w ( ) j are the maximum possible distances between the atoms making up the atom pair. These values multiplied by 5 (denormalization) form the array of upper limits of distances d discussed earlier.
The output values are summarized for counting of the total numbers of atom pairs of di erent types n j in the complex. These values are collected into a single table with rows corresponding to separate complexes. The total number of atom pairs Np in the complex is found as a sum of elements of the row. This value is passed to the regression layer 3 calculating the term E . Besides, Np is used for the calculation of the density of interactions D which is the input for the regression layer 4 designed for the obtaining of the term E . Both trainable weight and biase were used in the layer 3 and trainable weight in layer 4, the weight and the bias of the layer 3 are relevant to the constants (a , b ) of the scoring function (eq. 2) and the weight of the layer 4 is a , constant (eq. 7).
Parallelly, the term E taking into account the types of atom pairs is calculated by the regression layer 5. As distinguished from the layers 3-4, the layer 5 takes multiple values as inputs and does not have a bias. The weights of the layer 5 form increments w j assigned to the atom pair types (eq. 9). The last operation of the neural network is summarizing of the outputs of the layers 3-5.
Neural networks for scoring functions with several cuto distance intervals Neural networks for the scoring functions with more than one intervals of distances consist of the same blocks as in the basic model, but the number of the main blocks is equal to the number of intervals and the nal predicted energy is the sum of the outputs of all these blocks (Figure 4). Another di erence is that the layers 3-5 of the main blocks process the atom pairs belonging to the corresponding distance intervals only. This was organized in the following manner: the layers 2 de ne the upper limits d of the ranges (as described in the basic model) and select all atom pairs within the interval (0, d). Output tables of the layers 2 of two subsequent main blocks i-1 and i contain the atom pairs from ranges (0, d i− ) and (0, d i ) respectively, and therefore, the atom pairs belonging to the interval (d i− , d i ) which have to be processed by the layers 3-5 of the main block i are found by the pointwise subtraction of the tables. Neural networks consist of 2-4 main blocks are denoted as NN-SF2, NN-SF3 and NN-SF4 further.

. Training of the neural networks
Initial weights of the neural networks were set as follows: 1) Minimal and maximal observed distances for each type of atom pair were retrieved from the preprocessed data. The obtained ranges of distances were normalized and divided into K equal intervals, the upper boundaries of these intervals were used as initial weights of the layers 2 of the main blocks. Figure 2, were used as the weight and bias of the layer 3 of the main blocks. 2. Weights of the layers 5 were set to zero.

Coe cients (a , b ) initially estimated using SciPy library and presented in
The trainings were carried out in three stages: the weights of the layers 3-4 were optimized in the rst step, the other weights were frozen. The weights of the layers 5 were unfrozen in the second step, and optimizations of all weights, including the weights of the layers 2, were provided in the step three. K-fold cross-validation approach was used for each step of the training with k=9. The numbers of epochs for each round of learning were 50, 50 and 40 in the base model and 20, 20, 15 in the models with a few intervals for the stages 1-3 respectively.The Adam optimizer was used to minimize the mean squared error during optimization.

. Evaluation methods
Mean absolute error (MAE) and median error (median) were calculated using corresponding functions of NumPy (van der Walt, 2011). Estimation of scoring power was carried out by calculating of Pearson's correlation coe cient R and Standard deviation in tting (σ). Ranking power is presented by means of Spearman correlation coe cient (SP), Kendall correlation coe cient (τ) and predictive index (PI). Ready scripts for these calculations were taken from CASF-2016 benchmark (Su et al., 2019).

Results and Discussion . Comparative analysis of the composition of the suggested scoring functions
Our method is somewhat similar to the scoring functions based on counting of protein-ligand atom pairs, e.g. NNScore (Durrant et al., 2013;Durrant & Mccammon, 2011;Durrant & McCammon, 2010), SFCscore (Sotri er et al., 2008) and RF-Score (Ballester & Mitchell, 2010;Wójcikowski et al., 2017). A brief review of such methods can be found in (Guedes, Pereira, & Dardenne, 2018). Furthermore, some of these scoring functions, such as RF-Score and NNScore, were developed by means of the machine-learning techniques. However, there are signi cant di erences between these methods and our approach. In this section, we will discuss the main ideas that distinguish our scoring function from the previous works.
The main di erence is related to the cuto s for selecting atom pairs. Typically, a single cuto is used, which is independent of the identity of interacting atoms. This approach could be augmented by introducing several equally separated atom type-independent cuto s that de ne distance intervals with di erent free energy contributions. Types of the atom pairs tailed within the determined ranges of distances have a direct crucial impact to the resulting output of the scoring function. Herein, we use an opposite approach, in which distance cuto s were dependent on the type of atom pair, but it was assumed that all types of atom pairs make approximately the same contribution to the predicted energy value. The resulting distances are divided into K ranges in the K-interval versions of the scoring functions, but again, the interval breaks are individual for each type of atom pair. For example, in the scoring function SF5, the cuto breaks of 2.55 Å, 3.69 Å, 4.61 Å were used for the 'C3;C3' atom pair and 2.76 Å, 3.75 Å, 4.67 Å for 'C3;Car'. Maximum possible equality of the energy contributions of the atom pairs from the intervals with the same sequential number was requested as the main goal during the choice of the borders of the ranges in our model training. Therefore, equations 2 and 3 for the calculation of terms E and E take the total number of all atom pairs from the intervals with the same number as the variables, and the constants (a ,i , b ,i ) or a ,i were independent of the types of atom pairs.
The types of atom pairs are directly included into the scoring function via the term E . This term echoes some of the features of the other methods mentioned before, but in our model this term is estimated as a simple linear function, while similar descriptors are processed using nonlinear operations in the related machine-learning based scoring functions. The term E , to a greater extent, was inspired by the knowledgebased scoring functions (Dittrich, Schmidt, P eger, & Gohlke, 2019;Mirzaie & Sadeghi, 2010;Zheng et al., 2011) in which the distance-dependent potentials are evaluated from continuous function based on an inverse Boltzmann approach. The di erence from these potentials, besides the method of their retrieving, is that the distance-dependent increments w ij from our method are assigned to the intervals of distances. Another major di erence is that the values of the most increments w ij are close to zero, since, as it was indicated earlier, the scoring functions were designed so that the term E has low weight. Figure SI1 shows the distribution of the values of w ij in the basic scoring function, the median and the standard deviation are -0.035 and 0.69 kcal/mol, respectively.
It should be noted here that both term E and term E are taking into account the impact of the atom pairs on the binding energy. The di erence is that E accounts this factor using average value of the potential assigned to all types of atom pairsa , meanwhile w j increments are essentially deviations of the full potentials of particular atom pair types from that average impact. Therefore, these terms can be merged as follows: where W ij = a ,i + w ij . On the other side, introducing of two separate terms improves accuracy of the neural network: this e ect likely has the same origin as the e ectiveness of the batch normalization implying normalizing of each neuron into zero mean and unit variance, since the median for w ij values is zero and |w ij |<|W ij | in general. Besides, such splitting can increase the speed of the calculations since the atom pair types with low w ij can be excluded from calculations of the term E .
An appealing feature of our scoring functions is their simplicity, as they are not overloaded with supplementary terms present in other empirical scoring functions (counting hydrogen bonds and rotational bonds, taking into account partial charges or electrostatic potentials and so on (Baek, Shin, Chung, & Seok, 2017;Guedes et al., 2018;Jain, 1996;R. Wang, Lai, & Wang, 2002)) and do not contain terms from third-party scoring functions (Pereira, Ca arena, & Dos Santos, 2016;Tanchuk, Tanin, Vovk, & Poda, 2016;C. Wang & Zhang, 2017). The term E , re ecting the buriedness of ligands (Oprea & Marshall, 2005;R. Wang et al., 2002), is the only conventional empirical term. But this factor is measured in a simple way: as the linear function of the ratio of the total number of atom pairs to the number of ligand atoms. High values of the term E are characteristic for ligands deeply buried in the protein structure.
The terms E are the main contributors in the scoring functions, as they accounted for 60-70% of the overall binding energy (Table S1). The terms E and E have approximately same values of 15-30%. There is no noteworthy di erence in the proportions of the developed scoring functions with multiple intervals, the in uence of the terms are close in all models. The second term of the base model, unlike in other models, has low contribution -3%.
Due to the regression layers 3-5, our models can be presented as simple mathematical expressions, unlike other scoring functions with hidden knowledge from the machine learning techniques. This simplicity makes the calculations using the proposed scoring function more transparent, convenient, and robust, because there is no need in loading of the neural network requiring the strict preprocessing (preparing of the input data of 2D dimensionality for each complex, in our case) and additional neural-network libraries.
As to drawbacks of the model, they are similar with problems of other methods based on counting of atom pairs. Structural data of high quality is required to ensure high accuracy of the method during both training and usage of the method and only 13% of complexes of the training set have resolution better that 1.5 Å currently. Besides, 161 atom pair types are represented by less than10 instances, 72 of them have interatomic distances ranging within 0.5 Å -it is hard or impossible to determine proper values of cuto s and increments in such cases.

. Performance of the scoring functions
Performance of developed scoring functions on the test set complexes comprising mean absolute error (MAE), median error, Pearson's correlation coe cient R, standard deviation (σ), Spearman correlation coe cient (SP), Kendall correlation coe cient (τ) and predictive index (PI) are presented in Figure 5. These results were compared with the assessment data on the known scoring functions reported in (Su et al., 2019) and obtained using the same benchmark.
The scoring power of all developed scoring functions is higher than for the most known scoring functions. So, even the basic version of the scoring function with a single set of distance cuto s predicts the binding energies with the standard deviation of 2.129 kcal/mol and Pearson's correlation coe cient of 0.698, and these results are more accurate than that of AutoDock Vina, GlideScore, DrugScore scoring functions and other well-established methods (Table SI2). The scoring ability of the method improves with increased number of intervals, but quickly reaches a limit due to increasing of the number of parameters for optimization: each new interval adds 2 * number of considered atom pair types + 3 variables. Excess of the number of variables over training samples leads to tremendous over tting. So, the di erence between the standard deviations of the basic model and the best SF4 scoring function is 0.066 kcal/mol only, their performances are very close, and in fact, the ranging power of SF1 is slightly better. The observed saturation of the model quality with increased number of cuto intervals was the reason for stopping at four intervals of distances for counting of the atom pairs. The resulting scoring function was named CBSF (contacts-based scoring function) and is considered by us as the main version of the presented scoring functions. The numerical assessment of the scoring power of CBSF (σ = 2.063 kcal/mol, R=0.718, SP=0.605) indicates a high accuracy of the method, and only ∆ vina RF scoring function (Su et al., 2019;C. Wang & Zhang, 2017)shows better scoring power. Considering the simplicity and accuracy of the scheme together with its straightforward implementation, CBSF de nitely overcomes the main shortcomings characteristic for other schemes based on counting of the number of atom pairs and therefore represents a great interest for practical use in relevant chemical software.
The performances of the convergent neural networks applied for the development of the scoring functions are presented in Figure SI2. The accuracy of the neural network corresponding to CBSF (σ = 1.618 kcal/mol, R=0.733, SP=0.626) is noticeably higher than that of CBSF, but the di erences are not critical. This discrepancy is due to the features of the hard sigmoid activation function (eq. 11) used in the neural network. The best predicted binding energies are obtained using NN-SF3 model (σ = 1.612 kcal/mol, R=0.740, SP=0.640).
CBSF particularly can be implemented in docking programs or to be applied for rescoring of poses generated by other docking packages displaying lower accuracy in the scoring. The parameters of the scoring functions, such as the cuto s or the distance dependent increments can be utilized in the searching algorithms. Besides, the approach used for developing of the suggested scoring functions is generally universal and can nd a use in further experiments in the eld.

Conclusion
The new empirical scoring function named CBSF for estimating of the binding a nity of protein-ligand complexes with known three-dimensional structure is developed. CBSF outperforms most of the known scoring functions, the standard deviation obtained during its testing on CASF-2016 is 2.063 kcal/mol, while Pearson's correlation coe cient is 0.713. All parameters and coe cients of the scoring function are found by means of the neural network; this approach allowed to make considerable simpli cations in the scoring function without harm to its accuracy. d Average Spearman correlation coe cient between the experimental binding data and computed binding scores as obtained on 57 clusters. e Average Kendall correlation coe cient between the experimental binding data and computed binding scores as obtained on 57 clusters. f Average predictive index between the experimental binding data and computed binding scores as obtained on 57 clusters.