A topological approach for protein classification

Abstract Protein function and dynamics are closely related to protein sequence and structure. However, the prediction of protein function and dynamics from sequence and structure remains a fundamental challenge in molecular biology. Protein classification, which is typically done by measuring the similarity between proteins based on sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics. Persistent homology is a new branch of algebraic topology that has found success in topological data analysis across a variety of disciplines, including molecular biology. The present work explores the potential of using persistent homology as an independent tool for protein classification. To this end, we propose a molecular topological fingerprint based support vector machine (MTF-SVM) classifier. Specifically, we construct machine learning feature vectors solely from protein topological fingerprints, which are topological invariants generated during the filtration process. To validate the present MTF-SVM approach, we consider four types of problems. First, we study protein-drug binding using the M2 channel protein of influenza A virus. We achieve 96% accuracy in discriminating drug-bound and unbound M2 channels. Secondly, we examine the use of MTF-SVM for the classification of hemoglobin molecules in their relaxed and taut forms and obtain about 80% accuracy. Thirdly, the identification of all alpha, all beta, and alpha-beta protein domains is carried out using 900 proteins, with an 85% success rate. Finally, we apply the present technique to 55 classification tasks of protein superfamilies over 1357 samples and 246 tasks over 11944 samples, attaining average accuracies of 82% and 73%, respectively. The present study establishes computational topology as an independent and effective alternative for protein classification.


I Introduction
Proteins are essential building blocks of living organisms. They function as catalysts, structural elements, chemical signals, receptors, etc. The molecular mechanisms of protein functions are closely related to their structures. The study of the structure-function relationship is the holy grail of biophysics and has attracted enormous effort in the past few decades. The understanding of such a relationship enables us to predict protein functions from structure or amino acid sequence or both, which remains a major challenge in molecular biology. Intensive experimental investigation has been carried out to explore the interactions among proteins or between proteins and other biomolecules, e.g., DNAs and/or RNAs. In particular, the understanding of protein-drug interactions is of premier importance to human health. A wide variety of theoretical and computational approaches has been proposed to understand the protein structure-function relationship. One class of approaches is biophysical. From the point of view of biophysics, protein structure, function, dynamics and transport are, in general, dictated by protein interactions. Quantum mechanics (QM) is based on fundamental principles and offers the most accurate description of interactions among electrons, photons, atoms and even molecules. Although QM methods have unveiled many underlying mechanisms of reaction kinetics and enzymatic activities, they are typically too expensive computationally for large biomolecules. Based on classical physical laws, molecular mechanics (MM)74 can, in combination with fitted parameters, simulate the physical movement of atoms or molecules quite precisely for relatively large biomolecular systems like proteins. However, it can be computationally intractable for macromolecular systems involving realistic biological time scales. Many time-independent methods, such as normal mode analysis (NMA),12,57,71,96 elastic network model (ENM),3,49,68,99 graph theory61 and flexibility-rigidity index (FRI)81,82,107
are proposed to capture the features of large biomolecules. Variational multiscale methods26-30,102-104 are another class of approaches that combine atomistic descriptions with continuum approximations. There are well-developed servers for predicting protein functions based on three-dimensional (3D) structures67 or, if a 3D structure is not yet available, on models from the homology modeling (here homology is in the biological sense) of the amino acid sequence.91 Another class of important approaches, bioinformatical methods, plays a unique role in the understanding of the structure-function relationship. These data-driven predictions are based on similarity analysis. The essential idea is that proteins with similar sequences or structures may share similar functions. Also, based on sequential or structural similarity, proteins can be classified into many different groups. Once the sequence or structure of a novel protein is identified, its function can be predicted by assigning it to the group of proteins with which it shares similarities to a good extent. However, the degree of similarity depends on the criteria used to measure similarity or difference. Many measurements are used to describe the similarity between two protein samples. Typical approaches use either sequence or physical information, or both. Among them, sequence alignment can describe how closely two proteins are related.
Protein BLAST,63 ClustalW2,72 and other software packages can perform global or local sequence alignments. Based on sequence alignments, various scoring methods can provide a description of protein similarity.2,58 Additionally, sequence features such as sequence length and the occurrence percentage of a specific amino acid can also be employed to compare proteins. Many sequence based features can be derived from the position-specific scoring matrix (PSSM).95 Moreover, structural information provides an efficient description of protein similarity as well. Structure alignment methods include rigid, flexible and other methods. The combination of different structure alignment methods and different measurements, such as root-mean-square deviation (RMSD) and Z-score, gives rise to various ways to quantify the similarity among proteins. From structure information, different physical properties such as surface area, volume, free energy, flexibility-rigidity index (FRI),81,82,107 curvature,46,47 electrostatics,115 etc. can be calculated. A continuum model, the Poisson-Boltzmann (PB) equation, delivers quite accurate estimates of the electrostatics of biomolecules. There are many efficient and accurate PB solvers, including PBEQ,62 MIBPB,25,115 etc. Together with physical properties, one can also extract geometric properties from structure information. These properties include coordinates of atoms, connections between atoms such as covalent bonds and hydrogen bonds, molecular surfaces4,5,114 and curvatures.46,47,105 These various approaches reveal information at different scales, from local atom arrangement to global architecture. The physical and geometric properties described above add different perspectives for the analysis of protein similarity.
Due to advances in bioscience and biotechnology, biomolecular structure data sets are growing at an unprecedented rate. For example, the Protein Data Bank (PDB) has accumulated more than a hundred thousand biomolecular structures. The prediction of the protein structure-function relationship from such a huge amount of data can be extremely challenging. Additionally, an ever-growing number of physical or sequence features are evaluated for each data set or amino-acid residue, which adds to the complexity of the data-driven prediction. To automatically analyze excessively large data sets in molecular biology, many machine learning methods have been developed.31,48,70,76 These methods are mainly utilized for the classification, regression, comparison and clustering of biomolecular data. Clustering is an unsupervised learning method which divides a set of inputs into groups without knowing the groups beforehand. This method can unveil hidden patterns in the data set. Classification is a supervised learning method, in which a classifier is trained on a given training set and used to make predictions for new observations. It assigns an observation to one of several pre-determined categories based on knowledge from a training data set in which the label of each observation is known. Popular methods for classification include support vector machine (SVM),13 artificial neural network (ANN),75 deep learning,59 etc. In classification, each observation in the training set has a feature vector that describes the observation from various perspectives and a label that indicates to which group the observation belongs. A model trained on the training set predicts, from the feature vector alone, to which group a new observation with unknown label belongs. To improve the speed of classification and reduce the effect of irrelevant features, many feature selection procedures have been proposed.38 Machine learning approaches have been successfully used for protein hot spot prediction.37
The data-driven analysis of the protein structure-function relationship is complicated by the fact that the same protein may have different conformations, which possess different properties or deliver different functions. For instance, hemoglobins have a taut form with low affinity to oxygen and a relaxed form with high affinity to oxygen, and ion channels often have open and closed states. Different conformations of a given protein may only have minor differences in their local geometric configurations. These conformations share the same sequence and may have very similar physical properties. However, their minor structural differences might lead to dramatically different functions. Therefore, apart from the conventional physical and sequence information, geometric and topological information can also play an important role in understanding the protein structure-function relationship. Indeed, geometric information has been extensively used in protein exploration. In contrast, topological information has hardly been employed in studying the protein structure-function relationship.
In general, geometric approaches are frequently inundated with too much geometric detail and are often prohibitively expensive for most realistic biomolecular systems, while traditional topological methods often incur too much reduction of the original geometric and physical information. Persistent homology, a new branch of applied topology, is able to bridge traditional geometry and topology. It creates a variety of topologies of a given object by varying a filtration parameter, such as a radius or a level set function. In the past decade, persistent homology has been developed as a new multiscale representation of topological features. The 0-th dimensional version was originally introduced for computer vision applications under the name "size function"51,52 and the idea was also studied by Robins.90 The persistent homology theory was formulated, together with an algorithm, by Edelsbrunner et al.,43 and a more general theory was developed by Zomorodian and Carlsson.78,79,83,97,116 Often, persistent homology can be visualized through barcodes,20,56 in which the various horizontal line segments or bars are the homology generators which survive over filtration scales. Persistence diagrams are another equivalent representation.42 Computational homology and persistent homology have been applied to a variety of domains, including image analysis,7,17,53,84,93 chaotic dynamics verification,64,77 sensor networks,92 complex networks,60,69 data analysis,14,73,80,89,100 shape recognition1,41 and computational biology.36,55,65,85,86
Compared with traditional computational topology22,66,113 and/or computational homology, persistent homology inherently has an additional dimension, the filtration parameter, which can be utilized to embed some crucial geometric or quantitative information into the topological invariants.54,56 Recently, we have introduced persistent homology for the mathematical modeling and prediction of nanoparticles, proteins and other biomolecules.106,108 We have proposed molecular topological fingerprints (MTFs) to reveal topology-function relationships in protein folding and protein flexibility. More recently, we have introduced resolution based persistent topology.111,112 Most recently, we have developed new multidimensional persistence, a topic that has attracted much attention in the past few years,18,19 to better bridge geometry and traditional topology and achieve a better characterization of biomolecular data.109 We have also introduced the use of topological fingerprints for resolving ill-posed inverse problems in cryo-EM structure determination.110
The objective of the present work is to explore the utility of MTFs for protein classification and analysis. We construct feature vectors based on MTFs to describe unique topological properties of proteins in different scales, states and/or conformations. These topological feature vectors are further used in conjunction with the SVM algorithm for the classification of proteins. We validate the proposed MTF-SVM strategy by distinguishing different protein conformations, proteins with different local secondary structures, and proteins from different superfamilies or families. The performance of the proposed topological method is demonstrated by a number of realistic applications, including protein binding analysis, ion channel study, etc.
The rest of the paper is organized as follows. Section II is devoted to the mathematical foundations of persistent homology and machine learning methods. We present a brief description of simplices and simplicial complexes, followed by the basic concepts of homology, filtration, and persistence, in Section II.A. Three different constructions of simplicial complexes, the Vietoris-Rips complex, the alpha complex, and the Čech complex, are discussed. We use a sequence of graphs of channel proteins to illustrate the growth of a Vietoris-Rips complex and the corresponding barcode representation of topological persistence. In Section II.B, the fundamental concepts of the support vector machine are discussed. An introduction to the transformation of the original optimization problem is given. A measure of the performance of a classification model, known as the receiver operating characteristic, is described. Section II.C is devoted to the description of the features used in the classification and the pre-processing of topological feature vectors. In Section III, four test cases are presented. Cases 1 and 2 examine the performance of the topological fingerprint based classification method in distinguishing different conformations of the same protein. In Case 1, we use the structure of the M2 channel of influenza A virus with and without an inhibitor. In Case 2, we employ the structure of hemoglobin in its taut and relaxed forms. Case 3 validates the proposed topological method in capturing the difference between local secondary structures. In this study, proteins are divided into three groups: all alpha proteins, all beta proteins, and alpha+beta proteins. In Case 4, the ability of the present method to distinguish different protein families is examined. This paper ends with some concluding remarks.

II Materials and Methods
This section presents a brief review of persistent homology theory and illustrates its use for proteins. A brief description of machine learning methods is also given. The topological feature selection and construction from biomolecular data are described in detail.

II.A Persistent homology
Simplex Points, edges, triangles and their higher dimensional counterparts are defined as simplices. A simplicial space is a topological space constructed from finitely many simplices. A k-simplex is the convex hull of k + 1 affinely independent points,

σ^k = {λ_0 u_0 + λ_1 u_1 + ... + λ_k u_k | Σ_i λ_i = 1, λ_i ≥ 0},

where {u_0, u_1, ..., u_k} ⊂ R^n is a set of affinely independent points. Geometrically, a 1-simplex is a line segment, a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and a 4-simplex is a 5-cell (a four dimensional object bounded by five tetrahedra). An m-face of the k-simplex is defined as the convex hull of a subset of m + 1 of its vertices.
Simplicial complex A simplicial complex K is a finite collection of simplices satisfying two conditions: first, any face of a simplex in K is also in K; second, the intersection of any two simplices in K is a face of both. The highest dimension of the simplices in K determines the dimension of K.
Homology For a simplicial complex K, a k-chain is a formal sum of the form c = Σ_i c_i σ_i^k, where the σ_i^k are the k-simplices of K. For simplicity, we choose c_i ∈ Z_2. All the k-chains on K form an Abelian group, called the chain group and denoted C_k(K). A boundary operator ∂_k over a k-simplex σ^k = [u_0, u_1, ..., u_k] is defined as

∂_k σ^k = Σ_{i=0}^{k} [u_0, u_1, ..., û_i, ..., u_k],

where [u_0, u_1, ..., û_i, ..., u_k] denotes the face obtained by deleting the ith vertex of the simplex. The boundary operator induces a boundary homomorphism ∂_k : C_k(K) → C_{k-1}(K). A very important property of the boundary operator is that the composition vanishes, ∂_{k-1} ∘ ∂_k = 0. A sequence of chain groups connected by boundary operators forms a chain complex,

... → C_{k+1}(K) → C_k(K) → C_{k-1}(K) → ... .

Since ∂_k ∘ ∂_{k+1} = 0, the kth boundary group B_k = im ∂_{k+1} is a subgroup of the kth cycle group Z_k = ker ∂_k, and the kth homology group is defined as the quotient H_k = Z_k / B_k. The kth Betti number of the simplicial complex K is the rank of H_k, β_k = rank(Z_k) - rank(B_k). The Betti number β_k is a finite number, since rank(B_k) ≤ rank(Z_k) < ∞. Betti numbers computed from homology groups are used to describe the corresponding space. Generally speaking, the Betti numbers β_0, β_1 and β_2 are the numbers of connected components, tunnels, and cavities, respectively.

Filtration and persistence A filtration of a simplicial complex K is a nested sequence of subcomplexes of K,

∅ = K^0 ⊆ K^1 ⊆ ... ⊆ K^m = K.
With a filtration of the simplicial complex K, topological attributes can be generated for each member of the sequence by deriving the homology group of each subcomplex. Topological features that are long lasting through the filtration sequence are more likely to capture significant properties of the object. Intuitively, non-boundary cycles that are not mapped into boundaries too quickly along the filtration are considered likely to be involved in major features, or persistence. Equipped with a proper derivation of the filtration and a wise choice of threshold to define persistence, it is practicable to filter out topological noise and acquire the attributes of interest. The p-persistent kth homology group of K^i is defined as

H_k^{i,p} = Z_k^i / (B_k^{i+p} ∩ Z_k^i),

where Z_k^i = ker ∂_k(K^i) and B_k^{i+p} = im ∂_{k+1}(K^{i+p}). A well chosen p promises a reasonable elimination of topological noise.
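Dimension-zero persistence has a particularly direct algorithmic interpretation: every point is born as its own component at filtration value 0, and each edge that merges two components kills exactly one Betti 0 bar. The sketch below illustrates this with a Kruskal-style union-find over pairwise distances; it is a minimal illustration of the concept, not the JavaPlex implementation used in this work, and the function name and interface are our own.

```python
import math

def betti0_barcode(points, max_filtration):
    """Betti 0 barcode of a point cloud via Kruskal-style union-find.

    Every component is born at filtration value 0; each merge of two
    components kills exactly one bar at the merging edge length.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Process edges in order of increasing length, i.e., along the filtration.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    bars = []
    for d, i, j in edges:
        if d > max_filtration:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, d))   # one component dies at d
    bars.append((0.0, float(max_filtration)))  # one component persists
    return sorted(bars, key=lambda bar: bar[1] - bar[0], reverse=True)
```

For three collinear points at x = 0, 1 and 5, this yields one bar that persists to the end of the filtration and two bars that die at the merging distances 1 and 4.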
Vietoris-Rips complex Based on a metric space M and a given cutoff distance d, an abstract simplicial complex can be built. If two points in M have a distance shorter than the given distance d, an edge is formed between them. Consequently, simplices of different dimensions are formed and a simplicial complex is built. For point cloud data, the natural metric space based on Euclidean distance, or other metric spaces based on alternative definitions of distance, can be used to build a Vietoris-Rips complex. For example, any correlation matrix can be used directly to form a Vietoris-Rips complex. Figure 1 illustrates the growth of a Vietoris-Rips complex as d increases over the point set of C α atoms from the M2 chimera channel.
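The Vietoris-Rips construction described above is simple enough to sketch directly: place an edge between every pair of points within distance d, then admit every higher simplex all of whose edges are present. The following is an illustrative Python sketch (names are ours; the actual computations in this work use JavaPlex):

```python
import math
from itertools import combinations

def vietoris_rips(points, d, max_dim=2):
    """Vietoris-Rips complex of a point cloud at cutoff distance d.

    Returns a list: entry k holds the k-simplices as vertex tuples.
    """
    n = len(points)
    # 1-skeleton: an edge for every pair of points within distance d
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= d]
    edge_set = set(edges)
    complex_ = [[(i,) for i in range(n)], edges]
    # a k-simplex enters the complex whenever all of its edges are present
    for k in range(2, max_dim + 1):
        complex_.append([s for s in combinations(range(n), k + 1)
                         if all(e in edge_set for e in combinations(s, 2))])
    return complex_
```

For the four corners of a unit square, a cutoff of d = 1 yields the four sides and no triangles; raising d past the diagonal length fills in all six edges and all four triangles, mirroring the growth shown in Figure 1.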
There are many ways of constructing complexes other than the Vietoris-Rips complex, including the alpha complex, Čech complex, CW complex, etc. In the present work, we used the Vietoris-Rips complex, in part because of its intuitive nature and in part because of the moderate size of the systems we studied. The computational topology package JavaPlex98 was used for the computation of persistent homology. The results were represented in the form of barcodes.56

II.B Support vector machine
The support vector machine seeks a hyperplane that separates two classes of samples with a maximal margin. With slack variables ξ_i that tolerate misclassification, the soft-margin primal problem can be written as

min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^{n} ξ_i,

subject to y_i (w^T φ(x_i) + b) ≥ 1 - ξ_i and ξ_i ≥ 0, where x_i denotes the feature vector of the ith sample, y_i is the label of the ith sample, which takes the value of either 1 or -1, and C is a penalty coefficient for misclassified points. To handle linearly inseparable data, one maps the data into a higher dimensional space as φ : R^N → R^M with N < M. Since both the optimization problem and the scoring function of the classifier involve only dot products, φ does not need to be found explicitly. A kernel function K(x_i, x_j) is used to represent φ^T(x_i)φ(x_j).
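For concreteness, such kernel functions act directly on pairs of feature vectors and can be evaluated in a few lines. The sketch below gives two textbook choices (the function names are ours; the actual training in this work is done with LIBSVM):

```python
import math

def linear_kernel(x, y):
    """Linear kernel: K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: K(x, y) = exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

The Gaussian kernel equals 1 when the two vectors coincide and decays with their squared distance, which is what makes it a natural similarity measure between topological feature vectors.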
Commonly used kernel functions include the linear kernel, K(x_i, x_j) = x_i^T x_j, and the Gaussian kernel, K(x_i, x_j) = exp(-γ ‖x_i - x_j‖^2). In fact, the admissible kernels of the flexibility-rigidity index (FRI)81,82,107 work too. In this work, the Gaussian kernel is used, and a 5-fold cross validation was applied to search for optimized training parameters for problems with a large number of samples. To solve the optimization problem, the original problem is transformed into the corresponding Lagrange dual problem. For a constrained optimization problem min_x f(x) subject to g_i(x) ≤ 0 and h_j(x) = 0, the Lagrange function is defined as

L(x, α, λ) = f(x) + Σ_i α_i g_i(x) + Σ_j λ_j h_j(x),

where α and λ are Lagrange multipliers. The Lagrange dual problem is defined as max_{α,λ} θ(α, λ), where θ(α, λ) = inf_{x∈Ω} L(x, α, λ). The Lagrange function of the original optimization problem is

L(w, b, ξ, α, λ) = (1/2) w^T w + C Σ_i ξ_i - Σ_i α_i [y_i (w^T φ(x_i) + b) - 1 + ξ_i] - Σ_i λ_i ξ_i,

and the corresponding dual problem, obtained by eliminating w, b and ξ through the Karush-Kuhn-Tucker conditions, is

max_α Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.

The dual problem can be solved with the sequential minimal optimization (SMO) method.44

Receiver operating characteristic (ROC) The ROC is a plot that visualizes the performance of a binary classifier.45 A binary classifier uses a threshold value to decide the predicted label of an entry. In the testing process, we define the true positive rate (TPR) and false positive rate (FPR) for the testing set.
TPR = (number of positive samples predicted as positive)/(number of positive samples) and FPR = (number of negative samples predicted as positive)/(number of negative samples). The ROC space is a two dimensional space defined by points whose x coordinate represents the FPR and whose y coordinate represents the TPR. In the prediction process of a binary classifier, a score is assigned to a sample by the classifier. A test sample may be labeled as positive or negative depending on the threshold value used by the classifier. Corresponding to a certain threshold value, there is a pair of FPR and TPR values, which is a point in the ROC space. All such points fall in the box [0, 1] × [0, 1]. Points above the diagonal line y = x are considered good predictors and those below the line are considered poor predictors. If a point is below the diagonal line, the predictor can be inverted to become a good predictor. Points that are close to the diagonal line act similarly to random guessing, which implies a relatively useless predictor. The ROC curve is obtained by plotting FPR and TPR as continuous functions of the threshold value. The area between the ROC curve and the x axis represents the probability that the classifier assigns a higher score to a randomly chosen positive sample than to a randomly chosen negative sample, given that positives are set to have higher scores than negatives. The area under the curve (AUC) of the ROC is a measure of classifier quality. Intuitively, a higher AUC implies a better classifier.
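The ROC construction described above translates directly into code: sort the samples by classifier score, sweep the threshold from high to low while accumulating (FPR, TPR) points, and integrate the resulting curve by the trapezoidal rule. A minimal sketch follows (ties in scores are handled naively here, and labels are 1 for positive and 0 for negative; the names are ours):

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low over classifier scores.

    labels: 1 = positive, 0 = negative.
    Returns the (FPR, TPR) points of the ROC curve.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive sample above every negative one attains an AUC of 1, while one that alternates attains an AUC near the random-guess value of 0.5, matching the interpretation of the diagonal line above.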

II.C Topological feature selection and construction
In this work, algebraic topology is employed to discriminate proteins. Specifically, we compute MTFs through the filtration process of protein structural data. MTFs record the persistence of topological invariants during the filtration and are ideally suited for protein classification. To implement our topological approach in the SVM algorithm, we construct protein feature vectors from MTFs. We select distinguishing protein features from MTFs. These features can be both long lasting and short lasting Betti 0, Betti 1, and Betti 2 intervals. Table 1 lists the topological features used for classification, and a detailed explanation of these features is given below. The lengths and location values of bars are in units of angstrom (Å) for protein data.
• Feature 1: The length of the second longest Betti 0 bar indicates the onset value in the filtration at which the simplices in the corresponding complex form one connected component.
• Feature 2: Similar to Feature 1, this value indicates the onset in the filtration at which the simplices form two connected components. For more complicated point clouds, more features of this kind may be utilized.
• Feature 3: Geometrically, the total length of Betti 0 bars describes how compactly the points are located.
• Feature 4: This averaged Betti 0 bar length shows a similar property to that of Feature 3, but is independent of the number of atoms.
• Feature 5: This value is the filtration value at which the largest persistent loop is formed.
• Feature 6: The persistence of the longest Betti 1 bar reflects the size of the geometrically dominating loop.
• Feature 7: A Betti 1 bar with a length larger than the threshold is considered important, and this feature records the onset filtration value of such a long bar. In this work, a threshold of 1.5 Å is used for the α-carbon point cloud data of proteins.
• Feature 8: This feature records the average location of the midpoints of the Betti 1 bars which are longer than the threshold value discussed in Feature 7. This value shows the filtration value at which the loops are centered.
• Feature 9: This feature indicates the portion of alpha helices in a protein. Each set of four α-carbons on an alpha helix is likely to form a short Betti 1 bar around a filtration value of 5 Å. A bar is considered short if its length is less than 0.5 Å, and to be around 5 Å if the distance from its midpoint to 5 Å is less than 0.6 Å.
• Feature 10: Similar to Feature 9, this feature can be used to identify the portion of beta sheets. A detailed discussion of Features 9 and 10 can be found in Ref. 108.
• Feature 11: A strong correlation between the accumulated bar length of Betti 1 and the total energy has been reported.108
• Feature 12: The average length of Betti 1 bars correlates with the average loop size.
• Feature 13: The smallest onset value of a Betti 2 bar that ends after a given value. This feature gives information about the birth and death of cavities in the complex through the filtration.
Each feature was scaled to the interval [0, 1] with linear mapping.
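As an illustration of how a few such features can be read off from barcode intervals, the sketch below computes some Betti 0 and Betti 1 quantities from lists of (birth, death) pairs and applies the linear [0, 1] scaling. The dictionary keys and the interface are our own illustrative labels, not the Table 1 numbering, and bars that persist past the maximal filtration value are simply capped there.

```python
def topological_features(betti0_bars, betti1_bars, max_filtration):
    """Compute a few MTF-style features from (birth, death) barcode intervals.

    Bars that persist beyond the filtration range are capped at max_filtration.
    """
    cap = lambda bars: [(b, min(d, max_filtration)) for b, d in bars]
    b0, b1 = cap(betti0_bars), cap(betti1_bars)
    len0 = sorted((d - b for b, d in b0), reverse=True)
    len1 = [d - b for b, d in b1]
    longest_b1 = max(b1, key=lambda bar: bar[1] - bar[0], default=None)
    return {
        "second_longest_b0": len0[1] if len(len0) > 1 else 0.0,
        "total_b0_length": sum(len0),
        "avg_b0_length": sum(len0) / len(len0) if len0 else 0.0,
        "longest_b1": max(len1, default=0.0),
        "onset_longest_b1": longest_b1[0] if longest_b1 else 0.0,
    }

def minmax_scale(column):
    """Linearly map one feature column onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in column]
```

In an actual pipeline, one such feature vector would be computed per protein and each feature column scaled across all samples before training the SVM.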

III Results
In this section, we validate the proposed idea, examine its accuracy, and explore the utility of the proposed topology based classification and analysis of protein molecules. We consider four different types of problems. In our first case, we study a protein-drug binding problem, namely, the drug inhibition of influenza A virus M2 channels. In our second case, we use MTFs to classify two types of conformations of hemoglobin proteins. Default parameters were used and brute force cross validation was performed for these first two cases due to their relatively small sample sizes. We further consider the classification of three types of protein domains, i.e., all alpha domains, all beta domains and mixed alpha and beta domains. Finally, our method was tested on a problem set, PCB00019, from the Protein Classification Benchmark Collection.94 In the last two cases, a grid search with cross validation on the training sets was performed to optimize the SVM parameters. For the last case, different penalty parameters were applied to overcome the unbalanced data, and an ROC analysis was used to evaluate the results. Data for the M2 channels are all obtained from NMR experiments.88 Data for the hemoglobin structures are all collected from X-ray crystallography. Structure data for the last two test cases are mostly obtained from X-ray crystallography. However, a few structures were determined by NMR techniques and thus have multiple models. In this situation, we select the second structure for each sample in the database.
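The cross validation used throughout this section relies only on a partition of the sample indices into folds, each fold serving once as the test set while the remainder trains the model. A minimal sketch of such a splitter (our own helper, not part of LIBSVM; the seed makes the shuffle reproducible):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Shuffle sample indices and partition them into k folds.

    Returns (train_indices, test_indices) pairs, one per fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    return [(sorted(set(indices) - set(fold)), sorted(fold)) for fold in folds]
```

Every sample appears in exactly one test fold, so averaging the per-fold accuracies gives an unbiased estimate of the classifier's generalization accuracy.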
In this work, we utilize JavaPlex98 to compute MTFs. For the implementation of the support vector machine, LIBSVM is employed.21

III.A Protein-drug binding analysis
Proteins are vital to many processes in cells. In many biological processes, proteins may bind to other molecules. Protein-protein and protein-ligand interactions are of crucial importance to their functions and/or malfunctions. These interactions have been intensively exploited in drug design. Specifically, many drugs bind to target proteins to modify their functions and activities. After binding to another molecule, a protein usually undergoes a structural change at the binding site. In many cases, it may also undergo an allosteric process with a global structural change upon binding. We test our method in distinguishing proteins with a bound drug from proteins without one.
We use the M2 channel, a transmembrane protein found in influenza A virus,87 as an example. The M2 channel equilibrates pH across the viral membrane during cell entry and plays a vital role in viral replication. Therefore, it is used as a target for the anti-influenza drugs amantadine and rimantadine, which bind to the M2 channel pore and thus block proton permeation. The drug binding creates a topological change to the M2 channel in the conventional sense. However, in the present work, it is not the topological change itself, but rather the binding-induced geometric variation of the M2 channel, that is converted into changes in the topological invariants. Such changes are recorded in our MTFs and utilized for protein classification. The structures of chimera channels with and without rimantadine were used for classification. The PDB IDs of the two structures are 2LJC for the channel with the inhibitor and 2LJB for the channel without the inhibitor.88 The structures are shown in Figure 3(a)-(b). Note that the inhibitor itself is not included in our filtration. A total of 15 snapshots from NMR for each structure are used to perform the classification. Due to the small number of instances, default parameters in C-SVC with penalty C = 2 and γ = 1/(number of features) were used. Each time, 10 instances from each class were set as the training set and the rest were set as the testing set. A brute-force cross validation was performed. The average accuracy for the unbound form is 93.91% and the accuracy for the bound form is 98.31%. Due to the small size of the testing set, the AUC value was not calculated in this example.

III.B Discrimination of hemoglobin molecules in relaxed and taut forms
Hemoglobin is the oxygen transport metalloprotein in the red blood cells of most vertebrates. It carries oxygen from the lungs or gills to other organs or parts of the body. Oxygen is released to tissues and used for metabolism. Hemoglobin is also known to carry carbon dioxide in some cases. It exists in two forms, known as the taut (T) form and the relaxed (R) form. Examples of these two forms are shown in Figure 4(a)-(b). The relaxed form has a high oxygen binding affinity, with which hemoglobin can better bind to oxygen in the lungs or gills. The taut form has a low oxygen binding affinity, which helps release oxygen in the rest of the body. Many factors affect the conformational form of hemoglobin, such as the pH value, the concentration of carbon dioxide and the partial pressure in the system. Structurally, the two forms are slightly different. In this test case, we picked 9 structures of hemoglobin in the R form and 10 structures of hemoglobin in the T form from the Protein Data Bank. As the number of instances is relatively small, a brute-force cross validation was performed with the same default parameters as in the last case. Each time, one instance from each class was picked as the test set, leaving the rest as the training set. The average accuracy of the prediction for the test set is 84.50%. The average accuracy for the R form is 77.16% and the average accuracy for the T form is 91.11%. Since the test set is small, ROC analysis was not applied in this case.

III.C The classification of all alpha, all beta, and mixed alpha and beta protein domains
Protein secondary structures are three dimensional patterns of local protein segments. Common secondary structures include alpha helices and beta sheets. These local structures are formed by hydrogen bonds between amine hydrogen and carbonyl oxygen atoms in the backbone of a protein. Typically, secondary structures can be identified from amino acid sequence data. In this test example, we use only geometric data, without sequence information, to generate MTFs and then classify alpha helices and beta sheets. Instances for this example were taken from the SCOPe (Structural Classification of Proteins-extended) database.50 The SCOPe IDs (SIDs) of the samples used in this test case are listed in Tables 3, 4, and 5.
In this test case, protein domains were separated into three classes, namely, all alpha helix domains, all beta domains, and mixed alpha and beta domains. Examples for each of the three classes are shown in Figures 5(a)-(c) and their barcode plots are shown in Figures 5(d)-(i). For each class in SCOPe, 300 structures from different superfamilies were used for classification. Among the 900 instances, 60 from each class were used as the test set and the rest were used as the training set. A 5-fold cross validation was performed to test accuracy. In each training process, a 5-fold cross validation within the training set was carried out to optimize training parameters. The overall accuracy is 84.93%. Specifically, the accuracy is 90.67% for all alpha helix domains, 78.77% for mixed alpha and beta domains, and 83.31% for all beta domains.
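The nested cross validation procedure described above, with an inner 5-fold search for training parameters and an outer split for scoring, can be sketched as follows. This is a hypothetical illustration using scikit-learn, not the authors' code; the grid values and the random vectors standing in for the three domain classes' MTF feature vectors are our own assumptions.

```python
# Sketch of nested cross validation: an inner 5-fold grid search over
# (C, gamma) optimizes the training parameters, while an outer 5-fold split
# measures accuracy. The data are synthetic stand-ins for MTF vectors.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Three classes (all alpha, all beta, alpha-beta stand-ins), 60 samples each,
# 13 features, with class means shifted so the classes are separable.
X = np.vstack([rng.normal(i, 1.0, size=(60, 13)) for i in range(3)])
y = np.repeat([0, 1, 2], 60)

param_grid = {"C": [0.5, 2.0, 8.0], "gamma": [0.01, 1.0 / 13, 0.5]}
inner = GridSearchCV(SVC(), param_grid, cv=5)        # parameter optimization
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer)      # outer accuracy estimate
print(scores.mean())
```

The inner search sees only the training folds of each outer split, so the reported accuracy is not biased by the parameter selection.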

III.D Classification of protein superfamilies
Figure 6: The ROC curves corresponding to the 55 tasks. The plot was generated using LIBSVM tools. 21

A protein superfamily is the largest collection of proteins for which a common ancestor can be traced. Within a superfamily, similarity between amino acid sequences may not be easily observed. Therefore, a superfamily can be further divided into several families, within which similarity among amino acid sequences usually can be identified. Members of a protein superfamily share a similar structure with the common ancestor even though they may not have similar sequences. In this case, based on structure information, we test our method in the classification of protein superfamilies. The samples in this test case were taken from the Protein Classification Benchmark Collection. 94 The problem used in our test has the accession number PCB00019. The goal of this data set is to classify protein domain sequences and structures into protein superfamilies, based on protein families. It contains 1357 samples and 55 classification tasks. Detailed descriptions and classification results using different scoring methods and various classification methods can be found on the Protein Classification Benchmark Collection website. In this test, we utilize only the structure information of the α-carbons in protein backbones. For each task, we perform a 5-fold cross validation on the training set to search for reasonable parameters. In most tasks, the numbers of positive and negative instances are unbalanced. To prevent unbalanced training results, different values of the penalty parameter are used for the two classes. Specifically, the ratio between the positive penalty parameter and the negative penalty parameter is set to equal the ratio between the number of negative instances and the number of positive instances. The average accuracies for the positive and negative testing sets are 82.29% and 80.94%, respectively. The average AUC value for the 55 tasks is 0.8954, with a standard deviation of 0.09.
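The unbalanced-penalty scheme and AUC evaluation described above can be sketched as follows. This is an illustrative example, not the authors' code: in scikit-learn, per-class weights multiply the penalty C, so setting the positive class weight to n_negative/n_positive reproduces the stated penalty ratio. The data are synthetic stand-ins, and the AUC is computed on the training set purely to show the API.

```python
# Sketch of class-weighted C-SVC for an unbalanced task: the effective
# penalties satisfy C_positive / C_negative = n_neg / n_pos, and the ROC AUC
# is computed from the SVM decision values. Synthetic stand-in data.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_pos, n_neg = 20, 180                     # unbalanced, as in most tasks
X = np.vstack([rng.normal(1.5, 1.0, size=(n_pos, 13)),
               rng.normal(0.0, 1.0, size=(n_neg, 13))])
y = np.array([1] * n_pos + [0] * n_neg)

# class_weight scales C per class: C_+ / C_- = n_neg / n_pos.
clf = SVC(C=2.0, class_weight={1: n_neg / n_pos, 0: 1.0})
clf.fit(X, y)
auc = roc_auc_score(y, clf.decision_function(X))
print(auc)
```

In practice the AUC would be computed on a held-out test set rather than the training data shown here.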
Figure 6 shows the ROC curves for the 55 tasks.

IV Discussion and Conclusion
Persistent homology is a unique tool in computational topology and computational geometry. It explores a topological space by studying the evolution of a simplicial complex over a filtration process of a given data set. A nested sequence of subsets is obtained by continuously increasing the filtration parameter. During the filtration process, the birth and death of topological invariants are recorded. The lifespan of a topological invariant shows how significant it is geometrically. Persistent homology is capable of discovering the underlying topological features of the space of interest and recognizing topologically small events. In other words, it gives not only information about global and significant topological features, but also a perspective on local features of the underlying space. Persistent homology has been applied to computer graphics, geometric modeling, data analysis, and many other fields. A protein structure can be represented as a point cloud in three dimensions for its atoms, or as a graph with edges corresponding to different types of chemical bonds. This geometric nature of protein structures allows the application of persistent homology. In this work, we introduce the use of protein topological features captured by persistent homology for protein classification. Our goal is to illustrate that molecular topological fingerprints (MTFs) can describe the structure of a protein from different perspectives and at different scales. This property of MTFs makes it possible to use them in protein classification from the topological point of view. We examine the performance of MTFs in several protein classification tasks with different emphases. We show that MTFs are a potential option for protein classification.
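The birth-death bookkeeping of the filtration can be made concrete in dimension zero, where persistence reduces to tracking when connected components merge. The following minimal sketch, not the software used in this work, computes the Betti 0 bars of a Vietoris-Rips filtration with a union-find over edges sorted by length.

```python
# Minimal dimension-0 persistence: every point is a component born at 0;
# as the filtration radius grows, each merge of two components kills a
# Betti 0 bar at the merging edge length. One bar persists forever.
import itertools
import math

def betti0_barcode(points):
    """Return Betti 0 bars (birth, death) of a Vietoris-Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Process edges in increasing order of length, as in the filtration.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, d))          # one component dies at scale d
    bars.append((0.0, math.inf))           # the surviving component
    return bars

# Three nearby points and one outlier: two short bars, one long bar, one
# infinite bar.
bars = betti0_barcode([(0, 0), (1, 0), (0, 1), (10, 10)])
print(bars)
```

The long third bar records the geometric fact that the outlier joins the main cluster only at a large filtration value, which is exactly the kind of information the MTFs encode.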
To introduce the topological features used in classification, we briefly reviewed the definition of a simplex and different types of simplicial complexes. Basic concepts of filtration and persistence were recalled. We used the α-carbon atoms of the M2 proton channel of influenza A to illustrate the filtration of a simplicial complex. We also showed the barcode plots for the M2 channel in an all-atom model and an α-carbon model. Comparing these approaches, it can be seen that the all-atom model contains too many details, which drown out useful information such as the Betti 1 bars representing alpha helices. Essentially, at the all-atom scale, different proteins share some common features due to the structures of the amino acids. Using a coarse-grained model with α-carbon atoms reveals more information about the overall structure of the protein and dramatically reduces the spatial complexity and computational time. Therefore, we adopt the coarse-grained model throughout this work. In some physical descriptions of proteins, an all-atom model may nevertheless be preferred.
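Building the coarse-grained point cloud amounts to keeping only the α-carbon coordinates from a structure file. A sketch of this step, using the fixed-column layout of PDB ATOM records (the function name and demo lines are our own illustration, not the authors' code):

```python
# Extract the alpha-carbon (CA) point cloud from PDB-format lines.
# PDB ATOM records are fixed-column: atom name in columns 13-16,
# x/y/z coordinates in columns 31-38, 39-46, and 47-54.
def alpha_carbons(pdb_lines):
    points = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            points.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return points

demo = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
]
print(alpha_carbons(demo))  # only the CA record survives
```

The resulting list of 3D points is what the Vietoris-Rips filtration is built on.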
In persistent homology, the convention is to cherish long-persistent topological features, which appear as long-lived bars in a barcode representation, whereas short-lived bars are typically discarded as noise. In our case, the MTFs of proteins carry both global features and local traits, and for protein analysis both are equally important. In other words, it takes both long-lived topological features and short-lived topological traits to effectively characterize different proteins. A fundamental reason is that biomolecular structure, function, dynamics, and transport are governed by interactions over a wide range of scales, which lead to multiple characteristic length scales ranging from covalent bond, residue, secondary structure, and domain dimensions to protein sizes. Based on our understanding of protein characteristic length scales, 109 we are able to identify the corresponding protein topological fingerprints and determine their relevance and importance in protein classification.
To apply MTFs to the analysis of large scale biomolecular data, we have developed a persistent homology based machine learning method. Essentially, we construct feature vectors from MTFs. We utilize the support vector machine (SVM) algorithm, which is known for its robustness and high accuracy, in our study. The resulting MTF-SVM classifier is validated with four test cases. First, we explore the performance of the present MTF-SVM classifier in distinguishing drug-bound M2 channels of influenza A virus from native M2 channels. It is found that the proposed method does an excellent job in analyzing viral drug inhibition: a 96% prediction accuracy is recorded. In our second test, we consider the discrimination of hemoglobin molecules in their relaxed and taut forms. Again, the present approach works very well (80% accuracy) for this problem. We further employ our MTF-SVM classifier for the identification of all alpha, all beta, and alpha-beta protein domains. A total of 900 proteins are used in our study. Due to the relatively large sample size, a 5-fold cross validation was carried out to optimize training parameters and validate the present method. In this study, the detailed local topological features facilitate the classification of proteins with different secondary structures. An average accuracy of 85% is found over the three protein classes. Finally, we utilize the present method for the classification of protein superfamilies. We adopt a standard test, accession number PCB00019, from the Protein Classification Benchmark Collection. 94 It involves 1357 samples and 55 classification tasks. A combination of both local and global topological features enables us to separate protein superfamilies. Based on 5-fold cross validation, an average classification accuracy of 82% is found.
The objective of the present work is to examine the utility, accuracy, and efficiency of computational topology for protein classification. As such, only topological information is employed. The extensive test study establishes topology as an independent and valuable option for large scale protein classification. Obviously, the present method can be improved in a variety of ways. Specifically, one can combine topological features with other, more established features, namely, sequence features and physical features, for protein analysis and classification. Indeed, MTFs computed from persistent homology differ sharply from sequence based and physical based features. Therefore, a combination of topological, sequence, and physical features should be able to take advantage of all three classes of methods. This aspect is beyond the scope of the present work and will be explored in our future research.
In our earlier work, we have introduced computational topology for mathematical modeling and prediction, such as molecular stability prediction, 108 protein folding analysis, 112 and protein bond length prediction. 109 The present work indicates that the combination of machine learning and computational topology will create a new and powerful approach to topology based mathematical modeling and prediction.
Appendix: Instances used in Section III.C

In this appendix we list the protein SCOPe IDs used in Section III.C.

Figure 2: Barcode plots of persistent homology calculated for the α-carbon and all-atom point clouds of the M2 chimera channel of influenza A virus, based on the Vietoris-Rips complex. (a) and (b) are respectively the Betti 0 and Betti 1 barcode plots for the α-carbon point cloud. (c) and (d) are respectively the Betti 0 and Betti 1 barcode plots for the all-atom point cloud.

II.B Support vector machine

Basic theory

SVM is a machine learning method that can be applied to classification and regression problems. It computes a hyperplane that maximizes the margin between the positive and negative training sets. In this work, Classification SVM Type 1, also known as C-support vector classification (C-SVC), 35 is used. For a classification problem with predetermined classes, a classifier is trained on a data set containing descriptions of samples and their classes, and it then predicts the class of a new observation. The input for SVM is a set of samples. Each sample has a feature vector that describes the properties of the sample and a label that indicates to which class the sample belongs. Given the training set as input, SVM generates a hyperplane, in the feature space or in a higher dimensional space depending on the kernel used, that separates the classes. For a two-class SVM, it looks for a hyperplane w^T x + b = 0 that separates the classes. The determination of the coefficients w and b reduces to the constrained optimization problem

min over w, b, ξ of (1/2) w^T w + C Σ_i ξ_i, subject to y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i,

where the ξ_i are slack variables and C > 0 is the penalty parameter.
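As a concrete illustration of the separating hyperplane w^T x + b = 0, the following sketch fits a linear two-class C-SVC on four toy points and reads off w and b. It uses scikit-learn's SVC (a wrapper around LIBSVM, which this work cites) and is our own illustration, not the authors' code.

```python
# Fit a linear C-SVC on two separable toy classes and recover the
# hyperplane coefficients w and b of w^T x + b = 0.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# The decision rule is sign(w^T x + b); check it on the training points.
pred = np.sign(X @ w + b)
print(w, b, pred)
```

Because the two classes are linearly separable here, the slack variables vanish at the optimum and the solution is the maximum-margin hyperplane.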

Feature #  Betti #  Description
1          0        The length of the second longest Betti 0 bar.
2          0        The length of the third longest Betti 0 bar.
3          0        The summation of lengths of all Betti 0 bars except for those that exceed the max filtration value.
4          0        The average length of Betti 0 bars except for those that exceed the max filtration value.
5          1        The onset value of the longest Betti 1 bar.
6          1        The length of the longest Betti 1 bar.
7          1        The smallest onset value of a Betti 1 bar that is longer than 1.5 Å.
8          1        The average of the middle point values of all Betti 1 bars that are longer than 1.5 Å.
9          1        The number of Betti 1 bars located in [4.5, 5.5] Å, divided by the number of atoms.
10         1        The number of Betti 1 bars located in [3.5, 4.5) Å and (5.5, 6.5] Å, divided by the number of
11         1        The summation of lengths of all Betti 1 bars except for those exceed the max filtration val
12         1
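A few of the barcode-derived features above can be computed directly from lists of (birth, death) bars. The following sketch is hypothetical: the function name, dictionary keys, and toy bars are our own illustration of the idea, not the authors' implementation, and only a subset of the table's features is shown.

```python
# Compute a subset of the MTF features from Betti 0 and Betti 1 barcodes,
# each given as a list of (birth, death) pairs.
import math

def mtf_features(betti0, betti1, max_filtration):
    # Betti 0 bar lengths, ignoring bars that exceed the max filtration value.
    b0_len = sorted((d - b for b, d in betti0 if d <= max_filtration),
                    reverse=True)
    # The longest Betti 1 bar, and Betti 1 bars longer than 1.5 Angstrom.
    b1_longest = max(betti1, key=lambda bar: bar[1] - bar[0])
    long_b1 = [(b, d) for b, d in betti1 if d - b > 1.5]
    return {
        "f1_second_longest_b0": b0_len[1],
        "f3_sum_b0": sum(b0_len),
        "f4_avg_b0": sum(b0_len) / len(b0_len),
        "f5_onset_longest_b1": b1_longest[0],
        "f6_len_longest_b1": b1_longest[1] - b1_longest[0],
        "f8_avg_midpoint_long_b1":
            sum((b + d) / 2 for b, d in long_b1) / len(long_b1),
    }

feats = mtf_features(
    betti0=[(0, 1.0), (0, 2.0), (0, 4.0), (0, math.inf)],  # infinite bar excluded
    betti1=[(2.0, 6.0), (3.0, 4.0)],
    max_filtration=10.0,
)
print(feats)
```

Stacking such per-protein values into a fixed-length vector yields the SVM input described in the text.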

Figure 3: Protein structures used in the M2 channel classification. (a) (PDB ID: 2LJB 88 ) M2 channel of influenza A without the inhibitor. (b) (PDB ID: 2LJC 88 ) M2 channel of influenza A with the inhibitor. The small molecule in the graph is shown for illustration and was not used in classification. (c), (d), and (e) are respectively the Betti 0, Betti 1, and Betti 2 barcodes for (a). (f), (g), and (h) are respectively the Betti 0, Betti 1, and Betti 2 barcodes for (b).

Figure 5: Example plots of different protein domains. (a) All alpha protein. (b) Alpha and beta protein. (c) All beta protein. (d) and (g) are respectively example Betti 0 and Betti 1 barcodes for the all alpha protein. (e) and (h) are respectively example Betti 0 and Betti 1 barcodes for the alpha and beta protein. (f) and (i) are respectively example Betti 0 and Betti 1 barcodes for the all beta protein.
where Im and Ker denote the image and kernel, respectively. Elements of Ker ∂_k form the kth cycle group, denoted Z_k = Ker ∂_k, and elements of Im ∂_{k+1} form the kth boundary group, denoted B_k = Im ∂_{k+1}. The kth homology group is defined as the quotient group H_k = Z_k / B_k.

Table 1: A list of features used in the support vector machine feature vectors.

Table 2 lists the PDB IDs used.

Table 2: Protein molecules used for the hemoglobin classification.

Table 3: All alpha proteins.