Pattern recognition approach to classifying CYP 2C19 isoform

Bartosz Krawczyk 1
  • 1 Department of Systems and Computer Networks, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland


This paper presents a pattern recognition approach to classifying quantitative structure-property relationships (QSPR) of the CYP2C19 isoform. QSPR is the correlative computer modelling of the properties of chemical molecules and is widely used in cheminformatics and the pharmaceutical industry. Predicting whether a particular chemical will be metabolized by 2C19 is of primary importance to the pharmaceutical industry. This task poses certain challenges. First, the analyzed data contain significant biological noise. Additionally, the training set is unbalanced, with objects from the negative class outnumbering the positives by a factor of four. The presented solution addresses these problems and also incorporates thorough feature selection to improve the stability of the results. Strong emphasis is placed on outlier detection and proper model validation to achieve the best predictive power.



