Published by De Gruyter, July 5, 2021

Machine learning based disease prediction from genotype data

Nikoletta Katsaouni, Araek Tashkandi, Lena Wiese and Marcel H. Schulz
From the journal Biological Chemistry


Using results from genome-wide association studies to understand complex traits remains a challenge. Here we review how different machine learning (ML) methods can be used to predict phenotype occurrence and severity from genotype data. We discuss common feature encoding schemes and how studies handle the often small number of samples compared to the huge number of variants. We compare which ML methods are being applied, including recent results using deep neural networks. Further, we review the application of methods for feature explanation and interpretation.

Corresponding author: Marcel H. Schulz, Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany; German Center for Cardiovascular Research (DZHK), Partner Site RheinMain, 60590 Frankfurt am Main, Germany; and Cardio-Pulmonary Institute, Goethe University Hospital, Frankfurt am Main, Germany, E-mail:

Funding source: DFG Cluster of Excellence Cardio Pulmonary Institute (CPI)

Award Identifier / Grant number: EXC 2026

Funding source: Alfons und Gertrud Kassel-Stiftung "Center for Data Science and AI"

  1. Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This project is part of the "Center for Data Science and AI" funded by the Alfons und Gertrud Kassel-Stiftung. This work was supported by the DFG Cluster of Excellence Cardio Pulmonary Institute (CPI) [EXC 2026].

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.


Received: 2021-01-10
Accepted: 2021-06-15
Published Online: 2021-07-05
Published in Print: 2021-07-27

© 2021 Walter de Gruyter GmbH, Berlin/Boston