Published by De Gruyter July 5, 2021

Machine learning based disease prediction from genotype data

Nikoletta Katsaouni, Araek Tashkandi, Lena Wiese and Marcel H. Schulz
From the journal Biological Chemistry

Abstract

Using results from genome-wide association studies to understand complex traits remains a challenge. Here we review how genotype data can be used with different machine learning (ML) methods to predict phenotype occurrence and severity. We discuss common feature encoding schemes and how studies handle the often small number of samples compared to the huge number of variants. We compare which ML methods are being applied, including recent results using deep neural networks. Further, we review the application of methods for feature explanation and interpretation.


Corresponding author: Marcel H. Schulz, Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany; German Center for Cardiovascular Research (DZHK), Partner Site RheinMain, 60590 Frankfurt am Main, Germany; and Cardio-Pulmonary Institute, Goethe University Hospital, Frankfurt am Main, Germany, E-mail:

Funding source: DFG Cluster of Excellence Cardio Pulmonary Institute (CPI)

Award Identifier / Grant number: EXC 2026

Funding source: Alfons und Gertrud Kassel-Stiftung "Center for Data Science and AI"

  1. Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This project is part of the "Center for Data Science and AI" funded by the Alfons und Gertrud Kassel-Stiftung. This work was supported by the DFG Cluster of Excellence Cardio Pulmonary Institute (CPI) [EXC 2026].

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Received: 2021-01-10
Accepted: 2021-06-15
Published Online: 2021-07-05
Published in Print: 2021-07-27

© 2021 Walter de Gruyter GmbH, Berlin/Boston